Databases search statistics
This is part of the submitted UseCaseList.
Scenario
A description of the scenario that you have in mind.
The user wants to have some dynamically generated statistical information about the data he is interacting with.
Importance
How important do you see this use case as?
Dependencies
What other use cases are affected by the implementation of this one?
- Databases usability / db integration / dating / authentication / cartography / bibliography
Input
Things that the user must/might supply to the system.
- Search performed by user.
Output
Things that the user will receive in response to their request.
- Statistical information on the searched population and answer population.
Difficulties
Areas in which you foresee problems/issues arising.
- Apart from a simple figure giving the number of matches to a request, there are no statistical functionalities implemented in the databases.
- Extent of implementation - It's clear that the user should be able to do statistical analysis of the data we provide him. However, it's not our job to implement complex statistical functionalities, so he should be able to download the figures and process them with his own tools. Some requests might generate huge files (max. 120,000 lines x c. 10 columns). Therefore some basic (most frequently used) statistical functionalities should be implemented.
- Specifications -
- General specifications
- The statistics can be pre-generated for the databases as a whole and updated only if there are changes in the data content. However, a mechanism for updating the statistics has to exist.
- For subsets of the data that result from specific user requests, the statistics have to be generated dynamically, on the fly, anew for each different request.
- The statistics should be available as (1) figures, (2) percentages, (3) diagrams, (4) data files (sets of figures from which the statistics figures were generated and which can serve the user for further analysis).
- Statistics on the databases as a whole
- Number of databases available for request out of the total number of databases that Bernstein can address at the time of the user request.
- Number of watermark files in all databases and in each individual database
- Time distribution of watermark files
- Time distribution of watermark files with combinations of metadata fields as selected by the user. It will show the differences in the content of the files. E.g. there are 210 files from the year 1500, with 153 giving the measurement of the chain lines, 201 providing laid lines density, 145 giving the location where the watermark was used, etc. Note: It can become more complicated (especially to visualize) if the user wants to intersect these data. Say 'all files dated 1500, having information on the chain lines AND laid lines measurements OR locations'.
- Basic functionalities to implement - Mean, range, skew, standard deviation, variance. These measures should be computable on the entire data range or on a subset trimmed of outliers (say, between the 25th and 75th percentiles).
- Statistics on the individual data sets requested by users
- Time distribution of requested watermark files
- Time distribution of requested watermark files with combinations of metadata fields as selected by the user. (See above)
- Basic functionalities to implement - Mean, range, skew, standard deviation, variance. (See above)
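The specifications above could be sketched roughly as follows. This is only an illustration under stated assumptions, not a proposed implementation: the record layout (dictionaries with a `year` key and optional metadata keys such as `chain_line_mm` and `location`), the function names, the use of population (not sample) formulas, and the reading of "between 25% and 75%" as the interquartile range are all hypothetical choices that the use case does not fix.

```python
from math import sqrt
from statistics import mean, pvariance, quantiles


def time_distribution(records, fields):
    """Per year: number of files, and how many of them fill each metadata field.

    `records` is a list of dicts (hypothetical layout); a field counts as
    filled when its value is present and not None.
    """
    dist = {}
    for rec in records:
        year = rec.get("year")
        if year is None:
            continue  # undated files cannot be placed on the time axis
        entry = dist.setdefault(year, {"files": 0, **{f: 0 for f in fields}})
        entry["files"] += 1
        for f in fields:
            if rec.get(f) is not None:
                entry[f] += 1
    return dist


def trim_to_interquartile(values):
    """Keep only the values between the 25th and 75th percentiles (inclusive)."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    return [v for v in values if q1 <= v <= q3]


def summarize(values):
    """Mean, range, variance, standard deviation and skew (population formulas)."""
    n = len(values)
    mu = mean(values)
    var = pvariance(values, mu)
    sd = sqrt(var)
    # Skew as the third central moment divided by the cubed standard deviation.
    third_moment = sum((v - mu) ** 3 for v in values) / n
    skew = third_moment / sd ** 3 if sd > 0 else 0.0
    return {
        "count": n,
        "mean": mu,
        "range": max(values) - min(values),
        "variance": var,
        "std_dev": sd,
        "skew": skew,
    }


if __name__ == "__main__":
    # Made-up sample records, for illustration only.
    sample = [
        {"year": 1500, "chain_line_mm": 24.0, "location": "Basel"},
        {"year": 1500, "chain_line_mm": None, "location": "Basel"},
        {"year": 1501, "chain_line_mm": 26.0, "location": None},
    ]
    print(time_distribution(sample, ["chain_line_mm", "location"]))
    measurements = [1, 2, 3, 4, 5, 6, 7, 8, 9]
    print(summarize(measurements))
    print(summarize(trim_to_interquartile(measurements)))
```

The same functions would serve both cases described above: run once over a whole database and cached until the data content changes, or run on the fly over the result set of an individual request. The dictionaries they return could feed the figures, percentages, diagrams and downloadable data files alike.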
Example
An example supporting this use case.
Other Information
Any other information that you think is important to include.
Comments
Comments from other partners regarding the use case.
--
VladAtanasiu - 13 Sep 2006