1.3.6.11 Quickstats Profiler
The Quickstats Profiler provides fundamental quality metrics for a number of records or transactions, highlighting:
-
Candidate key columns
-
Completeness and missing data
-
Duplication
-
Uniqueness and diversity of values
Each input attribute is profiled individually.
Quickstats is useful to establish a picture of some of the fundamentals of data and its quality.
Often documentation and meta-data information are missing, incomplete, out of date, or not trusted. It is important to produce an unequivocal picture of the data, from the data itself, so that mistakes are not made inadvertently through false assumptions.
The following table describes the configuration options:
Configuration | Description |
---|---|
Inputs |
Specify any attributes from which you want to obtain quick profiling statistics. |
Options |
None. |
Outputs |
Describes any data attribute or flag attribute outputs. |
Data Attributes |
None. |
Flags |
The following flag is output:
|
The Quickstats Profiler requires a batch of records to produce its statistics (for example, in order to tell how many duplicate values there are for each attribute analyzed). It must therefore run to completion before its results are available, and is not suitable for a process that requires a real time response.
When executed against a batch of transactions from a real time data source, it will finish its processing when the commit point (transaction or time limit) configured on the Read Processor is reached.
The following table describes the statistics produced by the profiler for each attribute:
Statistic | Description |
---|---|
With data |
The number of records with data in that attribute. |
Without data |
The number of records without data in that attribute. This includes records that had a NULL value, and those that contained other types of No Data, such as only white space or non-printing characters. Drill down on the number to see a breakdown of the types of No Data found. |
Singletons |
The number of records with values that were found only once in that attribute. |
Duplicates |
The number of records with values that were found more than once in that attribute. |
Distinct |
he number of different values that were found in the attribute. Drill down on the number to see a breakdown of these values by their frequency of occurrence. |
Comments |
Automated comments based on the findings of the Quickstats profiler. See below. |
Clicking on the Additional Information button will show the above statistics as percentages of the total number of records analyzed.
Automated Comments
Automated Comments are generated in order to highlight potential areas of interest in the data. For example:
-
Where an attribute is 100% complete and unique, it is identified as a possible key
-
Where an attribute is nearly 100% complete and unique it is highlighted as a possibly damaged key
-
Where an attribute is nearly 100% complete (suggesting blanks are not expected), the comment prompts the user to investigate nulls
-
Where an attribute is nearly 100% unique (suggesting duplicates are not expected), the comment prompts the user to investigate duplicates
-
Where an attribute has only one distinct value, the comment suggests that the attribute may be redundant
Note that where many of the above comments apply, the comments are concatenated.
Example
In this example, the Quickstats Profiler is used to gain an initial overview of a table of Customer records.
Table 1-123 Quickstats Profiler Example
Input Field | Record Total | With Data | Without Data | Singletons | Duplications | Distinct Values |
---|---|---|---|---|---|---|
CU_NO |
2001 |
2000 |
1 |
1997 |
3 |
1998 |
CU_ACCOUNT |
2001 |
2000 |
1 |
2000 |
0 |
2000 |
TITLE |
2001 |
1862 |
139 |
3 |
1859 |
8 |
NAME |
2001 |
2000 |
1 |
1980 |
20 |
1990 |
GENDER |
2001 |
1853 |
148 |
0 |
1853 |
2 |
BUSINESS |
2001 |
1670 |
331 |
1629 |
41 |
1649 |
ADDRESS1 |
2001 |
1999 |
2 |
1926 |
73 |
1954 |
ADDRESS2 |
2001 |
1921 |
80 |
554 |
1367 |
839 |
ADDRESS3 |
2001 |
1032 |
969 |
278 |
754 |
379 |
POSTCODE |
2001 |
1762 |
239 |
1604 |
158 |
1672 |
AREA_CODE |
2001 |
1884 |
117 |
64 |
1820 |
270 |
TEL_NO |
2001 |
1994 |
7 |
1875 |
119 |
1934 |
|
2001 |
1936 |
65 |
1904 |
32 |
1920 |
ACC_MGR |
2001 |
1996 |
5 |
0 |
1996 |
30 |
DT_PURCHASED |
2001 |
1998 |
3 |
1090 |
908 |
1499 |
DT_ACC_OPEN |
2001 |
1998 |
3 |
1093 |
905 |
1500 |
DT_LAST_PAYMENT |
2001 |
1997 |
4 |
1026 |
971 |
1425 |
DT_LAST_PO_RAISED |
2001 |
1998 |
3 |
1003 |
995 |
1433 |
BALANCE |
2001 |
1999 |
2 |
7 |
1992 |
10 |
In most cases, drilling down on the numbers in the Summary View will take you directly to the records. However, some numbers take you to an interim view.
-
If you drill down on the 41 duplicate BUSINESS values in the Summary View, EDQ shows the frequency of each duplicate value.
-
If you drill down on the 8 distinct TITLE values in the Summary View, EDQ shows the frequency of each distinct value.
-
If you drill down on the 239 POSTCODE values without any data in the Summary View, EDQ shows a summary view of the different types of No Data found (though note that all of these will be Null values if the default No Data Handling Reference Data map is used in the Snapshot).