1.3.6.10 Patterns Profiler
The Patterns Profiler analyzes data values in any number of String attributes and assigns them patterns according to the sequence of character types. For example, the value "10 Lowestoft Lane" is assigned a pattern of NN_aaaaaaaaa_aaaa, using the default Pattern Map reference list.
Note:
The default *Base Tokenization map is designed for use with Latin-1 encoded data, as are the alternative *Unicode Base Tokenization and *Unicode Character Pattern maps. If these maps are not suited to the character-encoding of the data, it is possible to create and use a new one to take account of, for example, multi-byte Unicode (hexadecimal) character references.
The profiler then counts the number of times each pattern occurs in each attribute, and presents its results.
Use the Patterns Profiler to uncover the patterns in your data, and to create reference lists of valid and invalid patterns that can be used to validate the data on an ongoing basis, using a Check Pattern processor.
The following tables describe the configuration options:
Configuration | Description |
---|---|
Inputs | Specify any String attributes that you want to analyze for data patterns. |
Options | Specify the following option: |
The default Standard Pattern Map maps characters as follows:
Character Type | Representation in Pattern |
---|---|
Alpha characters (a-z, or A-Z) | a |
Number characters (0-9) | N |
Punctuation characters, such as semi-colons, commas | Represented as they are. |
Control characters (for example, carriage returns) | C |
Space | _ |
Characters that are not recognized by the Character Pattern Map are represented with a question mark (?) in each pattern.
You can use a different Character Pattern Map to map characters as you want, for example to represent unusual letters such as x and z differently from more common letters.
Configuration | Description |
---|---|
Outputs | Describes any data attribute or flag attribute outputs. |
Data Attributes | None. |
Flags | The following flag is output: |
The following table describes the statistics produced by the profiler for each attribute it analyzes:
Statistic | Description |
---|---|
Pattern | The generated pattern for each value. |
Length | The length of each generated pattern; that is, the number of characters in each value. |
Count | The number of records with values in the attribute that matched the pattern. |
% | The percentage of records with values in the attribute that matched the pattern. |
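As a rough illustration of how these four statistics relate to each other, the following Python sketch derives them from a list of already-generated patterns, one per record. The function name `pattern_statistics` and the dictionary keys are assumptions made for this example; they do not correspond to anything in EDQ.

```python
from collections import Counter

def pattern_statistics(patterns):
    """Summarize a list of patterns (one per record) for one attribute."""
    counts = Counter(patterns)
    total = len(patterns)
    rows = []
    for pattern, count in counts.items():
        rows.append({
            "pattern": pattern,
            "length": len(pattern),                      # characters in the pattern
            "count": count,                              # records with this pattern
            "percent": round(100.0 * count / total, 1),  # share of all records
        })
    return rows

for row in pattern_statistics(["NN-aa", "NN-aa", "N-a", "NN-aa"]):
    print(row)
```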
Example
In this example, the Patterns Profiler is used to analyze patterns in all attributes of a table of Customer records. For each attribute, the following type of view is generated:
Pattern | Length | Count | % |
---|---|---|---|
NN-NNNNN-aa | 11 | 1681 | 84.0 |
N-NNNN-aa | 10 | 310 | 15.5 |
aa-NNNNN-aa | 11 | 4 | 0.2 |
NN-NNN-aa | 9 | 2 | <0.1 |
NN-N-aa | 7 | 1 | <0.1 |
NN-NNNNN-Na | 11 | 1 | <0.1 |
[Null] | 10 | 1 | <0.1 |
NN-NNNNN | 9 | 1 | <0.1 |
By sorting the view by the Count column, you can quickly find the most common and least common patterns in the data, enabling you to construct lists of valid and invalid patterns for use in a Check Pattern processor.
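The sorting step can be sketched as follows, using a few rows from the example view. The function `split_by_frequency` and the 1% cut-off `min_percent` are hypothetical choices for this illustration. Frequency alone does not establish validity, so the two lists are only candidates for review before they are saved as reference lists.

```python
def split_by_frequency(rows, min_percent=1.0):
    """Sort rows by count (descending) and split patterns into two lists.

    min_percent is an assumed cut-off, not an EDQ setting.
    """
    ordered = sorted(rows, key=lambda r: r["count"], reverse=True)
    total = sum(r["count"] for r in ordered)
    common, rare = [], []
    for r in ordered:
        share = 100.0 * r["count"] / total
        (common if share >= min_percent else rare).append(r["pattern"])
    return common, rare

# A subset of the rows from the example view above.
rows = [
    {"pattern": "NN-N-aa", "count": 1},
    {"pattern": "NN-NNNNN-aa", "count": 1681},
    {"pattern": "aa-NNNNN-aa", "count": 4},
    {"pattern": "N-NNNN-aa", "count": 310},
    {"pattern": "NN-NNN-aa", "count": 2},
]

common, rare = split_by_frequency(rows)
print("Candidate valid patterns:", common)
print("Candidate invalid patterns:", rare)
```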