1.3.6.10 Patterns Profiler

The Patterns Profiler analyzes data values in any number of String attributes and assigns each value a pattern according to its sequence of character types. For example, the value "10 Lowestoft Lane" is assigned the pattern NN_aaaaaaaaa_aaaa when the default Character Pattern Map is used.

Note:

The default *Base Tokenization map is designed for use with Latin-1 encoded data, as are the alternative *Unicode Base Tokenization and *Unicode Character Pattern maps. If these maps do not suit the character encoding of your data, you can create and use a new map that takes account of, for example, multi-byte Unicode (hexadecimal) character references.

The profiler then counts how many times each pattern occurs in each attribute and presents the results.

Use the Patterns Profiler to uncover the patterns in your data, and to create reference lists of valid and invalid patterns that can be used to validate the data on an ongoing basis, using a Check Pattern processor.

The following tables describe the configuration options:

Configuration	Description
Inputs	Specify any String attributes that you want to analyze for data patterns.
Options	Specify the following option: `Character Pattern Map`: maps each character to a pattern character. Specified as Reference Data (Pattern Generation Category). Default value is *Character Pattern Map.

The default *Character Pattern Map maps characters as follows:

Character Type	Representation in Pattern
Alpha characters (a-z, or A-Z)	a
Number characters (0-9)	N
Punctuation characters (for example, semicolons and commas)	Represented as themselves.
Control characters (for example, carriage returns)	C
Space	_

Characters that are not recognized by the Character Pattern Map are represented with a question mark (?) in each pattern.

You can use a different Character Pattern Map to map characters as required, for example to represent unusual letters such as x and z differently from more common letters.
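The mapping in the table above can be illustrated with a minimal Python sketch. It is not the product's implementation: the function names `char_pattern` and `to_pattern` are hypothetical, the use of Unicode categories to detect punctuation and control characters is an assumption, and treating symbol characters (such as $ or +) like punctuation is also an assumption.

```python
import unicodedata


def char_pattern(ch: str) -> str:
    """Map one character to its pattern character (illustrative only)."""
    if ch == " ":
        return "_"                       # space
    if "a" <= ch <= "z" or "A" <= ch <= "Z":
        return "a"                       # alpha characters
    if "0" <= ch <= "9":
        return "N"                       # number characters
    category = unicodedata.category(ch)
    if category == "Cc":
        return "C"                       # control characters, e.g. carriage return
    if category.startswith(("P", "S")):
        return ch                        # punctuation (and, by assumption, symbols) as themselves
    return "?"                           # anything the map does not recognize


def to_pattern(value: str) -> str:
    """Build the pattern for a whole value, one character at a time."""
    return "".join(char_pattern(ch) for ch in value)


print(to_pattern("10 Lowestoft Lane"))   # NN_aaaaaaaaa_aaaa
```

A custom map would correspond to changing the rules in `char_pattern`, for example returning a distinct pattern character for x and z.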

Configuration	Description
Outputs	Describes any data attribute or flag attribute outputs.
Data Attributes	None.
Flags	The following flag is output: `[Attribute name].Pattern`: indicates the pattern of the attribute. Possible values are the patterns generated using the Character Pattern Map reference data.

The following table describes the statistics produced by the profiler for each attribute it analyzes:

Statistic	Description
Pattern	The generated pattern for each value.
Length	The length of each generated pattern; that is, the number of characters in each value.
Count	The number of records with values in the attribute that matched the pattern.
%	The percentage of records with values in the attribute that matched the pattern.
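As a rough illustration of how these statistics relate to one another, the following Python sketch computes them from a list of already-generated patterns for one attribute. The function name `profile_patterns` and the dictionary keys are hypothetical, and rounding the percentage to one decimal place is an assumption; the product may count and round differently.

```python
from collections import Counter


def profile_patterns(patterns):
    """Return one row per distinct pattern, most frequent first."""
    counts = Counter(patterns)
    total = len(patterns)
    rows = []
    for pattern, count in counts.most_common():
        rows.append({
            "pattern": pattern,
            "length": len(pattern),                      # characters in the pattern
            "count": count,                              # records with this pattern
            "percent": round(100 * count / total, 1),    # share of all records
        })
    return rows


for row in profile_patterns(["NN-aa", "NN-aa", "NN-aa", "N-aa"]):
    print(row)
```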

Example

In this example, the Patterns Profiler is used to analyze patterns in all attributes of a table of Customer records. For each attribute, the following type of view is generated:

Pattern	Length	Count	%
NN-NNNNN-aa	11	1681	84.0
N-NNNN-aa	9	310	15.5
aa-NNNNN-aa	11	4	0.2
NN-NNN-aa	9	2	<0.1
NN-N-aa	7	1	<0.1
NN-NNNNN-Na	11	1	<0.1
[Null]	0	1	<0.1
NN-NNNNN	8	1	<0.1

By sorting the view by the Count column, you can quickly find the most common and least common patterns in the data. You can then construct lists of valid and invalid patterns for use in a Check Pattern processor.
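The idea of validating against such lists can be sketched in Python as follows. This is an illustration only, not the Check Pattern processor's actual logic: the function name `classify_pattern`, the three result labels, and the choice of which example patterns count as valid or invalid are all assumptions made for the sketch.

```python
# Hypothetical lists, chosen by reviewing the profiler results above.
VALID_PATTERNS = {"NN-NNNNN-aa"}
INVALID_PATTERNS = {"NN-NNNNN", "NN-NNNNN-Na"}


def classify_pattern(pattern, valid_patterns, invalid_patterns):
    """Classify a generated pattern against the two reference lists."""
    if pattern in valid_patterns:
        return "valid"
    if pattern in invalid_patterns:
        return "invalid"
    return "unknown"    # in neither list; needs review


for p in ["NN-NNNNN-aa", "NN-NNNNN", "aa-NNNNN-aa"]:
    print(p, classify_pattern(p, VALID_PATTERNS, INVALID_PATTERNS))
```

Patterns that fall into neither list are reported separately here so that new, unreviewed patterns are not silently accepted or rejected.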