1.3.3.8.13 Comparison: Longest Common Substring
The Longest Common Substring comparison compares two String or String Array values and assesses whether they might match by finding the length of the longest sequence of characters (substring) that is common to both values, whether that substring is the whole of a String value or only part of it.
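As an illustration of the idea only, the following minimal Python sketch finds the longest common substring of two strings with a standard dynamic-programming approach. The function name `longest_common_substring` is hypothetical, and nothing here is taken from the product's own implementation, which is not described in this section.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return the longest run of consecutive characters found in both a and b."""
    best_len = 0   # length of the longest common run seen so far
    best_end = 0   # end index (exclusive) of that run within a
    # prev[j] is the length of the common run ending at a[i-2] and b[j-1]
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the run that ended on the previous character pair
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_end = i
        prev = curr
    return a[best_end - best_len:best_end]


match = longest_common_substring("JillLewis", "BillLewis")
print(match, len(match))  # illLewis 8
```

The length of the returned substring corresponds to the numeric result that the comparison reports.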
Use the Longest Common Substring comparison to find matches between String values where there may be 'noise' at either the beginning or the end of the String that is difficult to remove by stripping words before the comparison. It is also useful where you know that String values sharing a common sequence of characters over a certain length are likely to be related. For example, it can match "Nomura Securities Co., Ltd." with "Nomura Investor Relations Co., Ltd.", which share the 6-character substring "Nomura".
The Longest Common Substring comparison is often used in match rules that are low down in the decision table in order to find and review possible matches that have similarity but which have failed to match using other rules, perhaps due to ordering issues, or due to excess 'noise'.
This comparison supports the use of result bands.
The following table describes the configuration options:
Option | Type | Description | Default Value |
---|---|---|---|
Match No Data pairs? | Yes/No | Determines the result of the comparison when it compares two No Data values (Null, or containing only whitespace characters) for an identifier. If set to No, the comparison gives a 'no data' result when comparing a No Data value against another No Data value. If set to Yes, the comparison gives a result of 0 when comparing a No Data value against another No Data value; in that case, a 'no data' result is returned only when a No Data value is compared against a populated value. | No |
Ignore case? | Yes/No | Sets whether or not to ignore case when comparing values. | Yes |
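The interaction of the two options can be sketched as follows. This is a hedged illustration, not the product's code: the function and parameter names (`compare`, `match_no_data_pairs`, `ignore_case`) are hypothetical, `None` is used here to stand for a 'no data' result, and the sketch assumes that a No Data value compared against a populated value gives 'no data' under either setting. The longest match is found with `difflib.SequenceMatcher` from the Python standard library as a convenient stand-in.

```python
from difflib import SequenceMatcher
from typing import Optional


def is_no_data(value: Optional[str]) -> bool:
    """No Data means Null (None here) or only whitespace characters."""
    return value is None or value.strip() == ""


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common substring of a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size


def compare(a: Optional[str], b: Optional[str],
            match_no_data_pairs: bool = False,
            ignore_case: bool = True) -> Optional[int]:
    """Return the comparison result, or None to represent 'no data'."""
    a_empty, b_empty = is_no_data(a), is_no_data(b)
    if a_empty and b_empty:
        # Two No Data values: result 0 if the pair may match, else 'no data'
        return 0 if match_no_data_pairs else None
    if a_empty or b_empty:
        # Assumption: one No Data value against a populated value is 'no data'
        return None
    if ignore_case:
        a, b = a.lower(), b.lower()
    return lcs_length(a, b)


print(compare(None, "   "))                            # None ('no data')
print(compare(None, "   ", match_no_data_pairs=True))  # 0
print(compare("LEWIS", "lewis"))                       # 5
print(compare("LEWIS", "lewis", ignore_case=False))    # 0
```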
Example
In this example, the Longest Common Substring comparison is used to identify possible matches in customer names.
The following options are specified:
- Match No Data pairs? = No
- Ignore case? = Yes
A Trim Whitespace transformation is used to remove all whitespace from values before they are compared.
Example results
With the above configuration, the following table illustrates some comparison results:
Table 1-50 Example Results: Longest Common Substring
Value A | Value B | Comparison Result |
---|---|---|
Jill Lewis | Jill Lewis-Thompson | 9 |
Jill Lewis | Bill Lewis | 8 |
Jill Lewis | Jill Lonerghan | 5 |
Michael Davis **DO NOT CALL** | Michael Davis | 12 |
Tom Featherstone ----DECEASED---- | Thomas David Featherstone | 12 |
Tom Featherstone | John Feathers | 8 |
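The results in the table can be reproduced with a short Python sketch. It makes two assumptions: the whitespace-removing transformation is approximated by deleting every whitespace character with a regular expression, and Ignore case? = Yes is approximated by lowercasing both values. The names `normalize`, `lcs_length`, and `EXAMPLES` are hypothetical, and `difflib.SequenceMatcher` is used only as a stand-in for whatever method the product applies internally.

```python
import re
from difflib import SequenceMatcher


def normalize(value: str) -> str:
    """Remove all whitespace and lowercase (assumed equivalent of the example setup)."""
    return re.sub(r"\s+", "", value).lower()


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common substring of a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size


# (Value A, Value B, result listed in the table above)
EXAMPLES = [
    ("Jill Lewis", "Jill Lewis-Thompson", 9),
    ("Jill Lewis", "Bill Lewis", 8),
    ("Jill Lewis", "Jill Lonerghan", 5),
    ("Michael Davis **DO NOT CALL**", "Michael Davis", 12),
    ("Tom Featherstone ----DECEASED----", "Thomas David Featherstone", 12),
    ("Tom Featherstone", "John Feathers", 8),
]

for value_a, value_b, expected in EXAMPLES:
    result = lcs_length(normalize(value_a), normalize(value_b))
    print(f"{value_a!r} vs {value_b!r}: {result} (table: {expected})")
```

Because whitespace is removed before comparison, the common run can span what was originally a word boundary, which is why "Jill Lewis" and "Jill Lonerghan" score 5 ("JillL") rather than 4.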