1.3.3.8.16 Comparison: Longest Common Substring Sum Percentage
The Longest Common Substring Sum Percentage comparison offers a powerful way of determining the similarity between two String/String Array values, particularly where those values contain long strings of characters, or many words.
The Longest Common Substring Sum Percentage (LCSSP) calculates the Longest Common Substring Sum between two string values, and then relates it to the number of characters in either the longer or the shorter string being compared.
The Longest Common Substring Sum Percentage comparison is particularly useful when matching multi-word text strings where both word order and whitespace differences exist, and where you want to consider the similarity of the strings in proportion to their length.
This can happen, for example, when matching Asian names from different sources: the sources may not represent names in the same order, and whitespace may differ because of transliteration differences or typos. Note that whitespace differences weaken the results of word matching comparisons (such as Word Match Percentage), as these rely on words being consistently separated.
For example, consider the following names:
Mary Elizabeth Angus
Mary Elizabeth Francis
Mary Elizabeth
Xiaojian Zhong
ZHONG Xiao Jian
The last two names are a strong match, even though they are different in both word order and spacing. They will not have a strong Word Match Percentage result. They have a strong Longest Common Substring Sum result, but so do the first two names, and these are not a strong match.
Longest Common Substring Sum Percentage offers a way of considering the total length of common substrings between two values and relating that to the total number of characters being considered.
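As a rough sketch of that relationship (the symbols are notation introduced here for illustration, not taken from the product), write $\mathrm{LCSS}(A,B)$ for the summed length of the qualifying common substrings, and $|A|$ and $|B|$ for the character counts of the two values after any transformations have been applied:

$$
\mathrm{LCSSP}(A,B) =
\begin{cases}
100 \times \dfrac{\mathrm{LCSS}(A,B)}{\max(|A|,|B|)} & \text{when relating to the longer value} \\[2ex]
100 \times \dfrac{\mathrm{LCSS}(A,B)}{\min(|A|,|B|)} & \text{when relating to the shorter value}
\end{cases}
$$

For instance, assuming whitespace is removed and case is ignored, "Mary Elizabeth Angus" (18 characters) and "Mary Elizabeth Francis" (20 characters) share the 13-character substring "maryelizabeth" and no other common substring longer than 2 characters. Relative to the longer value this gives $100 \times 13 / 20 = 65$.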
This comparison supports the use of result bands.
The following table describes the configuration options:
Option | Type | Description | Default Value |
---|---|---|---|
Match No Data pairs? | Yes/No | This option determines the result of a comparison when it compares two No Data (Null, or containing only whitespace characters) values for an identifier. If set to No, the comparison will give a 'no data' result when comparing a No Data value against another No Data value. If set to Yes, the comparison will give a result of 0 when comparing a No Data value against another No Data value. In that case, a 'no data' result is returned only when a No Data value is compared against a populated value. | No |
Ignore case? | Yes/No | Sets whether or not to ignore case when comparing values. | Yes |
Include substrings greater than length | Number | Common substrings between the two values being compared must be longer than the specified value to contribute to the overall Longest Common Substring Sum score. If set to 3, distinct (non-overlapping) substrings of 4 or more characters that are common between two values will be included in the LCSS calculation. For example, the values "Acme Micros Ltd Serv" and "Acme and Partners Micro Services Ltd" would give an LCSS of 9, assuming whitespace is trimmed before comparing. This would be calculated as 4 characters for the common substring "Acme", and 5 characters for the common substring "Micro". Note that the common substring "Ltd" would not be included in the calculation as its length is not greater than 3 characters. | 4 |
Relate to shorter input? | Yes/No | Sets whether to relate the Longest Common Substring Sum to the shorter or the longer of the two strings being compared. Relating to the shorter input allows for looser matching: a pair scores highly when the majority of substrings in the shorter string are also found in the longer string, even if the longer string contains extra data. | No |
Example
In this example, the Longest Common Substring Sum Percentage comparison is used when comparing full names.
The following options are specified:
- Match No Data pairs? = No
- Ignore case? = Yes
- Include substrings greater than length = 3
- Relate to shorter input? = No
A Trim Whitespace transformation is used to remove all whitespace from values before they are compared.
Example results
With the above configuration, the following table illustrates some comparison results:
Table 1-53 Example Results: Longest Common Substring Sum Percentage
Value A | Value B | Comparison Result |
---|---|---|
Mary Elizabeth Angus | Mary Elizabeth Francis | 65 |
Xiaojian Zhong | ZHONG Xiao Jian | 100 |
Mary Elizabeth Angus | Mary Elizabeth | 72 |
Tan Tan WONG | WONG Tantan | 100 |
James Patrick Robinson | Robin Patrick Jameson | 85 |
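The results above can be approximated with a short Python sketch. This is an illustration only, not the product's implementation: the function names `longest_common_substring`, `lcss` and `lcssp` are hypothetical, and the sketch assumes a greedy strategy (repeatedly take the longest remaining common substring), rounding to the nearest whole number, and input values that do not contain the control characters used as placeholders.

```python
def longest_common_substring(a, b):
    """Return (start_in_a, start_in_b, length) of the longest common substring."""
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ends at a[i-2], b[j-2].
                cur[j] = prev[j - 1] + 1
                if cur[j] > best[2]:
                    best = (i - cur[j], j - cur[j], cur[j])
        prev = cur
    return best


def lcss(a, b, min_len=3):
    """Sum the lengths of non-overlapping common substrings longer than min_len.

    Assumption: greedy selection, longest substring first.
    """
    total = 0
    while True:
        i, j, n = longest_common_substring(a, b)
        if n <= min_len:
            return total
        total += n
        # Replace the matched spans with different placeholders so they
        # cannot match again and cannot join their neighbours together.
        a = a[:i] + "\x00" + a[i + n:]
        b = b[:j] + "\x01" + b[j + n:]


def lcssp(a, b, min_len=3, ignore_case=True, relate_to_shorter=False):
    """Return the LCSS as a whole-number percentage, or None for No Data."""
    # Stand-in for the Trim Whitespace transformation: remove all whitespace.
    a, b = "".join(a.split()), "".join(b.split())
    if ignore_case:
        a, b = a.lower(), b.lower()
    if not a or not b:
        return None  # simplified 'no data' handling
    base = min(len(a), len(b)) if relate_to_shorter else max(len(a), len(b))
    return round(100 * lcss(a, b, min_len) / base)


pairs = [
    ("Mary Elizabeth Angus", "Mary Elizabeth Francis"),
    ("Xiaojian Zhong", "ZHONG Xiao Jian"),
    ("Mary Elizabeth Angus", "Mary Elizabeth"),
    ("Tan Tan WONG", "WONG Tantan"),
    ("James Patrick Robinson", "Robin Patrick Jameson"),
]
for value_a, value_b in pairs:
    print(value_a, "|", value_b, "|", lcssp(value_a, value_b))
```

Under these assumptions the sketch prints 65, 100, 72, 100 and 85 for the five pairs, matching the table. Passing `relate_to_shorter=True` for the third pair would divide by 13 instead of 18, which shows how relating to the shorter input tolerates extra data in the longer value.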