Comparison: Longest Common Substring Sum
The Longest Common Substring Sum comparison offers a powerful way of determining the similarity between two String/String Array values, particularly where those values contain long strings of characters, or many words.
The Longest Common Substring Sum (LCSS) is calculated as the length, in characters, of the longest common substring shared by the two values, plus the lengths of all other non-overlapping common substrings. A minimum substring length, in characters, is specified as an option of the comparison. The comparison does not depend on the order in which the substrings are found in each string value.
Note that this is not necessarily the same as the largest possible sum of common substring lengths.
When comparing two strings, it is sometimes possible to construct several different sets of non-overlapping substrings. The Longest Common Substring Sum comparison will always ensure that it uses a set which includes the longest common substring shared between the two values, even if this does not result in the largest possible match score.
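The behaviour described above can be illustrated with a minimal greedy sketch. It is not the product's implementation: the function names, the tie-breaking between equally long substrings, and the use of lower-casing to ignore case are all assumptions made for illustration. The sketch repeatedly takes the longest remaining common substring, adds its length to the score, and removes it from both values so that later matches cannot overlap it.

```python
def longest_common_substring(a: str, b: str) -> tuple[int, int, int]:
    """Return (length, start_in_a, start_in_b) of the longest common substring."""
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best[0]:
                    best = (curr[j], i - curr[j], j - curr[j])
        prev = curr
    return best


def lcss_sum(a: str, b: str, min_len: int = 4, ignore_case: bool = True) -> int:
    """Greedy sum of non-overlapping common substrings of at least min_len characters.

    Hypothetical helper: ties between equally long substrings are broken
    arbitrarily (first one found), which is an assumption of this sketch.
    """
    if ignore_case:
        a, b = a.lower(), b.lower()
    # Each value is kept as a list of unmatched segments. Segments are never
    # rejoined, so a later match cannot span a region that was already used.
    segs_a, segs_b = [a], [b]
    total = 0
    while True:
        best = (0, 0, 0, 0, 0)  # length, segment index/start in a, segment index/start in b
        for ia, sa in enumerate(segs_a):
            for ib, sb in enumerate(segs_b):
                n, pa, pb = longest_common_substring(sa, sb)
                if n > best[0]:
                    best = (n, ia, pa, ib, pb)
        n, ia, pa, ib, pb = best
        if n < max(min_len, 1):
            return total
        total += n
        sa, sb = segs_a[ia], segs_b[ib]
        segs_a[ia:ia + 1] = [sa[:pa], sa[pa + n:]]
        segs_b[ib:ib + 1] = [sb[:pb], sb[pb + n:]]


# Made-up values chosen so that the longest substring blocks a larger sum:
# "cdefgh" (6) is taken first, which leaves only "ab" and "ij" unmatched,
# whereas "abcde" + "fghij" would have summed to 10.
print(lcss_sum("abcdefghij", "cdefgh-abcde-fghij", min_len=4))  # 6
```

The two input strings in the usage line are invented purely to show the difference between the score this approach produces and the largest possible sum of common substring lengths.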
Use the Longest Common Substring Sum comparison to find fuzzy matches between String values where the data values generally contain a large number of characters or words, but where typos or other variations (for example, extra words or abbreviations in either value) may exist. For example, data such as company names with long potential values may be stored in a fixed-length field, leading users to abbreviate certain words. When matching against other systems without such restrictions, matches can be difficult to find. However, a Longest Common Substring Sum between values such as "Kingfisher Computer Services and Technology Limited" and "Kingfisher Comp Servs & Tech Ltd." will give a matching score as high as 23 characters, indicating a strong match, if the minimum substring length is 4 (that is, if the Include substrings greater than length option is set to 3), as the distinct substrings "Kingfisher Comp" (15 characters), "Serv" (4 characters) and "Tech" (4 characters) will all match.
Note that as substrings must not overlap, the substring "Kingfisher Comp" is counted only once, and substrings of it that are 4 characters or longer (such as "King", "Kingf", "Kingfi", "ingfi" etc.) are not counted.
If a substring is found in both values, and is long enough, the order in which it is found compared to other substrings is irrelevant. For example, the string "Kingfisher Servs & Tech" will match "Kingfisher Tech & Servs" with a score of 20 (composed of the substrings "Kingfisher " (11 characters including the space), "Tech" (4 characters), and "Servs" (5 characters)), assuming the minimum substring length is 4.
This comparison supports the use of result bands.
The following table describes the configuration options:
Option | Type | Description | Default Value |
---|---|---|---|
Match No Data pairs? | Yes/No | This option determines the result of a comparison when it compares two No Data (Null, or containing only whitespace characters) values for an identifier. If set to No, the comparison will give a 'no data' result when comparing a No Data value against another No Data value. If set to Yes, the comparison will give a result of 0 when comparing a No Data value against another No Data value. A 'no data' result will only be returned if a No Data value is compared against a populated value. | No |
Ignore case? | Yes/No | Sets whether or not to ignore case when comparing values. | Yes |
Include substrings greater than length | Integer | Common substrings between the two values being compared must be longer than the specified value to contribute to the overall Longest Common Substring Sum score. If set to 3, distinct (non-overlapping) substrings of 4 or more characters that are common between two values will be included in the LCSS calculation. For example, the values "Acme Micros Ltd Serv" and "Acme and Partners Micro Services Ltd" would give an LCSS of 9, assuming whitespace is trimmed before comparing. This would be calculated as 4 characters for the common substring "Acme", and 5 characters for the common substring "Micro". Note that the common substring "Ltd" would not be included in the calculation as its length is not greater than 3 characters. | 4 |
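The Match No Data pairs? option can be read as a small decision rule applied before any score is calculated. The following is a minimal sketch of that rule under stated assumptions: the names `is_no_data` and `compare_with_no_data` are hypothetical, `None` stands in for a 'no data' result, and `score` is whatever function computes the Longest Common Substring Sum for two populated values.

```python
from typing import Callable, Optional


def is_no_data(value: Optional[str]) -> bool:
    """A value is No Data if it is null or contains only whitespace."""
    return value is None or value.strip() == ""


def compare_with_no_data(
    a: Optional[str],
    b: Optional[str],
    score: Callable[[str, str], int],
    match_no_data_pairs: bool = False,
) -> Optional[int]:
    """Apply the No Data rule, then fall back to the supplied score function.

    Returns None to represent a 'no data' result (an assumption of this sketch).
    """
    a_empty, b_empty = is_no_data(a), is_no_data(b)
    if a_empty and b_empty:
        # Two No Data values: 0 if pairs are matched, otherwise 'no data'.
        return 0 if match_no_data_pairs else None
    if a_empty or b_empty:
        # A No Data value against a populated value always gives 'no data'.
        return None
    return score(a, b)
```

Passing the score function in as a parameter keeps the No Data rule separate from the substring calculation itself.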
Example
In this example, the Longest Common Substring Sum comparison is used to identify possible matches in company names.
The following options are specified:
- Match No Data pairs? = No
- Ignore case? = Yes
- Include substrings greater than length = 3
A Trim Whitespace transformation is used to remove all whitespace from values before they are compared.
Example results
With the above configuration, the following table illustrates some comparison results:
Table 1-52 Example Results: Longest Common Substring Sum
Value A | Value B | Comparison Result |
---|---|---|
Friars St Dental Practice | Friar Street Dental Pract. | 18 |
Britannia Preservations | Britannia Preservation Ltd | 21 |
Barraclough Partners | Barraclough Stiles and Partners | 19 |
Gem Distribution Ltd | Gem Distribution Ltd (Wildings) | 18 |
Think Consulting Ltd | Think Training | 18 |
Logist Services and Distribution | Consulting Ltd | 18 |
Logist Distribution & Services | Logist Servs and Dist Logist Services & Distribution | 26 |
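The preprocessing used in this example (removing all whitespace, then ignoring case) can be sketched as follows. The helper name `normalize` is hypothetical, and lower-casing is only an assumed stand-in for however the comparison ignores case internally.

```python
def normalize(value: str, ignore_case: bool = True) -> str:
    """Remove all whitespace and, optionally, fold the value to lower case."""
    stripped = "".join(value.split())  # split() with no arguments drops every whitespace run
    return stripped.lower() if ignore_case else stripped


a = normalize("Gem Distribution Ltd")
b = normalize("Gem Distribution Ltd (Wildings)")
print(a, len(a))  # gemdistributionltd 18
print(a in b)     # True
```

After this step, the first value in the Gem Distribution row is an 18-character string contained in full in the second value, which is consistent with the result of 18 shown for that row.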