Data Quality Operators
You can expand data quality capabilities within Oracle Database with string matching operators PHONIC_ENCODE
and FUZZY_MATCH
.
These operators can help you find near duplicate rows by matching strings that sounds alike or have small differences in spelling, for example:
-
"Chris" and "Kris", in strings that sound alike
-
"kitten" and "sitten", in strings that have small differences in spelling
FUZZY_MATCH
FUZZY_MATCH
takes the algorithm to be used as the first argument, the strings to be processed as the second and third arguments, and some optional arguments that control the quality of the desired output.
The UTL_MATCH
package evaluates byte by byte, while FUZZY_MATCH
evaluates character by character. Therefore UTL_MATCH
only works for comparison between single-byte strings while FUZZY_MATCH
handles multi-byte charactersets.
When the UNSCALED
option is specified, FUZZY_MATCH
returns a measure in characters for the following algorithms: LEVENSHTEIN
, DAMERAU_LEVENSHTEIN
, BIGRAM
, TRIGRAM
, LONGEST_COMMON_SUBSTRING
.
Arguments
The supported algorithms are:
-
LEVENSHTEIN
corresponds toUTL_MATCH.EDIT_DISTANCE
orUTL_MATCH.EDIT_SIMILARITY
and gives a measure of character edit distance or similarity. -
DAMERAU_LEVENSHTEIN
distance differs from the classicalLEVENSHTEIN
distance by including transpositions among its allowable operations in addition to the three classical single-character edit operations (insertions, deletions and substitutions). -
JARO_WINKLER
corresponds toUTL_MATCH.JARO_WINKLER
(a percentage between 0-1) orUTL_MATCH.JARO_WINKLER_SIMILARITY
(the same but scaled from 0-100). -
BIGRAM
andTRIGRAM
are instances of the N-gram matching technique, which counts the number of common contiguous sub-strings (grams) between the two strings. -
WHOLE_WORD_MATCH
corresponds to Word Match Percentage or Count comparison in Oracle Enterprise Data Quality. It calculates theLEVENSHTEIN
or edit distance of two phrases with words (instead of letters) as matching units. -
LONGEST_COMMON_SUBSTRING
finds the longest common substring between the two strings.
Both str
arguments can be any of the datatypes CHAR
, VARCHAR2
, NCHAR
, NVARCHAR2
.
UNSCALED
The keyword UNSCALED
is optional. If you specify UNSCALED
the return is one of :
-
LEVENSHTEIN
or edit distance -
JARO_WINKLER
value in percentage -
N-grams, the number of common substrings
-
LCS, the length of the longest common substring
RELATE_TO_SHORTER
The keyword RELATE_TO_SHORTER
is optional. If you specify RELATE_TO_SHORTER
, then the similarity measure is scaled by the length of the shorter input string. If you do not specify RELATE_TO_SHORTER
, then the default behavior is that the longer string length is used as the denominator.
EDIT_TOLERANCE
The keyword EDIT_TOLERANCE
is optional. You can only specify EDIT_TOLERANCE
with the WHOLE_WORD_MATCH
algorithm. If you specify EDIT_TOLERANCE
, the character error tolerance is the maximum percentage of the number of characters in a word that you allow to be different, while still considering each word as the same.
Returns
The operator returns NUMBER
. By default, it is a similarity score normalized to be a percentage between 0-100.
Examples
SQL> select fuzzy_match(LEVENSHTEIN, 'Mohamed Tarik', 'Mo Tariq') from dual; FUZZY_MATCH(LEVENSHTEIN,'MOHAMEDTARIK','MOTARIQ') ------------------------------------------------- 54 1 row selected.
SQL> select fuzzy_match(LEVENSHTEIN, 'Mohamed Tarik', 'Mo Tariq', unscaled) from dual; FUZZY_MATCH(LEVENSHTEIN,'MOHAMEDTARIK','MOTARIQ',UNSCALED) ---------------------------------------------------------- 6 1 row selected.
SQL> select fuzzy_match(DAMERAU_LEVENSHTEIN, 'Mohamed Tarik', 'Mo Tariq', relate_to_shorter) from dual; FUZZY_MATCH(DAMERAU_LEVENSHTEIN,'MOHAMEDTARIK','MOTARIQ',RELATE_TO_SHORTER) --------------------------------------------------------------------------- 25 1 row selected.
SQL> select fuzzy_match(BIGRAM, 'Mohamed Tarik', 'Mo Tariq', unscaled) from dual; FUZZY_MATCH(BIGRAM,'MOHAMEDTARIK','MOTARIQ',UNSCALED) ----------------------------------------------------- 5 1 row selected.
SQL> select fuzzy_match(LONGEST_COMMON_SUBSTRING, 'Mohamed Tarik', 'Mo Tariq', unscaled) from dual; FUZZY_MATCH(LONGEST_COMMON_SUBSTRING,'MOHAMEDTARIK','MOTARIQ',UNSCALED) ----------------------------------------------------------------------- 5 1 row selected.
SQL> select fuzzy_match(WHOLE_WORD_MATCH, 'Mohamed Tarik', 'Mo Tariq') from dual; FUZZY_MATCH(WHOLE_WORD_MATCH,'MOHAMEDTARIK','MOTARIQ') ------------------------------------------------------ 0 1 row selected
SQL> select fuzzy_match(WHOLE_WORD_MATCH, 'Pawan Kumar Goel', 'Pavan Kumar G', EDIT_TOLERANCE 60) from dual; FUZZY_MATCH(WHOLE_WORD_MATCH,'PAWANKUMARGOEL','PAVANKUMARG',EDIT_TOLERANCE60) ----------------------------------------------------------------------------- 67 1 row selected.
PHONIC_ENCODE
PHONIC_ENCODE
takes the algorithm to be used as the first argument, the string to be processed as the second argument, and an optional max_code_len
argument that controls the length of the desired output. max_code_len
must be an integer between 1 and 12.
Arguments
DOUBLE_METAPHONE
returns the primary code. DOUBLE_METAPHONE_ALT
returns the alternative code if present. If the alternative code is not present, it returns the primary code.
str
can be any of the datatypes CHAR
, VARCHAR2
, NCHAR
, NVARCHAR2
.
The optional argument max_code_len
must be an integer. It allows codes longer than the default 4 characters to be returned for the original Metaphone algorithm.
Returns
The operator returns VARCHAR2
.
Examples
SQL> select phonic_encode(DOUBLE_METAPHONE, 'smith') c1, 2 phonic_encode(DOUBLE_METAPHONE_ALT, 'smith') c2 from dual; C1 C2 ------------- ------------- SM0 XMT 1 row selected.
SQL> select phonic_encode(DOUBLE_METAPHONE, 'Schmidt') c1, 2 phonic_encode(DOUBLE_METAPHONE_ALT, 'Schmidt') c2 from dual; C1 C2 ------------- ------------- XMT SMT 1 row selected.
SQL> select phonic_encode(DOUBLE_METAPHONE, 'phone') c1, 2 phonic_encode(DOUBLE_METAPHONE_ALT, 'phone') c2 from dual; C1 C2 ------------- ------------- FN FN 1 row selected.
SQL> select phonic_encode(DOUBLE_METAPHONE, 'George') c1, 2 phonic_encode(DOUBLE_METAPHONE_ALT, 'George') c2 from dual; C1 C2 ------------- ------------- JRJ KRK 1 row selected.
SQL> -- PNNT / PKNNT SQL> select phonic_encode(DOUBLE_METAPHONE, 'poignant') c1, 2 phonic_encode(DOUBLE_METAPHONE_ALT, 'poignant') c2, 3 phonic_encode(DOUBLE_METAPHONE_ALT, 'poignant', 10) c3 from dual; C1 C2 C3 ------------- ------------- ------------- PNNT PKNN PKNNT