1.3.3.9.16 Match Transformation: Normalize Whitespace
The Normalize Whitespace transformation normalizes the whitespace in a String so that each run of whitespace between words becomes a single space character. It also removes leading and trailing whitespace.
Whitespace is defined in EDQ as:
- Spaces
- Non-printable characters, such as carriage returns, line feeds, and tabs (and all other ASCII characters 0-31)
Use the Normalize Whitespace transformation when keying errors such as multiple spaces may occur in a dataset. Normalizing whitespace may be useful in comparisons, for example, to ensure that the Character edit distance (see Comparison: Character Edit Distance) of values does not register any difference between a single space and many spaces. It may also be useful when clustering, before a Make Array from String transformation, so that forms of whitespace other than a space (such as carriage returns, tabs or other non-printing characters) can all effectively be used as delimiters. This means that the values "John[space]Simpson" and "John[tab][space]Simpson" are tokenized identically, rather than the latter value yielding a "John[tab]" cluster value, which differs from "John".
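The tokenization effect described above can be illustrated outside EDQ. The following Python sketch is not EDQ code: `normalize_whitespace` and `make_array` are hypothetical helpers, written on the assumption that whitespace means a space plus ASCII characters 0-31 (as defined earlier in this section) and that the Make Array from String step splits on a single space delimiter.

```python
import re
from typing import List

# Assumption: whitespace is the space character plus ASCII 0-31,
# following the definition in this section. This is an illustration,
# not EDQ's implementation.
WHITESPACE_RUN = re.compile(r"[\x00-\x20]+")


def normalize_whitespace(value: str) -> str:
    """Collapse each whitespace run to one space and trim both ends."""
    return WHITESPACE_RUN.sub(" ", value).strip(" ")


def make_array(value: str, delimiter: str = " ") -> List[str]:
    """Hypothetical stand-in for splitting a string on a space delimiter."""
    return value.split(delimiter)


plain = "John Simpson"
tabbed = "John\t Simpson"

# Without normalization, the tab stays attached to the first token.
raw_tokens = make_array(tabbed)

# With normalization first, both values produce the same tokens.
clean_tokens = make_array(normalize_whitespace(tabbed))
plain_tokens = make_array(normalize_whitespace(plain))
```

Here `raw_tokens` keeps the tab inside its first element, while `clean_tokens` and `plain_tokens` are equal, which is the behavior the clustering scenario relies on.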
Options
None.
Example transformations
The following table shows example transformations:
Table 1-88 Example Transformations for Normalize Whitespace
Value | Transformed Value |
---|---|
John[space][tab][carriage return]Simpson | John[space]Simpson |
John[space][space]Simpson | John[space]Simpson |
[space]John[space]Simpson | John[space]Simpson |
John[space]Simpson[space][carriage return] | John[space]Simpson |
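The rows in Table 1-88 can be reproduced with a short Python sketch. It is an approximation for illustration only: `normalize_whitespace` is a hypothetical function that assumes whitespace means a space plus ASCII characters 0-31, and it is not taken from EDQ itself.

```python
import re


def normalize_whitespace(value: str) -> str:
    """Approximate the transformation: collapse runs, then trim the ends.

    Assumption: whitespace is a space plus ASCII characters 0-31.
    """
    collapsed = re.sub(r"[\x00-\x20]+", " ", value)
    return collapsed.strip(" ")


# Input values from the table, with [space], [tab] and [carriage return]
# written as the actual characters.
examples = [
    "John \t\rSimpson",
    "John  Simpson",
    " John Simpson",
    "John Simpson \r",
]

results = [normalize_whitespace(value) for value in examples]
```

Under these assumptions, every entry in `results` is the string "John Simpson", matching the Transformed Value column.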