MySQL 9.3 Reference Manual Including MySQL NDB Cluster 9.3
This section describes the LDML syntax that MySQL recognizes.
This is a subset of the syntax described in the LDML
specification available at
http://www.unicode.org/reports/tr35/, which
should be consulted for further information. MySQL recognizes
a large enough subset of the syntax that, in many cases, it is
possible to download a collation definition from the Unicode
Common Locale Data Repository and paste the relevant part
(that is, the part between the
<rules>
and
</rules>
tags) into the MySQL
Index.xml
file. The rules described here
are all supported except that character sorting occurs only at
the primary level. Rules that specify differences at secondary
or higher sort levels are recognized (and thus can be included
in collation definitions) but are treated as equality at the
primary level.
The MySQL server generates diagnostics when it finds problems
while parsing the Index.xml
file. See
Section 12.14.4.3, “Diagnostics During Index.xml Parsing”.
Character Representation
Characters named in LDML rules can be written literally or in
\u
format,
where nnnn
nnnn
is the hexadecimal
Unicode code point value. For example, A
and á
can be written literally or as
\u0041
and \u00E1
.
Within hexadecimal values, the digits A
through F
are not case-sensitive;
\u00E1
and \u00e1
are
equivalent. For UCA 4.0.0 collations, hexadecimal notation can
be used only for characters in the Basic Multilingual Plane,
not for characters outside the BMP range of
0000
to FFFF
. For UCA
5.2.0 collations, hexadecimal notation can be used for any
character.
The Index.xml
file itself should be
written using UTF-8 encoding.
Syntax Rules
LDML has reset rules and shift rules to specify character ordering. Orderings are given as a set of rules that begin with a reset rule that establishes an anchor point, followed by shift rules that indicate how characters sort relative to the anchor point.
A <reset>
rule does not specify
any ordering in and of itself. Instead, it
“resets” the ordering for subsequent shift
rules to cause them to be taken in relation to a given
character. Either of the following rules resets subsequent
shift rules to be taken in relation to the letter
'A'
:
<reset>A</reset> <reset>\u0041</reset>
The <p>
,
<s>
, and
<t>
shift rules define primary,
secondary, and tertiary differences of a character from
another character:
Use primary differences to distinguish separate letters.
Use secondary differences to distinguish accent variations.
Use tertiary differences to distinguish lettercase variations.
Either of these rules specifies a primary shift rule for
the 'G'
character:
<p>G</p> <p>\u0047</p>
The <i>
shift rule indicates that
one character sorts identically to another. The following
rules cause 'b'
to sort the same as
'a'
:
<reset>a</reset> <i>b</i>
Abbreviated shift syntax specifies multiple shift rules using a single pair of tags. The following table shows the correspondence between abbreviated syntax rules and the equivalent nonabbreviated rules.
Table 12.5 Abbreviated Shift Syntax
Abbreviated Syntax | Nonabbreviated Syntax |
---|---|
<pc>xyz</pc> |
<p>x</p><p>y</p><p>z</p> |
<sc>xyz</sc> |
<s>x</s><s>y</s><s>z</s> |
<tc>xyz</tc> |
<t>x</t><t>y</t><t>z</t> |
<ic>xyz</ic> |
<i>x</i><i>y</i><i>z</i> |
An expansion is a reset rule that establishes an anchor
point for a multiple-character sequence. MySQL supports
expansions 2 to 6 characters long. The following rules put
'z'
greater at the primary level than
the sequence of three characters 'abc'
:
<reset>abc</reset> <p>z</p>
A contraction is a shift rule that sorts a
multiple-character sequence. MySQL supports contractions 2
to 6 characters long. The following rules put the sequence
of three characters 'xyz'
greater at
the primary level than 'a'
:
<reset>a</reset> <p>xyz</p>
Long expansions and long contractions can be used
together. These rules put the sequence of three characters
'xyz'
greater at the primary level than
the sequence of three characters 'abc'
:
<reset>abc</reset> <p>xyz</p>
Normal expansion syntax uses <x>
plus <extend>
elements to specify
an expansion. The following rules put the character
'k'
greater at the secondary level than
the sequence 'ch'
. That is,
'k'
behaves as if it expands to a
character after 'c'
followed by
'h'
:
<reset>c</reset> <x><s>k</s><extend>h</extend></x>
This syntax permits long sequences. These rules sort the
sequence 'ccs'
greater at the tertiary
level than the sequence 'cscs'
:
<reset>cs</reset> <x><t>ccs</t><extend>cs</extend></x>
The LDML specification describes normal expansion syntax as “tricky.” See that specification for details.
Previous context syntax uses <x>
plus <context>
elements to
specify that the context before a character affects how it
sorts. The following rules put '-'
greater at the secondary level than
'a'
, but only when
'-'
occurs after
'b'
:
<reset>a</reset> <x><context>b</context><s>-</s></x>
Previous context syntax can include the
<extend>
element. These rules put
'def'
greater at the primary level than
'aghi'
, but only when
'def'
comes after
'abc'
:
<reset>a</reset> <x><context>abc</context><p>def</p><extend>ghi</extend></x>
Reset rules permit a before
attribute.
Normally, shift rules after a reset rule indicate
characters that sort after the reset character. Shift
rules after a reset rule that has the
before
attribute indicate characters
that sort before the reset character. The following rules
put the character 'b'
immediately
before 'a'
at the primary level:
<reset before="primary">a</reset> <p>b</p>
Permissible before
attribute values
specify the sort level by name or the equivalent numeric
value:
<reset before="primary"> <reset before="1"> <reset before="secondary"> <reset before="2"> <reset before="tertiary"> <reset before="3">
A reset rule can name a logical reset position rather than a literal character:
<first_tertiary_ignorable/> <last_tertiary_ignorable/> <first_secondary_ignorable/> <last_secondary_ignorable/> <first_primary_ignorable/> <last_primary_ignorable/> <first_variable/> <last_variable/> <first_non_ignorable/> <last_non_ignorable/> <first_trailing/> <last_trailing/>
These rules put 'z'
greater at the
primary level than nonignorable characters that have a
Default Unicode Collation Element Table (DUCET) entry and
that are not CJK:
<reset><last_non_ignorable/></reset> <p>z</p>
Logical positions have the code points shown in the following table.
Table 12.6 Logical Reset Position Code Points
Logical Position | Unicode 4.0.0 Code Point | Unicode 5.2.0 Code Point |
---|---|---|
<first_non_ignorable/> |
U+02D0 | U+02D0 |
<last_non_ignorable/> |
U+A48C | U+1342E |
<first_primary_ignorable/> |
U+0332 | U+0332 |
<last_primary_ignorable/> |
U+20EA | U+101FD |
<first_secondary_ignorable/> |
U+0000 | U+0000 |
<last_secondary_ignorable/> |
U+FE73 | U+FE73 |
<first_tertiary_ignorable/> |
U+0000 | U+0000 |
<last_tertiary_ignorable/> |
U+FE73 | U+FE73 |
<first_trailing/> |
U+0000 | U+0000 |
<last_trailing/> |
U+0000 | U+0000 |
<first_variable/> |
U+0009 | U+0009 |
<last_variable/> |
U+2183 | U+1D371 |
The <collation>
element permits a
shift-after-method
attribute that
affects character weight calculation for shift rules. The
attribute has these permitted values:
simple
: Calculate character weights
as for reset rules that do not have a
before
attribute. This is the
default if the attribute is not given.
expand
: Use expansions for shifts
after reset rules.
Suppose that '0'
and
'1'
have weights of
0E29
and 0E2A
and we
want to put all basic Latin letters between
'0'
and '1'
:
<reset>0</reset> <pc>abcdefghijklmnopqrstuvwxyz</pc>
For simple shift mode, weights are calculated as follows:
'a' has weight 0E29+1 'b' has weight 0E29+2 'c' has weight 0E29+3 ...
However, there are not enough vacant positions to put 26
characters between '0'
and
'1'
. The result is that digits and
letters are intermixed.
To solve this, use
shift-after-method="expand"
. Then
weights are calculated like this:
'a' has weight [0E29][233D+1] 'b' has weight [0E29][233D+2] 'c' has weight [0E29][233D+3] ...
233D
is the UCA 4.0.0 weight for
character 0xA48C
, which is the last
nonignorable character (a sort of the greatest character
in the collation, excluding CJK). UCA 5.2.0 is similar but
uses 3ACA
, for character
0x1342E
.
MySQL-Specific LDML Extensions
An extension to LDML rules permits the
<collation>
element to include an
optional version
attribute in
<collation>
tags to indicate the UCA
version on which the collation is based. If the
version
attribute is omitted, its default
value is 4.0.0
. For example, this
specification indicates a collation that is based on UCA
5.2.0:
<collation id="nnn
" name="utf8mb4_xxx
_ci" version="5.2.0"> ... </collation>