MySQL 8.0 Reference Manual Including MySQL NDB Cluster 8.0
This section describes issues that you may face when converting
character data between the utf8mb3
and
utf8mb4
character sets.
This discussion focuses primarily on converting between
utf8mb3
and utf8mb4
, but
similar principles apply to converting between the
ucs2
character set and character sets such
as utf16
or utf32
.
The utf8mb3
and utf8mb4
character sets differ as follows:
utf8mb3
supports only characters in the
Basic Multilingual Plane (BMP). utf8mb4
additionally supports supplementary characters that lie
outside the BMP.
utf8mb3
uses a maximum of three bytes per
character. utf8mb4
uses a maximum of four
bytes per character.
This discussion refers to the utf8mb3
and
utf8mb4
character set names to be explicit
about referring to 3-byte and 4-byte UTF-8 character set data.
One advantage of converting from utf8mb3
to
utf8mb4
is that this enables applications to
use supplementary characters. One tradeoff is that this may
increase data storage space requirements.
In terms of table content, conversion from
utf8mb3
to utf8mb4
presents no problems:
For a BMP character, utf8mb4
and
utf8mb3
have identical storage
characteristics: same code values, same encoding, same
length.
For a supplementary character, utf8mb4
requires four bytes to store it, whereas
utf8mb3
cannot store the character at
all. When converting utf8mb3
columns to
utf8mb4
, you need not worry about
converting supplementary characters because there are none.
In terms of table structure, these are the primary potential incompatibilities:
For the variable-length character data types
(VARCHAR
and the
TEXT
types), the maximum
permitted length in characters is less for
utf8mb4
columns than for
utf8mb3
columns.
For all character data types
(CHAR
,
VARCHAR
, and the
TEXT
types), the maximum
number of characters that can be indexed is less for
utf8mb4
columns than for
utf8mb3
columns.
Consequently, to convert tables from utf8mb3
to utf8mb4
, it may be necessary to change
some column or index definitions.
Tables can be converted from utf8mb3
to
utf8mb4
by using ALTER
TABLE
. Suppose that a table has this definition:
CREATE TABLE t1 ( col1 CHAR(10) CHARACTER SET utf8mb3 COLLATE utf8mb3_unicode_ci NOT NULL, col2 CHAR(10) CHARACTER SET utf8mb3 COLLATE utf8mb3_bin NOT NULL ) CHARACTER SET utf8mb3;
The following statement converts t1
to use
utf8mb4
:
ALTER TABLE t1 DEFAULT CHARACTER SET utf8mb4, MODIFY col1 CHAR(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL, MODIFY col2 CHAR(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL;
The catch when converting from utf8mb3
to
utf8mb4
is that the maximum length of a
column or index key is unchanged in terms of
bytes. Therefore, it is smaller in terms of
characters because the maximum length of a
character is four bytes instead of three. For the
CHAR
,
VARCHAR
, and
TEXT
data types, watch for these
issues when converting your MySQL tables:
Check all definitions of utf8mb3
columns
and make sure they do not exceed the maximum length for the
storage engine.
Check all indexes on utf8mb3
columns and
make sure they do not exceed the maximum length for the
storage engine. Sometimes the maximum can change due to
storage engine enhancements.
If the preceding conditions apply, you must either reduce the
defined length of columns or indexes, or continue to use
utf8mb3
rather than
utf8mb4
.
Here are some examples where structural changes may be needed:
A TINYTEXT
column can hold up
to 255 bytes, so it can hold up to 85 3-byte or 63 4-byte
characters. Suppose that you have a
TINYTEXT
column that uses
utf8mb3
but must be able to contain more
than 63 characters. You cannot convert it to
utf8mb4
unless you also change the data
type to a longer type such as
TEXT
.
Similarly, a very long
VARCHAR
column may need to be
changed to one of the longer
TEXT
types if you want to
convert it from utf8mb3
to
utf8mb4
.
InnoDB
has a maximum index length of 767
bytes for tables that use
COMPACT
or
REDUNDANT
row format, so for utf8mb3
or
utf8mb4
columns, you can index a maximum
of 255 or 191 characters, respectively. If you currently
have utf8mb3
columns with indexes longer
than 191 characters, you must index a smaller number of
characters.
In an InnoDB
table that uses
COMPACT
or
REDUNDANT
row format, these column and index definitions are legal:
col1 VARCHAR(500) CHARACTER SET utf8mb3, INDEX (col1(255))
To use utf8mb4
instead, the index must be
smaller:
col1 VARCHAR(500) CHARACTER SET utf8mb4, INDEX (col1(191))
For InnoDB
tables that use
COMPRESSED
or
DYNAMIC
row format, index key
prefixes longer than 767 bytes (up to 3072 bytes)
are permitted. Tables created with these row formats
enable you to index a maximum of 1024 or 768 characters
for utf8mb3
or
utf8mb4
columns, respectively. For
related information, see Section 17.22, “InnoDB Limits”,
and DYNAMIC Row Format.
The preceding types of changes are most likely to be required
only if you have very long columns or indexes. Otherwise, you
should be able to convert your tables from
utf8mb3
to utf8mb4
without
problems, using ALTER TABLE
as
described previously.
The following items summarize other potential incompatibilities:
SET NAMES 'utf8mb4'
causes use of the
4-byte character set for connection character sets. As long
as no 4-byte characters are sent from the server, there
should be no problems. Otherwise, applications that expect
to receive a maximum of three bytes per character may have
problems. Conversely, applications that expect to send
4-byte characters must ensure that the server understands
them.
For replication, if character sets that support supplementary characters are to be used on the source, all replicas must understand them as well.
Also, keep in mind the general principle that if a table has
different definitions on the source and replica, this can
lead to unexpected results. For example, the differences in
maximum index key length make it risky to use
utf8mb3
on the source and
utf8mb4
on the replica.
If you have converted to utf8mb4
,
utf16
, utf16le
, or
utf32
, and then decide to convert back to
utf8mb3
or ucs2
(for
example, to downgrade to an older version of MySQL), these
considerations apply:
utf8mb3
and ucs2
data
should present no problems.
The server must be recent enough to recognize definitions referring to the character set from which you are converting.
For object definitions that refer to the
utf8mb4
character set, you can dump them
with mysqldump prior to downgrading, edit
the dump file to change instances of
utf8mb4
to utf8
, and
reload the file in the older server, as long as there are no
4-byte characters in the data. The older server sees
utf8
in the dump file object definitions
and create new objects that use the (3-byte)
utf8
character set.