Converting Character Sets

The web is going the way of utf8. Drizzle has chosen it as the default character set, most back-ends to websites use it to store text data, and those who are still using latin1 have begun to migrate their databases to utf8. Googling for “mysql convert charset to utf8” results in a plethora of sites, each with a slightly different approach, and each broken in some respect. I’ll outline those approaches here and show why they don’t work, and then present a script that can generically be used to convert a database (or set of tables) to a target character set and collation.

Approach #1: a single ALTER TABLE … CONVERT TO CHARACTER SET.

Take the following table as an example of why this approach will not work:
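A minimal sketch of the kind of table and statement in question, using illustrative names (t1, c1, ft_c1 are assumptions, not the original example):

    CREATE TABLE t1 (
      c1 TEXT,
      FULLTEXT KEY ft_c1 (c1)
    ) ENGINE=MyISAM DEFAULT CHARSET=latin1;

    -- Approach #1: let MySQL convert the whole table in one statement
    ALTER TABLE t1 CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;

    SHOW CREATE TABLE t1\G
    -- c1 now comes back as MEDIUMTEXT: to guarantee the converted utf8
    -- data still fits, MySQL silently widens the column type.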

Notice the implicit conversion of c1 from text to mediumtext. This approach can result in modified data types and silent data truncation, which makes it unacceptable for our purposes.

Approach #2 (outlined here):

This approach avoids the issue of implicit conversions by changing each data type to its binary counterpart before conversion. Due to implementation limitations, however, it also converts any pre-existing binary columns to their text counterparts. Additionally, this approach will fail outright when a target column is part of a FULLTEXT index, because a binary column cannot be part of one. Even if these limitations are overcome, the process is inherently unsuitable for large databases because it requires multiple ALTER statements to be run against each table (sketched below the list):

1) Drop FULLTEXT indexes
2) Convert target columns to their binary counterparts
3) Convert the table to the target character set
4) Convert target columns to their original data types
5) Add FULLTEXT indexes back
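As a rough illustration only, using the hypothetical t1/c1/ft_c1 names from the sketch above, the sequence looks something like this (each ALTER rebuilds the table):

    -- 1) Drop FULLTEXT indexes
    ALTER TABLE t1 DROP INDEX ft_c1;
    -- 2) Convert target columns to their binary counterparts
    ALTER TABLE t1 MODIFY c1 BLOB;
    -- 3) Convert the table to the target character set
    ALTER TABLE t1 CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
    -- 4) Convert target columns back to their original data types
    ALTER TABLE t1 MODIFY c1 TEXT CHARACTER SET utf8 COLLATE utf8_general_ci;
    -- 5) Add FULLTEXT indexes back
    ALTER TABLE t1 ADD FULLTEXT INDEX ft_c1 (c1);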

For those of us routinely waiting hours, if not days, for a single ALTER statement to finish, this is unacceptable.

Approach #3:

Dumping the entire database and re-importing it with the appropriate server & client character sets.

This is a three-step process: first dump only the schema and edit it by hand to set the appropriate character sets, then dump the data separately, and finally re-create the schema and import the data. If you’re using replication, this usually isn’t even an option, because you’ll generate a ridiculous amount of binary logs and force a reload of the data on every server in the replication chain (very time-, bandwidth-, and disk-space-consuming).

Except for Approach #1, these approaches are much more difficult than they need to be. Consider the following ALTER statement against the table in Approach #1:
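Assuming the hypothetical t1 from the sketch above, such a statement would take roughly this form (the target character set and collation are whatever you are converting to):

    ALTER TABLE t1
      DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci,
      MODIFY c1 TEXT CHARACTER SET utf8 COLLATE utf8_general_ci;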

This statement changes both the table’s default character set and the character set of the target column, while leaving any FULLTEXT indexes in place, and it requires only a single ALTER statement for a given table. A Perl script has been put together to parallelize the ALTER statements and is available at:

It will be added to Percona Tools on Launchpad (or perhaps maatkit, if it proves useful enough) once it is feature complete. Outstanding issues include:

– Proper handling of string foreign keys (currently fails, but you probably shouldn’t be using strings as foreign keys anyway …)
– Allow throttling of the number of threads created (currently creates one per table)

Comments (25)

  • Shlomi Noach

    Shameless plug:

    Allow me to suggest using oak-modify-charset, part of the openark kit, which modifies the character set of a single text column.
    At present the utility supports changing a single column, but as it matures it will also support grouping multiple columns into one ALTER statement.
    You may also use --print-only to get the ALTER command without executing it, so you can still tailor it using your favorite Perl/awk/Python script.

    Converting an entire table is undesirable, in my opinion, since, in addition to said issue, it converts all columns in that table. What if I have a CHAR(32) column for some md5 checksum? I wouldn’t want that to grow to 96 UTF8 bytes.

    March 17, 2009 at 9:29 pm
  • Mrten

    You should store things like md5 and sha1 as binary data in the first place, as it is wasteful (twice the space, to be precise) to store them in hex, with no tangible benefits. Likewise for IP addresses (the ones you’re not searching for, at least).
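    A tiny sketch of that idea (the table name is hypothetical): store the raw 16 bytes of an MD5 digest instead of its 32-character hex form, and hex it back out only for display.

        CREATE TABLE digests (
          id  INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
          md5 BINARY(16) NOT NULL   -- 16 raw bytes vs. CHAR(32) of hex
        );
        INSERT INTO digests (md5) VALUES (UNHEX(MD5('some value')));
        SELECT id, HEX(md5) FROM digests;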

    March 18, 2009 at 3:47 am
  • Jakub Vrána

    There are two kinds of character set conversion:

    1. The columns are declared as latin1 and the data really are stored in latin1. Then a simple MODIFY can convert everything.

    2. The columns are declared as latin1 but the data are actually stored in utf8. This is quite common for applications created with MySQL < 4.1 or for applications which do not use SET CHARACTER SET or SET NAMES. A simple MODIFY would mangle the data, and a conversion to binary and then to utf8 is necessary.
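    A sketch of the two cases, assuming an illustrative latin1-declared TEXT column c1 on table t1:

        -- Case 1: the bytes really are latin1; a direct MODIFY converts them
        ALTER TABLE t1 MODIFY c1 TEXT CHARACTER SET utf8;

        -- Case 2: utf8 bytes were stored in a latin1 column; go through binary
        -- so the bytes are reinterpreted as utf8 rather than converted again
        ALTER TABLE t1 MODIFY c1 BLOB;
        ALTER TABLE t1 MODIFY c1 TEXT CHARACTER SET utf8;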

    March 18, 2009 at 3:49 am
  • Shlomi Noach

    @Mrten,

    I partially agree. As exceptions, take a look at mysql.user: the hashed passwords are textual, not binary. Also, I believe indexing a textual column is easier to handle – but that may be a minor issue.

    With regard to IP addresses – I completely agree. I even wrote a post about it a few months ago.

    thanks

    March 18, 2009 at 1:41 pm
  • Shlomi Noach

    @Mrten

    In any case, other examples can be email addresses, local file names (hopefully ASCII), mount points, airport abbreviations, etc. There are quite a few ASCII-only texts, which was my point.

    Regards

    March 18, 2009 at 9:18 pm
  • Marki

    @Jakub
    I’ve just migrated around 300 databases from 4.0 to 5.0. It was not easy, because every database could use a different charset…
    1) Know the charset of the database
    2) Export the database from the MySQL 4.0 server
    3) Create a new empty database on MySQL 5.0 with the correct default charset
    4) Import the data into the MySQL 5.0 server, but prepend it with SET NAMES:
    (echo "SET NAMES $chset;"; cat $path/$db.sql) | mysql newdb -hnewserver

    March 19, 2009 at 2:54 am
  • Pavel

    What happens if a column has mixed latin1 and utf8 at the same time (left from MySQL 4)?
    Does MySQL detect that right?

    March 23, 2009 at 3:54 am
  • Ryan Lowe

    @shlomi Had I seen that script sooner, I may have used it! convert_charset has a --columns parameter, so it is possible to do one-and-only-one column at a time.

    @jakub & @pavel This script assumes latin1 data is stored in latin1 columns :)

    March 23, 2009 at 7:30 am
  • Shlomi Noach

    @Ryan,

    Cheers. There’s room for more than one utility per task!

    Shlomi

    March 23, 2009 at 8:36 am
  • Gil

    In a multi-master replication setup, if I broke replication and ran this script on the passive master, then started replication from active->passive, would the queries fail or do other unwanted things? Are the binary logs charset-sensitive during replication?

    March 25, 2009 at 9:08 am
  • Shlomi Noach

    @Gil
    If you have a utf8 column on a master, and a latin1 column on a slave, and try to replicate, you may get corrupted data if you insert true utf8 text into the master. The slave will not be able to store utf8 characters and will truncate them.

    March 25, 2009 at 9:51 am
  • Gil

    Thanks @Shlomi

    My question was more about what happens if you have latin1 on the master and utf8 on the slave. Same problem?

    March 25, 2009 at 9:53 am
  • Shlomi Noach

    @Gil,

    good question. Will try to set up the configuration and test.

    March 25, 2009 at 10:26 am
  • nadavkav

    Thank you for this valuable post!
    But it was a little too late for me, after trying one of those suggestions on the internet
    similar to the #2 method category for converting text to binary and back to text again.
    We use Mahara (an ePortfolio framework) that was installed with the wrong charset
    and got filled with data very quickly, before I had the chance to notice this issue.
    eventually (after #2 method did not wo