Converting Character Sets

The web is going the way of utf8. Drizzle has chosen it as the default character set, most back-ends to websites use it to store text data, and those who are still using latin1 have begun to migrate their databases to utf8. Googling for “mysql convert charset to utf8” results in a plethora of sites, each with a slightly different approach, and each broken in some respect. I’ll outline those approaches here and show why they don’t work, and then present a script that can generically be used to convert a database (or set of tables) to a target character set and collation.

Approach #1:

Take the following table as an example of why this approach will not work:

Notice the implicit conversion of c1 from text to mediumtext. This approach can result in modified data types and silent data truncation, which makes it unacceptable for our purposes.
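The behavior can be sketched with a hypothetical table (the original example table is not reproduced here; only the TEXT column c1 is known from the text above):

```sql
-- Hypothetical latin1 table with a TEXT column:
CREATE TABLE t1 (
  c1 TEXT
) DEFAULT CHARACTER SET latin1;

-- Approach #1: a blanket conversion of the whole table:
ALTER TABLE t1 CONVERT TO CHARACTER SET utf8;

-- SHOW CREATE TABLE t1 now reports c1 as MEDIUMTEXT: MySQL widens
-- the type because utf8 needs up to three bytes per character.
```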

Approach #2 (outlined here):

This approach avoids the issue of implicit conversions by changing each data type to its binary counterpart before conversion. Due to implementation limitations, however, it also converts any pre-existing binary columns to their text counterparts. Additionally, this approach will fail because a binary column cannot be part of a FULLTEXT index. Even if these limitations are overcome, this process is inherently unsuitable for large databases because it requires multiple alter statements to be run on each table:

1) Drop FULLTEXT indexes
2) Convert target columns to their binary counterparts
3) Convert the table to the target character set
4) Convert target columns to their original data types
5) Add FULLTEXT indexes back

For those of us routinely waiting hours, if not days, for a single alter statement to finish, this is unacceptable.
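Sketched against a hypothetical table t1 with a TEXT column c1 and a FULLTEXT index ft1, those five steps look like this:

```sql
ALTER TABLE t1 DROP INDEX ft1;                    -- 1) drop FULLTEXT indexes
ALTER TABLE t1 MODIFY c1 BLOB;                    -- 2) text -> binary counterpart
ALTER TABLE t1 CONVERT TO CHARACTER SET utf8;     -- 3) convert the table
ALTER TABLE t1 MODIFY c1 TEXT CHARACTER SET utf8; -- 4) back to the original type
ALTER TABLE t1 ADD FULLTEXT INDEX ft1 (c1);       -- 5) re-add FULLTEXT indexes
```

Each of these statements is a full table rebuild, which is where the hours (or days) go.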

Approach #3:

Dumping the entire database and re-importing it with the appropriate server & client character sets.

This is a three-step process: first dump only the schema, edit it by hand to set the appropriate character sets, and then dump the data separately. After that, the schema must be re-created and the data imported. If you’re using replication, this usually isn’t even an option, because you’ll generate a ridiculous amount of binary logs and force a reload of the data on every server in the replication chain (very time-, bandwidth-, and disk-space-consuming).

Except for Approach #1, these approaches are much more difficult than they need to be. Consider the following ALTER statement against the table in Approach #1:

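A statement in this style does the job (a sketch against the hypothetical table from Approach #1; the MODIFY clause must repeat the column’s complete original definition, including any NULL/DEFAULT attributes):

```sql
ALTER TABLE t1
  DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci,
  MODIFY c1 TEXT CHARACTER SET utf8 COLLATE utf8_general_ci;
```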
This approach changes the default character set for both the table and the target column, while leaving in place any FULLTEXT indexes. It also requires only a single ALTER statement for a given table. A Perl script has been put together to parallelize the ALTER statements and is available at:

It will be added to Percona Tools on Launchpad (or perhaps maatkit, if it proves useful enough) once it is feature complete. Outstanding issues include:

– Proper handling of string foreign keys (currently fails, but you probably shouldn’t be using strings as foreign keys anyway …)
– Allow throttling of the number of threads created (currently creates one per table)

Comments (25)

  • Shlomi Noach

    Shameless plug:

    Allow me to suggest using oak-modify-charset, part of the openark kit, which modifies a single column of text.
    At present the utility supports a single column change, but as it matures, it will also support streaming of columns, aggregated into one ALTER statement.
    You may also use --print-only, and get the ALTER command without executing it, so you can still tailor it using your favorite Perl/awk/python script.

    Converting an entire table is undesired, in my opinion, since, in addition to said issue, it converts all columns in that table. What if I have a CHAR(32) column for some md5 checksum? I wouldn’t want that to grow to 96 UTF8 bytes.

    March 17, 2009 at 9:29 pm
  • Mrten

    You should store things like md5 and sha1 as binary data in the first place, as it is wasteful (two times to be precise) to store it in hex, with no tangible benefits. Likewise for IP-addresses (the ones you’re not searching for, at least).
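    For reference, the binary storage described here can be sketched like so (hypothetical table name):

    ```sql
    -- MD5 is 128 bits = 16 raw bytes, vs. CHAR(32) of hex text:
    CREATE TABLE digests (
      d BINARY(16)
    );
    INSERT INTO digests VALUES (UNHEX(MD5('hello')));
    SELECT HEX(d) FROM digests;  -- hex-encode only for display
    ```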

    March 18, 2009 at 3:47 am
  • Jakub Vrána

    There are two kinds of character set conversion:

    1. If the columns are declared as latin1 and the data are really stored in latin1. Then a simple MODIFY can convert everything.

    2. If the columns are declared as latin1 but the data are stored in utf8. This is quite common for applications created with MySQL < 4.1 or for applications which do not use SET CHARACTER SET or SET NAMES. A simple MODIFY would mangle the data, and conversion to binary and then to utf8 is necessary.
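    A sketch of the two cases against a hypothetical column c1:

    ```sql
    -- Case 1: declared latin1, data really latin1 -- a direct MODIFY converts:
    ALTER TABLE t1 MODIFY c1 TEXT CHARACTER SET utf8;

    -- Case 2: declared latin1, data actually utf8 -- go through binary so the
    -- bytes are reinterpreted rather than re-encoded (and mangled):
    ALTER TABLE t1 MODIFY c1 BLOB;
    ALTER TABLE t1 MODIFY c1 TEXT CHARACTER SET utf8;
    ```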

    March 18, 2009 at 3:49 am
  • Shlomi Noach


    I partially agree. As exceptions, take a look at mysql.user: the hashed passwords are textual, not binary. Also, I believe indexing a textual column is easier to handle – but that may be a minor issue.

    With regard to IP addresses – I completely agree. I even wrote a post about it a few months ago.


    March 18, 2009 at 1:41 pm
  • Shlomi Noach


    In any case, other examples can be email addresses, local file names (hopefully ASCII), mount points, airport abbreviations, etc. There are quite a few ASCII-only texts, which was my point.


    March 18, 2009 at 9:18 pm
  • Marki

    I’ve just migrated around 300 databases from 4.0 to 5.0. It was not easy, because every database could use a different charset…
    1) Know the charset of database
    2) Export database from mysql 4.0 server
    3) Create new empty database on mysql 5.0 with correct default charset
    4) Import data into mysql 5.0 server, but prepend data with SET NAMES:
    (echo "SET NAMES $chset;"; cat $path/$db.sql) | mysql newdb -hnewserver

    March 19, 2009 at 2:54 am
  • Pavel

    What happens if a column has mixed latin1 and utf8 at the same time (left from MySQL 4)?
    Does MySQL detect that right?

    March 23, 2009 at 3:54 am
  • Ryan Lowe

    @shlomi Had I seen that script sooner, I may have used it! convert_charset has a --columns parameter, so it is possible to do one-and-only-one column at a time.

    @jakub & @pavel This script assumes latin1 data is stored in latin1 columns :)

    March 23, 2009 at 7:30 am
  • Shlomi Noach


    Cheers. There’s room for more than one utility per task!


    March 23, 2009 at 8:36 am
  • Gil

    In a multi-master replication setup, if I broke replication and ran this script on the passive master, then started replication from active->passive, would the queries fail or do other unwanted things? Are the binary logs charset-sensitive during replication?

    March 25, 2009 at 9:08 am
  • Shlomi Noach

    If you have a utf8 column on a master, and a latin1 column on a slave, and try to replicate, you may get corrupted data if you insert true utf8 text into the master. The slave will not be able to store utf8 characters and will truncate them.

    March 25, 2009 at 9:51 am
  • Gil

    Thanks @Shlomi

    My question was more about what happens if you have latin1 on the master and utf8 on the slave. Same problem?

    March 25, 2009 at 9:53 am
  • Shlomi Noach


    good question. Will try to set up the configuration and test.

    March 25, 2009 at 10:26 am
  • nadavkav

    Thank you ! for this valuable post.
    But it was a little too late for me, after trying one of those suggestions on the internet
    similar to the #2 method category for converting text to binary and back to text again.
    We use Mahara (an ePortfolio framework) that was installed with the wrong charset
    and got filled with data very quickly, before I had the chance to notice this issue.
    Eventually (after the #2 method did not work properly) I had to manually change the relevant
    DB fields 🙁
    A good ending to this story is that everything works fine now 🙂

    April 3, 2009 at 10:32 am
  • bfarber

    While the general information in the article is useful, there are two issues I’m spotting (since I recently undertook the same task).

    1) Many databases store text of any charset in latin1 columns, a legacy of MySQL 4 using latin1 as the default. While the text stores and displays just fine from a front-end application querying it, if you just change the character set, I would assume MySQL is going to literally try to convert from latin1 to utf8, and in that case, since the text isn’t actually latin1, it gets mangled.

    2) I’ve seen many applications that store serialized data in the database (from PHP serialize() function). This method would break that serialized string, making it impossible to deserialize it later.

    Just some things to point out.

    December 30, 2009 at 4:14 pm
  • Chris Huntley

    Hi Ryan,
    My blog is on a WordPress platform. Do you know if I’ll have to make the upgrade to utf8? Thanks.

    May 25, 2010 at 9:52 am
  • Pradeep Jindal

    Thanks for this great post. Here’s my attempt on handling FK:

    January 29, 2013 at 7:24 am
  • Tom

    I get a whole bunch of these when I run the script on windows:

    DBD::mysql::db quote failed: handle 2 is owned by thread cb0f8 not current thread 4fc00b8 (handles can’t be shared between threads and your driver may need a CLONE method added) at line 281, line 1.

    April 9, 2013 at 10:36 am
  • Mike

    Forgive me if I’m wrong, but isn’t mediumtext larger than text? If that is the case, that should not result in any truncation.

    May 9, 2013 at 9:45 am
  • sun


    Your convert_charset Perl script still works excellently (and the code also looks beautiful from a pure code standpoint ;)), and is especially required for big data, as well as replication scenarios – as you outlined already. Unlike the openark (oak) script mentioned above, it is really handy to have all table column conversions automated.

    To my surprise, there’s only 1 unmodified copy of your script on github, and also nowhere else on the net. It hasn’t been added to the Percona Tools yet.

    Time to give it a proper home? 🙂


    May 15, 2013 at 3:22 pm
  • Ryan


    Glad to hear the script worked for you! It wasn’t deemed generally useful enough to put into the Percona Tools set (or Percona-Toolkit) 😀

    I’ve given it a “proper home” at for the time being … Pull Requests are welcome 🙂

    — Ryan Lowe

    May 15, 2013 at 3:39 pm
  • George Lund

    Approach number 1 is correct.

    There was no truncation: MEDIUMTEXT is bigger than TEXT.

    MySQL does this to make sure there’s *no* truncation, as explained in the manual –

    “For a column that has a data type of VARCHAR or one of the TEXT types, CONVERT TO CHARACTER SET will change the data type as necessary to ensure that the new column is long enough to store as many characters as the original column.” [followed by a complete explanation of why TEXT converts to MEDIUMTEXT]

    Maybe something changed since this article was originally written, but right now it seems downright misleading.

    June 12, 2013 at 9:27 am
  • David

    Correct me if I’m wrong, but this script doesn’t convert the data within the table? For instance, I have a table using latin1_swedish that I want to convert to UTF8. The table gets converted, but the data still remains as is.

    October 21, 2013 at 8:04 am
  • raul

    hi friends

    I have a database in latin1_swedish_ci and need to migrate to utf8 with utf8_general_ci

    I used
    ./convert_charset --host= --user=root --password=mypass --database=database --charset=utf8 --collate=utf8_general_ci

    but this did not work

    nothing displays correctly on the web, even though in mysql the change shows utf8_general_ci

    and everything looks fine

    can anyone help me?

    November 15, 2014 at 9:06 pm
  • Stijn de Witt

    Thanks for this informative post.

    It is worth mentioning however that MySQL’s UTF-8 is not actually UTF-8… Think of it as ‘MySQL’s 3-byte version of a subset of UTF-8’. What the rest of the world calls UTF-8, and which takes up to a maximum of 4 bytes per character to encode, MySQL actually calls ‘utf8mb4’.
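    For illustration, a column that must hold 4-byte characters (emoji, many supplementary-plane characters) has to be declared utf8mb4 (hypothetical table name):

    ```sql
    -- MySQL's utf8 stores at most 3 bytes per character; utf8mb4 is full UTF-8:
    CREATE TABLE t4 (
      c1 TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci
    );
    ```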


    May 31, 2016 at 6:16 am
