utf8 data on latin1 tables: converting to utf8 without downtime or double encoding

Here’s a problem many of us have encountered: you have a latin1 table defined like below, and your application is storing utf8 data to the column over a latin1 connection. Obviously, double encoding occurs. Now your development team has decided to use utf8 everywhere, but during the conversion you can afford little to no downtime, and the stored data must remain valid.
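The original table definition did not survive here; a minimal sketch of the kind of table in question (the table name `t` and column `c` are assumptions based on the examples later in the post) might look like:

```sql
-- Hypothetical latin1 table; the application writes utf8 bytes
-- into `c` over a latin1 connection, causing double encoding.
CREATE TABLE t (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  c TEXT,
  KEY (c(100))
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
```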

One approach, described in the manual, is to convert the TEXT column to BLOB, then convert the table character set to utf8 and the c column back to TEXT, like this:
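A sketch of that two-step conversion (table and column names assumed from the rest of the post):

```sql
-- Step 1: convert the column to BLOB. Going from latin1 to binary
-- involves no byte conversion, so the stored utf8 bytes are untouched.
ALTER TABLE t MODIFY c BLOB;

-- Step 2: convert the column back to TEXT, now tagged as utf8.
-- Going from binary to utf8 again copies the bytes as-is.
ALTER TABLE t MODIFY c TEXT CHARACTER SET utf8;
```

The key point is that passing through a binary type prevents MySQL from "converting" bytes that were never really latin1 in the first place.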

All good so far. But if the tables are big enough that these conversions would disrupt your application, this becomes a problem. The old trick of using slaves now comes into play. In a nutshell, you first convert the TEXT column to BLOB on a slave, then switch your application to use this slave as its primary. Any utf8 data written via replication or by the application is stored and retrieved without issues, whether the connection character set is latin1 or anything else, because binary types do not have a character set at all. Let me show you:
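The demonstration on the slave might look something like this (a sketch; the sample value and session output are assumptions, not the post’s original listing):

```sql
-- On the slave only: convert the column to BLOB.
ALTER TABLE t MODIFY c BLOB;

-- Whatever the connection character set, the bytes go in and come
-- out unmodified, because BLOB has no character set.
SET NAMES latin1;
INSERT INTO t (c) VALUES ('résumé');  -- utf8 bytes from the app
SELECT c FROM t;

SET NAMES utf8;
SELECT c FROM t;  -- the same bytes come back, readable as utf8
```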

As you can see, while the column is still a BLOB, I have no problems reading or storing utf8 data in it. Once your application has been reconfigured to use this slave with a utf8 connection, you can convert the column back to TEXT and the table to the utf8 character set.
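Once the application talks utf8 to the promoted slave, the final conversion is just the reverse of the first step (a sketch, same assumed names):

```sql
-- Tag the column as utf8 TEXT; the stored bytes are already valid
-- utf8, so nothing is re-encoded.
ALTER TABLE t MODIFY c TEXT CHARACTER SET utf8;

-- Optionally change the table default for future columns as well.
ALTER TABLE t DEFAULT CHARACTER SET utf8;
```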

Some caveats, though. First, you cannot replicate the BLOB or utf8 data back to the original latin1 column, so you will have to discard the original master; replicating back would simply double-encode the data again. Second, while the column is a BLOB (or any other binary type) and is indexed, you may see different results when the index is used. This is because binary data is indexed and sorted by raw byte values, not by collation rules. Here is an example:
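For instance, accented characters sort differently when compared byte by byte than under a utf8 collation (a sketch; the actual rows depend on your data):

```sql
-- While c is a BLOB, ORDER BY compares raw bytes: 'é' is
-- 0xC3 0xA9 in utf8, which sorts after 'z' (0x7A).
SELECT c FROM t ORDER BY c;  -- byte order, e.g.: a, z, é

-- Once the column is TEXT utf8, the collation applies and
-- 'é' sorts next to 'e'.
ALTER TABLE t MODIFY c TEXT CHARACTER SET utf8;
SELECT c FROM t ORDER BY c;  -- collation order, e.g.: a, é, z
```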

See how the results are now ordered differently?

What’s your utf8 horror? Share with us on the comments below 🙂

UPDATE: This is how the process looks without downtime or extended blocking of the table, but there are other ways. One of them is creating a copy of the original table converted to utf8 and doing an INSERT INTO .. SELECT using the CAST or CONVERT functions, like below.
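A sketch of that copy-and-convert approach (table and column names are assumptions carried over from the earlier examples):

```sql
-- New table with the desired character set.
CREATE TABLE t_utf8 LIKE t;
ALTER TABLE t_utf8 CONVERT TO CHARACTER SET utf8;

-- Copy the rows, casting through binary so the stored utf8 bytes
-- are not converted a second time on the way in.
INSERT INTO t_utf8 (id, c)
SELECT id, CONVERT(CAST(c AS BINARY) USING utf8) FROM t;

-- Finally, swap the tables.
RENAME TABLE t TO t_old, t_utf8 TO t;
```

Note that the INSERT still blocks or races against live writes, so this fits a maintenance window or a setup where writes can be paused or replayed.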

Another method is to copy the FRM file of the same table structure, but in utf8, over your original table’s FRM file. Since the data is already stored as utf8 bytes, you should be able to read it over a utf8 connection. However, you will have to rebuild the indexes on affected columns, as they were originally sorted as latin1. In my tests there was no difference before and after rebuilding the index, so YMMV. To demonstrate, using the same two tables as before: on the filesystem, I replaced t.frm with a copy of x.frm and then ran FLUSH TABLES; afterwards, t looked like this:
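The swap itself happens on the filesystem, outside of MySQL (a sketch; the datadir path is an assumption, and `x` is presumed to be an identical table created with utf8):

```sql
-- In the shell, copy the utf8 table's definition over t's:
--   cp /var/lib/mysql/test/x.frm /var/lib/mysql/test/t.frm
-- Then, back in MySQL, force the table definition to be re-read:
FLUSH TABLES;
SHOW CREATE TABLE t;  -- should now report DEFAULT CHARSET=utf8
```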

Now, attempting to read the data on latin1 connection causes truncation:

But on utf8, I am now able to read it fine:
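The two reads above can be sketched like this:

```sql
SET NAMES latin1;
SELECT c FROM t;  -- utf8 -> latin1 result conversion mangles or
                  -- truncates characters latin1 cannot represent

SET NAMES utf8;
SELECT c FROM t;  -- result charset matches the column; reads cleanly
```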

Rebuilding the secondary key on the c column made no difference in the results either.
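The rebuild itself is a drop-and-add of the key (the index name and prefix length here are assumptions):

```sql
-- Re-create the secondary index so its entries are sorted under
-- the utf8 collation rather than the original latin1 one.
ALTER TABLE t DROP KEY c, ADD KEY c (c(100));
```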

UPDATE: Apparently, the last method will not work for InnoDB tables, because the character set and collation are also stored in the InnoDB data dictionary, as my colleague Alexander Rubin pointed out. But not all is lost: you can still rebuild the table with pt-online-schema-change without blocking it.


Comments (5)

  • Nate

    We had a vaguely similar problem — we had UTF-8 columns, but were somehow passing our UTF-8 data into the database but converting it to Latin-1 while we did so. It was kind of crazy.

    I wrote up my understanding of the problem and how we solved it here: http://www.ridesidecar.com/2013/07/30/of-databases-and-character-encodings/

    October 16, 2013 at 10:44 am
  • Jackson

    I’ve run into this multiple times. So frustrating at times. Not so much a problem anymore.

    October 16, 2013 at 5:39 pm
  • Parisa

    How about VARCHAR and CHAR fields? Do I have to change them to BLOB too?

    November 7, 2013 at 3:05 pm
  • Jervin Real

    @Parisa, in the example above, yes; they can be BINARY or VARBINARY too.

    November 7, 2013 at 5:50 pm
  • Vivek

    My previous data was converted into special characters, and I am unable to convert or change it back into the correct characters.

    mysql> SHOW VARIABLES LIKE 'character\_set\_%';
    +--------------------------+--------+
    | Variable_name            | Value  |
    +--------------------------+--------+
    | character_set_client     | latin1 |
    | character_set_connection | latin1 |
    | character_set_database   | latin1 |
    | character_set_filesystem | binary |
    | character_set_results    | latin1 |
    | character_set_server     | utf8   |
    | character_set_system     | utf8   |
    +--------------------------+--------+
    7 rows in set (0.29 sec)

    Correct text:
    INSERT INTO teachers (subject) VALUES ('Bienvenu à l'École Mondiale de la Bible');

    At mysql command line

    mysql> INSERT INTO teachers (subject) VALUES ('Bienvenu testing_wbsL cole Mondiale de la Bible');

    Completely changed into special characters

    May 22, 2017 at 1:00 pm
