EmergencyEMERGENCY? Get 24/7 Help Now!

utf8 data on latin1 tables: converting to utf8 without downtime or double encoding

 | October 16, 2013 |  Posted In: Insight for DBAs, MySQL

PREVIOUS POST
NEXT POST

Here’s a problem some or most of us have encountered. You have a latin1 table defined like below, and your application is storing utf8 data to the column on a latin1 connection. Obviously, double encoding occurs. Now your development team decided to use utf8 everywhere, but during the process you can only have as little to no downtime while keeping your stored data valid.

One approach here is as described to the manual is to convert the TEXT column into BLOB, then convert the table character set to utf8 and the c column back to TEXT, like this:

All good so far, but, if the tables are too big or big enough to disrupt your application significantly without downtime, this becomes a problem. The old little trick of using slaves now comes into play. In a nutshell, you can convert the TEXT column first on a slave into BLOB, then switch your application to use this slave as its PRIMARY. Any utf8 data written via replication or from the application should be stored and retrieved without issues either via latin1 connection character set or otherwise. This is because the BINARY data type does not really have character sets. Let me show you:

As you can see, while the column is still in BLOB, I have no problems reading or storing utf8 data into it. Now, after your application has been configured to use this slave and use utf8 connection, you can now convert the column and the table back to TEXT and utf8 character set.

Some caveats though, you cannot replicate from BLOB or utf8 back to the latin1 column, so you will have to discard the data from the original master. Doing so will just result in double encoding. Second, while the column is in BLOB or any other BINARY type and this column is indexed, you may experience different results when the index is used. This is because BINARY data is indexed based on their numeric values per bytes not per character strings. Here is an example:

See how the results are now ordered differently?

What’s your utf8 horror? Share with us on the comments below 🙂

UPDATE: This was how the process looks like without downtime or extended table being blocked, but there are other ways. One of them is creating a copy of the original table converted to utf8 and doing an  INSERT INTO .. SELECT  using the  CAST  or  CONVERT  functions like below.

Another method is to copy the FRM file of the same table structure but in  utf8  and replace your original table’s FRM file. Since the data is already stored as utf8, you should be able to read them on utf8 connection. However, you will have to rebuild you indexes based on affected columns as they are sorted as latin1 originally. In my tests though, there was no difference before and after rebuilding the index, so, YMMV. To demonstrate, still the same 2 previous tables – on the filesystem, I replaced  t.frm  with a copy of  x.frm  then did a FLUSH TABLES , afterwards, t looked like this:

Now, attempting to read the data on latin1 connection causes truncation:

But on utf8, I am now able to read it fine:

Rebuilding the secondary key on  c  column has no difference on the results too.

UPDATE: Apparently, the last method will not work for InnoDB tables because the character collation is stored in the data dictionary too as my colleague Alexander Rubin pointed out. But not all how is lost, you can still rebuild the table with pt-online-schema-change without blocking it.

PREVIOUS POST
NEXT POST
Jervin Real

As Senior Consultant, Jervin partners with Percona's customers on building reliable and highly performant MySQL infrastructures while also doing other fun stuff like watching cat videos on the internet. Jervin joined Percona in Apr 2010.

4 Comments

Leave a Reply

 
 

Percona’s widely read Percona Data Performance blog highlights our expertise in enterprise-class software, support, consulting and managed services solutions for both MySQL® and MongoDB® across traditional and cloud-based platforms. The decades of experience represented by our consultants is found daily in numerous and relevant blog posts.

Besides specific database help, the blog also provides notices on upcoming events and webinars.
Want to get weekly updates listing the latest blog posts? Subscribe to our blog now! Submit your email address below and we’ll send you an update every Friday at 1pm ET.

No, thank you. Please do not ask me again.