utf8 data on latin1 tables: converting to utf8 without downtime or double encoding

Here’s a problem many of us have encountered: you have a latin1 table defined like below, and your application is storing utf8 data to the column over a latin1 connection. Obviously, double encoding occurs. Now your development team has decided to use utf8 everywhere, but during the conversion you can afford little to no downtime, and the stored data must remain valid.
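The original table definition did not survive here; a minimal sketch of the kind of table in question (the table name `t` and column `c` are assumptions based on the examples later in the post) might look like:

```sql
-- Hypothetical latin1 table; the application writes utf8 bytes
-- into `c` over a latin1 connection, causing double encoding.
CREATE TABLE t (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  c TEXT,
  KEY (c(100))
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
```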

One approach, described in the manual, is to convert the TEXT column to BLOB, then convert the table character set to utf8 and the c column back to TEXT, like this:
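A sketch of that two-step conversion (table and column names assumed from the rest of the post):

```sql
-- Step 1: convert the column to BLOB. Going from latin1 to binary
-- involves no byte conversion, so the stored utf8 bytes are untouched.
ALTER TABLE t MODIFY c BLOB;

-- Step 2: convert the column back to TEXT, now tagged as utf8.
-- Going from binary to utf8 again copies the bytes as-is.
ALTER TABLE t MODIFY c TEXT CHARACTER SET utf8;
```

The key point is that passing through a binary type prevents MySQL from "converting" bytes that were never really latin1 in the first place.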

All good so far. But if the tables are big enough that these conversions would disrupt your application, this becomes a problem. The old trick of using slaves now comes into play. In a nutshell, you first convert the TEXT column to BLOB on a slave, then switch your application to use this slave as its primary. Any utf8 data written via replication or by the application is stored and retrieved without issues, whether the connection character set is latin1 or anything else, because binary types do not have a character set at all. Let me show you:
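The demonstration on the slave might look something like this (a sketch; the sample value and session output are assumptions, not the post’s original listing):

```sql
-- On the slave only: convert the column to BLOB.
ALTER TABLE t MODIFY c BLOB;

-- Whatever the connection character set, the bytes go in and come
-- out unmodified, because BLOB has no character set.
SET NAMES latin1;
INSERT INTO t (c) VALUES ('résumé');  -- utf8 bytes from the app
SELECT c FROM t;

SET NAMES utf8;
SELECT c FROM t;  -- the same bytes come back, readable as utf8
```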

As you can see, while the column is still a BLOB, I have no problems reading or storing utf8 data in it. Once your application has been reconfigured to use this slave with a utf8 connection, you can convert the column back to TEXT and the table to the utf8 character set.
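Once the application talks utf8 to the promoted slave, the final conversion is just the reverse of the first step (a sketch, same assumed names):

```sql
-- Tag the column as utf8 TEXT; the stored bytes are already valid
-- utf8, so nothing is re-encoded.
ALTER TABLE t MODIFY c TEXT CHARACTER SET utf8;

-- Optionally change the table default for future columns as well.
ALTER TABLE t DEFAULT CHARACTER SET utf8;
```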

Some caveats, though. First, you cannot replicate the BLOB or utf8 data back to the original latin1 column, so you will have to discard the original master; replicating back would simply double-encode the data again. Second, while the column is a BLOB (or any other binary type) and is indexed, you may see different results when the index is used. This is because binary data is indexed and sorted by raw byte values, not by collation rules. Here is an example:
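For instance, accented characters sort differently when compared byte by byte than under a utf8 collation (a sketch; the actual rows depend on your data):

```sql
-- While c is a BLOB, ORDER BY compares raw bytes: 'é' is
-- 0xC3 0xA9 in utf8, which sorts after 'z' (0x7A).
SELECT c FROM t ORDER BY c;  -- byte order, e.g.: a, z, é

-- Once the column is TEXT utf8, the collation applies and
-- 'é' sorts next to 'e'.
ALTER TABLE t MODIFY c TEXT CHARACTER SET utf8;
SELECT c FROM t ORDER BY c;  -- collation order, e.g.: a, é, z
```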

See how the results are now ordered differently?

What’s your utf8 horror? Share with us on the comments below 🙂

UPDATE: This is how the process looks without downtime or extended blocking of the table, but there are other ways. One of them is creating a copy of the original table converted to utf8 and doing an INSERT INTO .. SELECT using the CAST or CONVERT functions, like below.
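A sketch of that copy-and-convert approach (table and column names are assumptions carried over from the earlier examples):

```sql
-- New table with the desired character set.
CREATE TABLE t_utf8 LIKE t;
ALTER TABLE t_utf8 CONVERT TO CHARACTER SET utf8;

-- Copy the rows, casting through binary so the stored utf8 bytes
-- are not converted a second time on the way in.
INSERT INTO t_utf8 (id, c)
SELECT id, CONVERT(CAST(c AS BINARY) USING utf8) FROM t;

-- Finally, swap the tables.
RENAME TABLE t TO t_old, t_utf8 TO t;
```

Note that the INSERT still blocks or races against live writes, so this fits a maintenance window or a setup where writes can be paused or replayed.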

Another method is to copy the FRM file of the same table structure, but in utf8, over your original table’s FRM file. Since the data is already stored as utf8 bytes, you should be able to read it over a utf8 connection. However, you will have to rebuild the indexes on affected columns, as they were originally sorted as latin1. In my tests there was no difference before and after rebuilding the index, so YMMV. To demonstrate, using the same two tables as before: on the filesystem, I replaced t.frm with a copy of x.frm and then ran FLUSH TABLES; afterwards, t looked like this:
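The swap itself happens on the filesystem, outside of MySQL (a sketch; the datadir path is an assumption, and `x` is presumed to be an identical table created with utf8):

```sql
-- In the shell, copy the utf8 table's definition over t's:
--   cp /var/lib/mysql/test/x.frm /var/lib/mysql/test/t.frm
-- Then, back in MySQL, force the table definition to be re-read:
FLUSH TABLES;
SHOW CREATE TABLE t;  -- should now report DEFAULT CHARSET=utf8
```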

Now, attempting to read the data on latin1 connection causes truncation:

But on utf8, I am now able to read it fine:
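The two reads above can be sketched like this:

```sql
SET NAMES latin1;
SELECT c FROM t;  -- utf8 -> latin1 result conversion mangles or
                  -- truncates characters latin1 cannot represent

SET NAMES utf8;
SELECT c FROM t;  -- result charset matches the column; reads cleanly
```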

Rebuilding the secondary key on the c column made no difference in the results either.
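The rebuild itself is a drop-and-add of the key (the index name and prefix length here are assumptions):

```sql
-- Re-create the secondary index so its entries are sorted under
-- the utf8 collation rather than the original latin1 one.
ALTER TABLE t DROP KEY c, ADD KEY c (c(100));
```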

UPDATE: Apparently, the last method will not work for InnoDB tables, because the character set and collation are also stored in the InnoDB data dictionary, as my colleague Alexander Rubin pointed out. But not all is lost: you can still rebuild the table with pt-online-schema-change without blocking it.


Comments (5)

  • Nate

    We had a vaguely similar problem — we had UTF-8 columns, but were somehow passing our UTF-8 data into the database but converting it to Latin-1 while we did so. It was kind of crazy.

    I wrote up my understanding of the problem and how we solved it here: http://www.ridesidecar.com/2013/07/30/of-databases-and-character-encodings/

    October 16, 2013 at 10:44 am
  • Jackson

    I’ve run into this multiple times. So frustrating at times. Not so much a problem anymore.

    October 16, 2013 at 5:39 pm
  • Parisa

    How about VARCHAR and CHAR fields? Do I have to change them to BLOB too?

    November 7, 2013 at 3:05 pm
  • Jervin Real

    @Parisa, in the example above, yes; they can be BINARY or VARBINARY too.

    November 7, 2013 at 5:50 pm
  • Vivek

    My previous data was converted into special characters, and I am unable to convert or change it back into the correct characters.

    mysql> SHOW VARIABLES LIKE 'character\_set\_%';
    +--------------------------+--------+
    | Variable_name            | Value  |
    +--------------------------+--------+
    | character_set_client     | latin1 |
    | character_set_connection | latin1 |
    | character_set_database   | latin1 |
    | character_set_filesystem | binary |
    | character_set_results    | latin1 |
    | character_set_server     | utf8   |
    | character_set_system     | utf8   |
    +--------------------------+--------+
    7 rows in set (0.29 sec)

    Correct text:
    INSERT INTO teachers (subject) VALUES ('Bienvenu à l'École Mondiale de la Bible');

    At mysql command line

    mysql> INSERT INTO teachers (subject) VALUES ('Bienvenu testing_wbsL cole Mondiale de la Bible');

    Completely changed into special characters

    May 22, 2017 at 1:00 pm
