In this blog post, we’ll look at options for migrating database charsets to utf8mb4.
Migrating charsets, in my opinion, is one of the most tedious tasks in a DBA’s life. There are so many things involved that can screw up our data, making it work is always hard. Sometimes what seems like a trivial task can become a nightmare very easily, and keeps us working for longer than expected.
I’ve recently worked on a case that challenged me with lots of tests due to some existing schema designs that made InnoDB suffer. I’ve decided to write this post to put together some definitive guide to enact charset conversion with minimal downtime and pain.
My approach involves at least one slave for failover and logical/physical backup operations to make sure that data is loaded properly using the right charset.
In this case, we are moving from latin1 (default until MySQL 8.0.0) to utf8mb4 (new default from 8.0.1). In this post, Lefred refers to this change and some safety checks for upgrading. For our change, an important thing to consider: Latin1 charset stores one byte per character, while utf8mb4 can store up to four bytes per character. This change definitely impacts the disk usage, but also makes us hit some limits that I describe later in the plan.
So let’s put out hands in action. First, let’s create a slave using a fresh (non-locking) backup. Remember that these operations are designed to minimize downtime and reduce any potential impact on our production server.
If you already have a slave that can act as a master replacement then you can skip this section. In our source server, configure binlog_format and flush logs to start with fresh binary logs:
|
1 2 |
set global binlog_format=MIXED; flush logs; |
Start a streaming backup using Percona Xtrabackup through netcat in the destination server:
|
1 |
nc -l 9999 | cat - > /dest/folder/backup.tar |
and in our source server:
|
1 |
innobackupex --stream=tar ./ | nc dest_server_ip 9999 |
Once the backup is done, untar and restore the backup. Then set up the slave:
|
1 2 3 4 5 6 7 8 9 10 |
tar -xif /dest/folder/backup.tar innobackupex --apply-log /dest/folder/ /etc/init.d/mysql stop rm -rf /var/lib/mysql/ mv /dest/folder/* /var/lib/mysql/ chown -R mysql:mysql /var/lib/mysql /etc/init.d/mysql start cat /var/lib/mysql/xtrabackup_binlog_info change master to master_host='master_host', master_user='master_user, master_password='master_password', master_log_file='file_printed_in_xtrabackup_binlog_info', master_log_pos=pos_printed_in_xtrabackup_binlog_info; start slave; |
Now that we have the slave ready, we prepare our dataset by running two mysqldump processes so we have data and schemas in separate files. You can also run this operation using MyDumper or mysqlpump, but I will keep it easy:
|
1 2 |
STOP SLAVE; SHOW SLAVE STATUS; |
Write down this output, as it may be needed later:
|
1 2 |
mysqldump --skip-set-charset --no-data --databases `mysql --skip-column-names -e "SELECT GROUP_CONCAT(schema_name SEPARATOR ' ') FROM information_schema.schemata WHERE schema_name NOT IN ('mysql','performance_schema','information_schema');"` > schema.sql mysqldump --skip-set-charset -n -t --databases `mysql --skip-column-names -e "SELECT GROUP_CONCAT(schema_name SEPARATOR ' ') FROM information_schema.schemata WHERE schema_name NOT IN ('mysql','performance_schema','information_schema');"` > data.sql |
Notice that I’m passing a command as an argument to –databases to dump all databases but mysql, performance_schema and information_schema (hack stolen from this post, with credit to Ronald Bradford). It is very important to keep the replication stopped, as we will resume replication after fully converting our charset.
Now we have to convert our data to utf8mb4. This is easy as we just need to touch the schema.sql file by running few commands:
|
1 2 3 |
sed -e -i "s/DEFAULT CHARACTER SET latin1/DEFAULT CHARACTER SET utf8mb4/g" schema.sql sed -e -i "s/DEFAULT CHARSET=latin1/DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci /" schema.sql sed -e -i "s/SET character_set_client = utf8/SET character_set_client = utf8mb4/" schema.sql |
Can this be a one-liner? Yes, but I’m not a good basher. 🙂
Now we are ready to restore our data using new encoding:
|
1 2 3 |
mysql -e "set global innodb_large_prefix=1;" mysql < schema.sql mysql < data.sql |
Notice I’ve enabled the variable innodb_large_prefix. This is important because InnoDB limits index prefixes to 768 bytes by default. If you have an index based in a varchar(255) data type, you will get an error because the new charset exceeds this limit (up to four bytes per character goes beyond 1000 bytes) unless you limit the index prefix. To avoid issues during data load, we enable this variable to extend the limit to 3072 bytes.
Finally, let’s configure our server and restart it to make sure to set new defaults properly. In the my.cnf file, add:
|
1 2 3 4 5 6 7 |
[client] default-character-set=utf8mb4 [mysqld] skip-slave-start character-set-server=utf8mb4 collation-server=utf8mb4_unicode_ci innodb_large_prefix=1 |
Let’s resume replication after the restart, and make sure everything is ok:
|
1 2 |
START SLAVE; SHOW SLAVE STATUS; |
Ok, at this point we should be fine and our data should be already converted to utf8mb4. So far so good. The next step is to failover applications to use the new server, and rebuild the old server using a fresh backup using xtrabackup as described above.
There are few things we need to consider now before converting this slave into master:
|
1 2 |
ALTER TABLE big_table MODIFY latin1_column varbinary(250); ALTER TABLE big_table MODIFY latin1_column varchar(250) CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci; |
It’s important to understand why we need the double conversion from latin1 to varbinary to utf8mb4. This post from Marco Tusa largely explains this.
I wrote this guide from my experience working with these type of projects. If you Google a bit, you’ll find a lot of resources that make this work, along with different solutions. What I’ve tried to present here is a guide to help you deal with these projects. Normally, we have to perform these changes in existing datasets that sometimes are big enough to prevent any work getting done via ALTER TABLE commands. Hopefully, you find this useful!
When it comes to choosing the right database for your needs, you want to ensure you’re making the best decision for your company. Read our white paper on which database to choose for guidance on this decision. The paper compares basic use cases for MongoDB®, MySQL®, and PostgreSQL®, three of the most popular open source database options available today.
If you’re looking for a way to ensure database replication and multi-master capabilities for your on-premise architecture, read our solution brief: Always Online, Critical Website with Percona XtraDB Cluster. The brief outlines how to achieve this solution using a Percona XtraDB Cluster on-premises architecture that is built upon Percona Server for MySQL and Percona’s enhanced Codership Galera library.
Good post. But in mysql galera cluster its not straight forward like this and tools like ‘pt-online-schema-change’ is required to complete the tasks slowly. Also many mysql installations are still running with old innodb_file_format “Antelope”. “innodb_large_prefix” does not work with old format. So if this is the case then all tables first needs to be converted to new format “Barracuda” with “row_format=Dynamic” and then innodb_large_prefix would work and let you complete the other steps. Mysqldump would work for master/slave type topologies but i think “pt-online-schema-change” is more effective for cluster type topologies where “redundancies” can not be built timely.
Hi,
I assume the intent of this blog is to migrate correct latin1 data to utf8mb4 (as opposed to Marco Tusa’s post which is about having UTF8 encoded data in a latin1 column and how to resolve that).
If my assumption is right, you should not use the intermediate step via VARBINARY since that will fail if you have non-ascii characters (latin1 characters > 0x7f) in your table. Like this:
mysql> create table foo (v varchar(10) charset latin1);
Query OK, 0 rows affected (0.02 sec)
mysql> insert into foo values(‘å’);
Query OK, 1 row affected (0.09 sec)
mysql> select hex(v) from foo;
+——–+
| hex(v) |
+——–+
| E5 |
+——–+
1 row in set (0.00 sec)
mysql> alter table foo modify column v varbinary(10);
Query OK, 1 row affected (0.05 sec)
Records: 1 Duplicates: 0 Warnings: 0
mysql> alter table foo modify column v varchar(10) charset utf8mb4;
ERROR 1366 (HY000): Incorrect string value: ‘xE5’ for column ‘v’ at row 1
mysql>
When going via VARBINARY you don’t get the conversion to utf8mb4 that you want. The correct approach would be to go directly to utf8mb4 like this:
mysql> create table foo (v varchar(10) charset latin1);
Query OK, 0 rows affected (0.08 sec)
mysql> insert into foo values(‘å’);
Query OK, 1 row affected (0.10 sec)
mysql> select hex(v) from foo;
+——–+
| hex(v) |
+——–+
| E5 |
+——–+
1 row in set (0.00 sec)
mysql> alter table foo modify column v varchar(10) charset utf8mb4;
Query OK, 1 row affected (0.07 sec)
Records: 1 Duplicates: 0 Warnings: 0
mysql> select hex(v) from foo;
+——–+
| hex(v) |
+——–+
| C3A5 |
+——–+
1 row in set (0.00 sec)
0xE5 is the latin1 representation of ‘å’ and 0xC3A5 is the UTF-8 representation of the same character.
Also, If this is done with MySQL 8.0, I would recommend using the default collation for utf8mb4 which is utf8mb4_0900_ai_ci.
Bernt
Bernt ,
Thanks for reply. I did not intend to first convert to “varbinary”. i did not understand as why “varchar” to “varbinary” conversion required ?. in my opinion this step ” alter table foo modify column v varchar(10) charset utf8mb4;” should work as mentioned regardless the character set of source data.It should work in “latin1” as well as “utf8”.
Hi,
after drying your given sed command i noticed that you mixed up parameter.
It works only with
sed -i -e “…”
Greetings
Thomas