Migrating Database Charsets to utf8mb4: A Story from the Trenches

In this blog post, we’ll look at options for migrating database charsets to utf8mb4.

Migrating charsets is, in my opinion, one of the most tedious tasks in a DBA’s life. There are so many things involved that can screw up our data that making it work is always hard. Sometimes what seems like a trivial task can easily become a nightmare and keep us working longer than expected.

I’ve recently worked on a case that challenged me with lots of tests, due to some existing schema designs that made InnoDB suffer. I’ve decided to write this post to put together a definitive guide for enacting a charset conversion with minimal downtime and pain.

  • First disclosure: I can’t emphasize enough that you always need to back up your data. If something goes wrong, you can always roll things back, as long as you keep a healthy set of backups.
  • Second disclosure: A backup can’t be considered a good backup until you test it, so I can’t emphasize enough that running regular backups and also performing regular restore tests is a must-do task for staying on the safe side.
  • Third and last disclosure: I’m not pretending to present the best or only way to do this exercise. This is the way I consider the easiest and least painful to perform a charset conversion with minimal downtime.

My approach involves at least one slave for failover and logical/physical backup operations to make sure that data is loaded properly using the right charset.

In this case, we are moving from latin1 (the default until MySQL 8.0.0) to utf8mb4 (the new default from 8.0.1). In this post, Lefred refers to this change and some safety checks for upgrading. For our change, there is an important thing to consider: the latin1 charset stores one byte per character, while utf8mb4 can store up to four bytes per character. This change definitely impacts disk usage, but it also makes us hit some limits that I describe later in the plan.
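As a quick illustration of the width difference, LENGTH() returns the byte length of a string (a minimal sketch, assuming a connection charset that can represent the character):

mysql> SELECT LENGTH(CONVERT('å' USING latin1)) AS latin1_bytes,
    ->        LENGTH(CONVERT('å' USING utf8mb4)) AS utf8mb4_bytes;
+--------------+---------------+
| latin1_bytes | utf8mb4_bytes |
+--------------+---------------+
|            1 |             2 |
+--------------+---------------+

Characters outside the Basic Multilingual Plane, such as emoji, take the full four bytes in utf8mb4.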

So let’s put our hands into action. First, let’s create a slave using a fresh (non-locking) backup. Remember that these operations are designed to minimize downtime and reduce any potential impact on our production server.

If you already have a slave that can act as a master replacement, you can skip this section. On our source server, configure binlog_format and flush the logs to start with fresh binary logs:
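Something along these lines; the binlog_format value depends on what your replication setup requires (ROW here is just an example, and a global change only affects new sessions):

mysql> SET GLOBAL binlog_format = 'ROW';
mysql> FLUSH LOGS;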

Start a streaming backup using Percona XtraBackup through netcat. On the destination server:
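A minimal sketch of the listener side; the port and target path are placeholders, and some netcat builds want -p before the port number:

$ nc -l 9999 > /data/backups/backup.tar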

and on our source server:
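A sketch assuming an XtraBackup version that supports tar streaming; the destination hostname and port are placeholders:

$ innobackupex --stream=tar /tmp | nc destination_host 9999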

Once the backup is done, untar and restore it. Then set up the slave:
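Roughly as follows. Paths, credentials and the replication coordinates are placeholders; the -i flag is needed because XtraBackup tar streams contain null blocks, and the real coordinates come from the xtrabackup_binlog_info file included in the backup:

$ tar -xif /data/backups/backup.tar -C /var/lib/mysql
$ innobackupex --apply-log /var/lib/mysql
$ chown -R mysql:mysql /var/lib/mysql
$ systemctl start mysql

mysql> CHANGE MASTER TO
    ->   MASTER_HOST='source_host',
    ->   MASTER_USER='repl',
    ->   MASTER_PASSWORD='repl_password',
    ->   MASTER_LOG_FILE='mysql-bin.000001',
    ->   MASTER_LOG_POS=154;
mysql> START SLAVE;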

Now that we have the slave ready, we prepare our dataset by running two mysqldump processes, so that we have data and schemas in separate files. You can also run this operation using MyDumper or mysqlpump, but I will keep it simple:
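A sketch of the two dumps; file names are placeholders, and replication is stopped first so that both files reflect the same consistent position:

mysql> STOP SLAVE;

$ mysqldump --no-data --routines --triggers --databases $(mysql -NBe "SELECT GROUP_CONCAT(schema_name SEPARATOR ' ') FROM information_schema.schemata WHERE schema_name NOT IN ('mysql','performance_schema','information_schema')") > schema.sql
$ mysqldump --no-create-info --skip-triggers --databases $(mysql -NBe "SELECT GROUP_CONCAT(schema_name SEPARATOR ' ') FROM information_schema.schemata WHERE schema_name NOT IN ('mysql','performance_schema','information_schema')") > data.sql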

Write down the replication coordinates, as they will be needed later:
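In particular, the Relay_Master_Log_File and Exec_Master_Log_Pos fields of the slave status tell you exactly where in the master’s binary logs this slave stopped:

mysql> SHOW SLAVE STATUS\G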

Notice that I’m passing a command as an argument to --databases, to dump all databases except mysql, performance_schema and information_schema (a hack stolen from this post, with credit to Ronald Bradford). It is very important to keep replication stopped, as we will only resume it after fully converting our charset.

Now we have to convert our data to utf8mb4. This is easy, as we just need to touch the schema.sql file by running a few commands:
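A minimal sketch with sed. The exact patterns depend on what appears in your dump, so grep the file for latin1 first; the target collation is a choice (utf8mb4_unicode_ci here):

$ sed -i 's/CHARSET=latin1/CHARSET=utf8mb4/g' schema.sql
$ sed -i 's/CHARACTER SET latin1/CHARACTER SET utf8mb4/g' schema.sql
$ sed -i 's/COLLATE=latin1_swedish_ci/COLLATE=utf8mb4_unicode_ci/g' schema.sql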

Can this be a one-liner? Yes, but I’m not a good basher. 🙂

Now we are ready to restore our data using the new encoding:
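Something like this, assuming the files produced above; the schema load is where index-size errors would surface, hence the variable is set first:

mysql> SET GLOBAL innodb_large_prefix = ON;

$ mysql --default-character-set=utf8mb4 < schema.sql
$ mysql --default-character-set=utf8mb4 < data.sql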

Notice I’ve enabled the variable innodb_large_prefix. This is important because InnoDB limits index prefixes to 767 bytes by default. If you have an index on a varchar(255) column, you will get an error: at up to four bytes per character, the new charset can need up to 1020 bytes for that index, which exceeds the limit, unless you shorten the index prefix. To avoid issues during the data load, we enable this variable to extend the limit to 3072 bytes (this also requires the Barracuda file format and a DYNAMIC or COMPRESSED row format).
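For illustration, this is the kind of failure you would hit on a hypothetical table with innodb_large_prefix disabled (with innodb_strict_mode off, it may be downgraded to a warning and a 767-byte prefix index instead):

mysql> CREATE TABLE t (c VARCHAR(255), KEY (c)) ENGINE=InnoDB CHARSET=utf8mb4;
ERROR 1071 (42000): Specified key was too long; max key length is 767 bytes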

Finally, let’s configure our server and restart it to make sure the new defaults are set properly. In the my.cnf file, add:
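A sketch of the relevant settings; the collation is a choice, and the innodb_* lines only matter on servers still defaulting to the Antelope file format (see the first comment below):

[mysqld]
character_set_server  = utf8mb4
collation_server      = utf8mb4_unicode_ci
innodb_large_prefix   = ON
innodb_file_format    = Barracuda
innodb_file_per_table = ON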

Let’s resume replication after the restart, and make sure everything is ok:
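A sketch, assuming the slave still points at the right coordinates (otherwise repoint it with CHANGE MASTER TO using the coordinates written down earlier):

mysql> START SLAVE;
mysql> SHOW SLAVE STATUS\G

Look for Slave_IO_Running: Yes, Slave_SQL_Running: Yes, and a Seconds_Behind_Master value heading toward zero.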

OK, at this point we should be fine and our data should already be converted to utf8mb4. So far, so good. The next step is to fail applications over to the new server, and then rebuild the old server from a fresh backup taken with XtraBackup, as described above.

There are a few things we need to consider before promoting this slave to master:

  1. Make sure you have properly configured your applications. Charset and collation values can be set at the session level, so if your connection driver is set to a different charset you may end up mixing encodings in your data.
  2. Make sure the new slave is powerful enough to handle traffic from the master.
  3. Test everything before failing over production applications. Going from latin1 to utf8mb4 should be straightforward, as utf8mb4 includes all the characters in latin1. But let’s face it: things can go wrong, and we are trying to avoid surprises.
  4. Last but not least, all of these procedures were done on a relatively small/medium-sized dataset (around 600GB). But this conversion (done via logical backups) is more difficult when talking about big databases (i.e., on the order of TBs). In those cases, the procedure still helps but might not be good enough due to time restrictions (imagine loading a 1TB table from a logical dump; it takes ages). If you happen to face such a conversion, here is a short, high-level plan:
    • Convert only the smaller tables on the slave (i.e., those smaller than 500MB), following the same procedure. Make sure to exclude the big tables from the dump using the --ignore-table parameter in mysqldump.
    • Convert the bigger tables via ALTER TABLE, as follows:
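      A sketch for a hypothetical table and column. As discussed below this list (and in the comments), the VARBINARY intermediate step is only appropriate when the latin1 column actually holds UTF-8 bytes; correctly encoded latin1 data can be converted with a single ALTER directly to utf8mb4:

      mysql> ALTER TABLE big_table MODIFY COLUMN description VARBINARY(255);
      mysql> ALTER TABLE big_table MODIFY COLUMN description VARCHAR(255) CHARACTER SET utf8mb4;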
    • Once everything is finished, you can resume replication. Notice that you can run the dump/conversion/restore in parallel with altering the bigger tables, which should reduce the total conversion time.

It’s important to understand why we need the double conversion from latin1 to varbinary to utf8mb4. This post from Marco Tusa explains it at length.

Conclusion

I wrote this guide from my experience working on these types of projects. If you Google a bit, you’ll find a lot of resources about this kind of work, along with different solutions. What I’ve tried to present here is a guide to help you deal with these projects. Normally, we have to perform these changes on existing datasets that are sometimes big enough to rule out doing the work via ALTER TABLE commands. Hopefully, you find this useful!

Comments (3)

  • Vishal Saxena

    Good post. But in a MySQL Galera cluster it is not as straightforward as this, and tools like pt-online-schema-change are required to complete the task slowly. Also, many MySQL installations are still running with the old innodb_file_format “Antelope”, and innodb_large_prefix does not work with the old format. If that is the case, all tables first need to be converted to the new format “Barracuda” with row_format=DYNAMIC; then innodb_large_prefix will work and let you complete the other steps. mysqldump works for master/slave topologies, but I think pt-online-schema-change is more effective for cluster topologies where redundancies cannot be built in time.

    April 11, 2018 at 10:57 am
  • Bernt

    Hi,
    I assume the intent of this blog is to migrate correct latin1 data to utf8mb4 (as opposed to Marco Tusa’s post which is about having UTF8 encoded data in a latin1 column and how to resolve that).

    If my assumption is right, you should not use the intermediate step via VARBINARY, since that will fail if you have non-ASCII characters (latin1 characters > 0x7f) in your table. Like this:

    mysql> create table foo (v varchar(10) charset latin1);
    Query OK, 0 rows affected (0.02 sec)

    mysql> insert into foo values('å');
    Query OK, 1 row affected (0.09 sec)

    mysql> select hex(v) from foo;
    +--------+
    | hex(v) |
    +--------+
    | E5     |
    +--------+
    1 row in set (0.00 sec)

    mysql> alter table foo modify column v varbinary(10);
    Query OK, 1 row affected (0.05 sec)
    Records: 1 Duplicates: 0 Warnings: 0

    mysql> alter table foo modify column v varchar(10) charset utf8mb4;
    ERROR 1366 (HY000): Incorrect string value: '\xE5' for column 'v' at row 1
    mysql>

    When going via VARBINARY you don’t get the conversion to utf8mb4 that you want. The correct approach would be to go directly to utf8mb4 like this:

    mysql> create table foo (v varchar(10) charset latin1);
    Query OK, 0 rows affected (0.08 sec)

    mysql> insert into foo values('å');
    Query OK, 1 row affected (0.10 sec)

    mysql> select hex(v) from foo;
    +--------+
    | hex(v) |
    +--------+
    | E5     |
    +--------+
    1 row in set (0.00 sec)

    mysql> alter table foo modify column v varchar(10) charset utf8mb4;
    Query OK, 1 row affected (0.07 sec)
    Records: 1 Duplicates: 0 Warnings: 0

    mysql> select hex(v) from foo;
    +--------+
    | hex(v) |
    +--------+
    | C3A5   |
    +--------+
    1 row in set (0.00 sec)

    0xE5 is the latin1 representation of ‘å’ and 0xC3A5 is the UTF-8 representation of the same character.

    Also, if this is done with MySQL 8.0, I would recommend using the default collation for utf8mb4, which is utf8mb4_0900_ai_ci.

    Bernt

    April 13, 2018 at 6:31 am
  • Vishal Saxena

    Bernt,

    Thanks for the reply. I did not intend to first convert to varbinary, and I did not understand why the varchar-to-varbinary conversion is required. In my opinion, the step “alter table foo modify column v varchar(10) charset utf8mb4;” should work as mentioned regardless of the character set of the source data. It should work with latin1 as well as utf8.

    April 13, 2018 at 12:04 pm
