Migrating Database Charsets to utf8mb4: A Story from the Trenches

In this blog post, we’ll look at options for migrating database charsets to utf8mb4.

Migrating charsets is, in my opinion, one of the most tedious tasks in a DBA’s life. There are so many things involved that can screw up our data that making it work is always hard. Sometimes what seems like a trivial task can easily become a nightmare and keep us working longer than expected.

I’ve recently worked on a case that challenged me with lots of tests, due to some existing schema designs that made InnoDB suffer. I’ve decided to write this post to put together a definitive guide for enacting a charset conversion with minimal downtime and pain.

  • First disclosure: I can’t emphasize enough that you always need to back up your data. If something goes wrong, you can always roll things back, as long as you keep a healthy set of backups.
  • Second disclosure: A backup can’t be considered a good backup until you test it, so I can’t emphasize enough that running regular backups and also performing regular restore tests is a must-do task for staying on the safe side.
  • Third and last disclosure: I’m not pretending to present the best or only way to do this exercise. This is the way I consider the easiest and least painful to perform a charset conversion with minimal downtime.

My approach involves at least one slave for failover and logical/physical backup operations to make sure that data is loaded properly using the right charset.

In this case, we are moving from latin1 (the default until MySQL 8.0.0) to utf8mb4 (the new default from 8.0.1). In this post, Lefred refers to this change and some safety checks for upgrading. For our change, there is an important thing to consider: the latin1 charset stores one byte per character, while utf8mb4 can store up to four bytes per character. This change definitely impacts disk usage, but it also makes us hit some limits that I describe later in the plan.
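As a quick illustration of the width difference, LENGTH() returns the byte length of a string (a minimal sketch, assuming a connection charset that can represent the character):

mysql> SELECT LENGTH(CONVERT('å' USING latin1)) AS latin1_bytes,
    ->        LENGTH(CONVERT('å' USING utf8mb4)) AS utf8mb4_bytes;
+--------------+---------------+
| latin1_bytes | utf8mb4_bytes |
+--------------+---------------+
|            1 |             2 |
+--------------+---------------+

Characters outside the Basic Multilingual Plane, such as emoji, take the full four bytes in utf8mb4.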

So let’s put our hands into action. First, let’s create a slave using a fresh (non-locking) backup. Remember that these operations are designed to minimize downtime and reduce any potential impact on our production server.

If you already have a slave that can act as a master replacement, you can skip this section. On our source server, configure binlog_format and flush the logs to start with fresh binary logs:
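Something along these lines; the binlog_format value depends on what your replication setup requires (ROW here is just an example, and a global change only affects new sessions):

mysql> SET GLOBAL binlog_format = 'ROW';
mysql> FLUSH LOGS;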

Start a streaming backup using Percona XtraBackup through netcat. On the destination server:
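A minimal sketch of the listener side; the port and target path are placeholders, and some netcat builds want -p before the port number:

$ nc -l 9999 > /data/backups/backup.tar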

and on our source server:
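A sketch assuming an XtraBackup version that supports tar streaming; the destination hostname and port are placeholders:

$ innobackupex --stream=tar /tmp | nc destination_host 9999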

Once the backup is done, untar and restore it. Then set up the slave:
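Roughly as follows. Paths, credentials and the replication coordinates are placeholders; the -i flag is needed because XtraBackup tar streams contain null blocks, and the real coordinates come from the xtrabackup_binlog_info file included in the backup:

$ tar -xif /data/backups/backup.tar -C /var/lib/mysql
$ innobackupex --apply-log /var/lib/mysql
$ chown -R mysql:mysql /var/lib/mysql
$ systemctl start mysql

mysql> CHANGE MASTER TO
    ->   MASTER_HOST='source_host',
    ->   MASTER_USER='repl',
    ->   MASTER_PASSWORD='repl_password',
    ->   MASTER_LOG_FILE='mysql-bin.000001',
    ->   MASTER_LOG_POS=154;
mysql> START SLAVE;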

Now that we have the slave ready, we prepare our dataset by running two mysqldump processes, so that we have data and schemas in separate files. You can also run this operation using MyDumper or mysqlpump, but I will keep it simple:
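A sketch of the two dumps; file names are placeholders, and replication is stopped first so that both files reflect the same consistent position:

mysql> STOP SLAVE;

$ mysqldump --no-data --routines --triggers --databases $(mysql -NBe "SELECT GROUP_CONCAT(schema_name SEPARATOR ' ') FROM information_schema.schemata WHERE schema_name NOT IN ('mysql','performance_schema','information_schema')") > schema.sql
$ mysqldump --no-create-info --skip-triggers --databases $(mysql -NBe "SELECT GROUP_CONCAT(schema_name SEPARATOR ' ') FROM information_schema.schemata WHERE schema_name NOT IN ('mysql','performance_schema','information_schema')") > data.sql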

Write down the replication coordinates, as they will be needed later:
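In particular, the Relay_Master_Log_File and Exec_Master_Log_Pos fields of the slave status tell you exactly where in the master’s binary logs this slave stopped:

mysql> SHOW SLAVE STATUS\G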

Notice that I’m passing a command as an argument to --databases, to dump all databases except mysql, performance_schema and information_schema (a hack stolen from this post, with credit to Ronald Bradford). It is very important to keep replication stopped, as we will only resume it after fully converting our charset.

Now we have to convert our data to utf8mb4. This is easy, as we just need to touch the schema.sql file by running a few commands:
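A minimal sketch with sed. The exact patterns depend on what appears in your dump, so grep the file for latin1 first; the target collation is a choice (utf8mb4_unicode_ci here):

$ sed -i 's/CHARSET=latin1/CHARSET=utf8mb4/g' schema.sql
$ sed -i 's/CHARACTER SET latin1/CHARACTER SET utf8mb4/g' schema.sql
$ sed -i 's/COLLATE=latin1_swedish_ci/COLLATE=utf8mb4_unicode_ci/g' schema.sql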

Can this be a one-liner? Yes, but I’m not a good basher. 🙂

Now we are ready to restore our data using the new encoding:
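Something like this, assuming the files produced above; the schema load is where index-size errors would surface, hence the variable is set first:

mysql> SET GLOBAL innodb_large_prefix = ON;

$ mysql --default-character-set=utf8mb4 < schema.sql
$ mysql --default-character-set=utf8mb4 < data.sql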

Notice I’ve enabled the variable innodb_large_prefix. This is important because InnoDB limits index prefixes to 767 bytes by default. If you have an index on a varchar(255) column, you will get an error: at up to four bytes per character, the new charset can need up to 1020 bytes for that index, which exceeds the limit, unless you shorten the index prefix. To avoid issues during the data load, we enable this variable to extend the limit to 3072 bytes (this also requires the Barracuda file format and a DYNAMIC or COMPRESSED row format).
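For illustration, this is the kind of failure you would hit on a hypothetical table with innodb_large_prefix disabled (with innodb_strict_mode off, it may be downgraded to a warning and a 767-byte prefix index instead):

mysql> CREATE TABLE t (c VARCHAR(255), KEY (c)) ENGINE=InnoDB CHARSET=utf8mb4;
ERROR 1071 (42000): Specified key was too long; max key length is 767 bytes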

Finally, let’s configure our server and restart it to make sure the new defaults are set properly. In the my.cnf file, add:
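A sketch of the relevant settings; the collation is a choice, and the innodb_* lines only matter on servers still defaulting to the Antelope file format (see the first comment below):

[mysqld]
character_set_server  = utf8mb4
collation_server      = utf8mb4_unicode_ci
innodb_large_prefix   = ON
innodb_file_format    = Barracuda
innodb_file_per_table = ON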

Let’s resume replication after the restart, and make sure everything is ok:
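A sketch, assuming the slave still points at the right coordinates (otherwise repoint it with CHANGE MASTER TO using the coordinates written down earlier):

mysql> START SLAVE;
mysql> SHOW SLAVE STATUS\G

Look for Slave_IO_Running: Yes, Slave_SQL_Running: Yes, and a Seconds_Behind_Master value heading toward zero.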

OK, at this point we should be fine and our data should already be converted to utf8mb4. So far, so good. The next step is to fail applications over to the new server, and then rebuild the old server from a fresh backup taken with XtraBackup, as described above.

There are a few things we need to consider before promoting this slave to master:

  1. Make sure you have properly configured your applications. Charset and collation values can be set at the session level, so if your connection driver is set to a different charset you may end up mixing encodings in your data.
  2. Make sure the new slave is powerful enough to handle traffic from the master.
  3. Test everything before failing over production applications. Going from latin1 to utf8mb4 should be straightforward, as utf8mb4 includes all the characters in latin1. But let’s face it: things can go wrong, and we are trying to avoid surprises.
  4. Last but not least, all of these procedures were done on a relatively small/medium-sized dataset (around 600GB). But this conversion (done via logical backups) is more difficult when talking about big databases (i.e., on the order of TBs). In those cases, the procedure still helps but might not be good enough due to time restrictions (imagine loading a 1TB table from a logical dump; it takes ages). If you happen to face such a conversion, here is a short, high-level plan:
    • Convert only the smaller tables on the slave (i.e., those smaller than 500MB), following the same procedure. Make sure to exclude the big tables from the dump using the --ignore-table parameter in mysqldump.
    • Convert the bigger tables via ALTER TABLE, as follows:
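      A sketch for a hypothetical table and column. As discussed below this list (and in the comments), the VARBINARY intermediate step is only appropriate when the latin1 column actually holds UTF-8 bytes; correctly encoded latin1 data can be converted with a single ALTER directly to utf8mb4:

      mysql> ALTER TABLE big_table MODIFY COLUMN description VARBINARY(255);
      mysql> ALTER TABLE big_table MODIFY COLUMN description VARCHAR(255) CHARACTER SET utf8mb4;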
    • Once everything is finished, you can resume replication. Notice that you can run the dump/conversion/restore in parallel with altering the bigger tables, which should reduce the total conversion time.

It’s important to understand why we need the double conversion from latin1 to varbinary to utf8mb4. This post from Marco Tusa explains it at length.

Conclusion

I wrote this guide from my experience working on these types of projects. If you Google a bit, you’ll find a lot of resources about this kind of work, along with different solutions. What I’ve tried to present here is a guide to help you deal with these projects. Normally, we have to perform these changes on existing datasets that are sometimes big enough to rule out doing the work via ALTER TABLE commands. Hopefully, you find this useful!

Comments (3)

  • Vishal Saxena

    Good post. But in a MySQL Galera cluster it is not as straightforward as this, and tools like pt-online-schema-change are required to complete the task slowly. Also, many MySQL installations are still running with the old innodb_file_format “Antelope”, and innodb_large_prefix does not work with the old format. If that is the case, all tables first need to be converted to the new format “Barracuda” with row_format=DYNAMIC; then innodb_large_prefix will work and let you complete the other steps. mysqldump works for master/slave topologies, but I think pt-online-schema-change is more effective for cluster topologies where redundancies cannot be built in time.

    April 11, 2018 at 10:57 am
  • Bernt

    Hi,
    I assume the intent of this blog is to migrate correct latin1 data to utf8mb4 (as opposed to Marco Tusa’s post which is about having UTF8 encoded data in a latin1 column and how to resolve that).

    If my assumption is right, you should not use the intermediate step via VARBINARY, since that will fail if you have non-ASCII characters (latin1 characters > 0x7f) in your table. Like this:

    mysql> create table foo (v varchar(10) charset latin1);
    Query OK, 0 rows affected (0.02 sec)

    mysql> insert into foo values('å');
    Query OK, 1 row affected (0.09 sec)

    mysql> select hex(v) from foo;
    +--------+
    | hex(v) |
    +--------+
    | E5     |
    +--------+
    1 row in set (0.00 sec)

    mysql> alter table foo modify column v varbinary(10);
    Query OK, 1 row affected (0.05 sec)
    Records: 1 Duplicates: 0 Warnings: 0

    mysql> alter table foo modify column v varchar(10) charset utf8mb4;
    ERROR 1366 (HY000): Incorrect string value: '\xE5' for column 'v' at row 1
    mysql>

    When going via VARBINARY you don’t get the conversion to utf8mb4 that you want. The correct approach would be to go directly to utf8mb4 like this:

    mysql> create table foo (v varchar(10) charset latin1);
    Query OK, 0 rows affected (0.08 sec)

    mysql> insert into foo values('å');
    Query OK, 1 row affected (0.10 sec)

    mysql> select hex(v) from foo;
    +--------+
    | hex(v) |
    +--------+
    | E5     |
    +--------+
    1 row in set (0.00 sec)

    mysql> alter table foo modify column v varchar(10) charset utf8mb4;
    Query OK, 1 row affected (0.07 sec)
    Records: 1 Duplicates: 0 Warnings: 0

    mysql> select hex(v) from foo;
    +--------+
    | hex(v) |
    +--------+
    | C3A5   |
    +--------+
    1 row in set (0.00 sec)

    0xE5 is the latin1 representation of ‘å’ and 0xC3A5 is the UTF-8 representation of the same character.

    Also, if this is done with MySQL 8.0, I would recommend using the default collation for utf8mb4, which is utf8mb4_0900_ai_ci.

    Bernt

    April 13, 2018 at 6:31 am
  • Vishal Saxena

    Bernt,

    Thanks for the reply. I did not intend to first convert to varbinary, and I did not understand why the varchar-to-varbinary conversion is required. In my opinion, the step “alter table foo modify column v varchar(10) charset utf8mb4;” should work as mentioned regardless of the character set of the source data. It should work with latin1 as well as utf8.

    April 13, 2018 at 12:04 pm
