Troubleshooting Issues with MySQL Character Sets Q & A

In this blog post, I answer the questions from the Troubleshooting Issues with MySQL Character Sets webinar.

First, I want to thank everybody for attending the March 9 MySQL character sets troubleshooting webinar. The recording and slides for the webinar are available here. Below is the list of your questions that I wasn’t able to answer during the webinar, with responses:

Q: We’ve had some issues converting tables from utf8 to utf8mb4. Our issue was that the collation we wanted to use – utf8mb4_unicode_520_ci – did not distinguish between spaces and ideographic (Japanese) spaces, so we were getting unique constraint violations for the varchar fields when two entries had the same text with different kinds of spaces. Have you seen this problem, and is there a workaround? We were wondering if this was related to the mother-child character bug with this collation.

A: Unfortunately, this issue exists for many languages. For example, in Russian you cannot distinguish “е” and “ё” if you use the standard case-insensitive collations for utf8 or utf8mb4. However, there is hope for Japanese: Oracle announced that they will implement new language-specific utf8mb4 collations in MySQL 8.0. I already see 21 new collations in my 8.0.0 installation.

In 8.0.1 they promised new case-sensitive and Japanese collations. Please see this blog post for details. The note about the planned Japanese support is at the end.

Meanwhile, I can only suggest that you implement your own collation as described here. You may use the utf8_russian_ci collation from Bug #51976 as an example.

Although the user manual does not list utf8mb4 as a character set for which it’s possible to create new collations, you can actually do it. What you need to do is add an entry for the utf8mb4 character set and the new collation to Index.xml, then restart the server.
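As a rough illustration of such an entry, here is a minimal sketch modeled on the Russian example mentioned above. The collation name utf8mb4_russian_ci, the ID 1033 (picked from the 1024–2047 range that MySQL reserves for user-defined collations), and the exact tailoring rule are assumptions for illustration only. Adapt them to your own case, for example by adding a rule that places the ideographic space after the regular space.

```xml
<!-- Hypothetical entry for Index.xml in the server's character sets directory. -->
<!-- The name and id below are illustrative; the id must not clash with an existing collation. -->
<charset name="utf8mb4">
  <collation name="utf8mb4_russian_ci" id="1033">
    <rules>
      <!-- Start from Е (U+0415), then sort ё (U+0451) after it as a distinct letter -->
      <!-- (primary difference), with Ё (U+0401) differing from ё only by case (tertiary). -->
      <reset>\u0415</reset><p>\u0451</p><t>\u0401</t>
    </rules>
  </collation>
</charset>
```

After restarting the server, the new collation should show up in the output of SHOW COLLATION if the entry was parsed correctly. If it does not, check the error log for messages about Index.xml.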

Q: If receiving utf8 on latin1 charset