September 1, 2014

CentOS 5.8 users: your UTF-8 data is in peril with Perl MySQL

CentOS 5.8 and earlier use the Perl module DBD::mysql v3.0007, which has a bug that causes Perl not to flag UTF-8 data as UTF-8. Presuming that the MySQL table or column uses UTF-8 and the Perl MySQL connection also uses UTF-8, a correct system returns:

PV = 0x9573840 "\343\203\213 \303\250"\0 [UTF8 "\x{30cb} \x{e8}"]

That’s a Devel::Peek dump of a Perl scalar variable, which clearly shows that Perl has recognized and flagged the data as UTF-8. With DBD::mysql v3.0007, however, an incorrect system returns:

PV = 0x92df9a8 "\343\203\213 \303\250"\0

Notice that it’s the same byte sequence (in octal), but there’s no UTF-8 flag.  As far as Perl is concerned, this is Latin1 data.
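To check the flag without reading a full dump, something like the following sketch could be used. It does not touch MySQL at all: the byte string is copied from the dumps above, and Encode's decode stands in for what a correct driver would hand back. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Same bytes as in the dumps above: UTF-8 for U+30CB, a space, and U+00E8.
my $bytes = "\343\203\213 \303\250";

# Stand-in for what a correct driver returns: decoded characters, UTF8 flag on.
my $chars = decode('UTF-8', $bytes);

# utf8::is_utf8 reports whether Perl's internal UTF8 flag is set on a scalar.
printf "bytes: flag=%d length=%d\n", (utf8::is_utf8($bytes) ? 1 : 0), length $bytes;
printf "chars: flag=%d length=%d\n", (utf8::is_utf8($chars) ? 1 : 0), length $chars;
```

The unflagged scalar reports a length of 6 (Perl counts bytes), while the flagged one reports 3 (Perl counts characters).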

What does this mean for you?  In general, it means that Perl could corrupt the data by treating UTF-8 data as Latin1 data.  If the program doesn’t alter the data, then the problem is “overlooked” and compensated for by the fact that MySQL knows that the data is UTF-8 and treats it as such.  We have found, however, that a program can modify the data without corrupting it, but this is risky and really only works by luck, so you shouldn’t rely on it.
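One way the corruption can show up is double encoding. The sketch below is an illustration under assumptions, not a reproduction of the driver bug: it starts from the unflagged bytes shown earlier and encodes them for output, as a program might before printing or writing back. The variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Unflagged UTF-8 bytes, as DBD::mysql 3.0007 would return them.
my $from_db = "\343\203\213 \303\250";

# Perl takes each byte for one Latin1 character, so encoding for output
# encodes the already-encoded bytes a second time.
my $double = encode('UTF-8', $from_db);

# Decoding first gives Perl real characters; encoding then round-trips cleanly.
my $chars = decode('UTF-8', $from_db);
my $clean = encode('UTF-8', $chars);

printf "original bytes: %d\n", length $from_db;
printf "double-encoded: %d\n", length $double;
printf "round trip ok:  %s\n", ($clean eq $from_db ? "yes" : "no");
```

The double-encoded string grows from 6 to 11 bytes because each of the five high bytes becomes two. The decode step here only shows what a correct driver does for you; decoding by hand would itself double-decode once DBD::mysql is upgraded.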

I’d like to clarify two things. First, DBD::mysql v3.0007 was released in September 2006, but this very old problem still exists today because CentOS 5 is still a popular Linux distro. So this isn’t “breaking news”, and Perl and DBD::mysql have handled UTF-8 correctly for nearly the last decade. Second, just a reminder: all Percona Toolkit tools that connect to MySQL have a --charset option and an “A” DSN option for setting the character set.

In conclusion, if you

  1. Run CentOS 5
  2. Have UTF-8 data in MySQL
  3. Use Perl to access that data
  4. Have not upgraded DBD::mysql (perl-DBD-MySQL)

then your UTF-8 data is in peril with Perl.  The solution is simple: upgrade to any newer version of DBD::mysql (except 4.014).
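A rough way to script that check is sketched below. The helper is_risky_dbd_mysql is hypothetical, and it flags only the two releases named in this post (3.0007 and 4.014) by plain string comparison; it makes no claim about any other version.

```perl
use strict;
use warnings;

# Hypothetical helper: true only for the two releases this post warns about.
sub is_risky_dbd_mysql {
    my ($version) = @_;
    return 0 unless defined $version;
    return ($version eq '3.0007' || $version eq '4.014') ? 1 : 0;
}

# Load DBD::mysql only if it is present, so the script also runs where it is absent.
if (eval { require DBD::mysql; 1 }) {
    my $v = $DBD::mysql::VERSION;
    my $verdict = is_risky_dbd_mysql($v)
        ? "upgrade recommended"
        : "not one of the flagged releases";
    print "DBD::mysql $v: $verdict\n";
}
else {
    print "DBD::mysql is not installed\n";
}
```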

Comments

  1. Jeff says:

    So really you’re saying this is a bug with any RHEL 5 based distro (CentOS, Oracle, Scientific, etc.)? Is there an upstream (RHEL) bug for this issue that you’re aware of?

  2. Jeff says:

    Also, you say “5.8 and earlier”, but I think this also applies to current 5.9 releases since I don’t see a perl-DBD-MySQL update since 2008.

  3. Jeff,

    I’m not aware of an upstream (RHEL) bug for this, but I haven’t looked. I have often wondered why DBD::mysql 3.0007_1, _2, or 3.0008, in which the bug is fixed, isn’t provided for 5.8, since those versions should be completely compatible with 3.0007 (though, granted, the _1 and _2 are dev releases).

    As for CentOS 5.9, it may still be using DBD::mysql 3.0007 too; I didn’t check (we test on CentOS 5.8). To see, run:

    perl -MDBD::mysql -e 'print "$DBD::mysql::VERSION\n"'

    In fact, the problem is OS-agnostic: it applies anywhere DBD::mysql 3.0007 is used. I single out CentOS 5 because it’s a popular platform.
