How to avoid hash collisions when using MySQL’s CRC32 function

PREVIOUS POST
NEXT POST

Percona Toolkit’s  pt-table-checksum performs an online replication consistency check by executing checksum queries on the master, which produces different results on replicas that are inconsistent with the master – and the tool pt-table-sync synchronizes data efficiently between MySQL tables.

The tools by default use the CRC32. Other good choices include MD5 and SHA1. If you have installed the FNV_64 user-defined function, pt-table-sync will detect it and prefer to use it, because it is much faster than the built-ins. You can also use MURMUR_HASH if you’ve installed that user-defined function. Both of these are distributed with Maatkit. For details please see the tool’s documentation.

Below are test cases similar to what you might have encountered. By using the table checksum we can confirm that the two tables are identical and useful to verify a slave server is in sync with its master. The following test cases with pt-table-checksum and pt-table-sync will help you use the tools more accurately.

For example, in a master-slave setup we have a table with a primary key on column “a” and a unique key on column “b”. Here the master and slave tables are not in sync and the tables are having two identical values and two distinct values. The pt-table-checksum tool should be able to identify the difference between master and slave and the pt-table-sync in this case should sync the tables with two REPLACE queries.

Case 1:  Non-cryptographic Hash function (CRC32) and the Hash collision.

The tables in the source and target have two different columns and in general way of thinking the tools should identify the difference. But the below scenarios explain how the tools can be wrongly used and how to avoid them – and make things more consistent and reliable when using the tools in your production.

The tools by default use the CRC32 checksums and it is prone to hash collisions. In the below case the non-cryptographic function (CRC32) is not able to identify the two distinct values as the function generates the same value even we are having the distinct values in the tables.

Narrowed down to BIT_XOR:

Case 2: As the tools are not able to identify the difference, let us add a new row to the slave and check if the tools are able to identify the distinct values. So I am adding a new row (5,5) to the slave.

Well, apparently the tools are now able to identify the newly added row in the slave and the two other rows having the difference.

Case 3: Advantage of Cryptographic Hash functions (Ex: Secure MD5)

As such let us make the tables as in the case1 and ask the tools to use the cryptographic (secure MD5) hash functions instead the usual non-cryptographic function. The default CRC32 function provides no security due to their simple mathematical structure and too prone to hash collisions but the MD5 provides better level of integrity. So let us try with the –function=md5 and see the result.

Narrowed down to BIT_XOR:

The MD5 did the trick and solved the problem. See the BIT_XOR result for the MD5 given above and the function is able to identify the distinct values in the tables and resulted with the different crc values. The MD5 (Message-Digest algorithm 5) is a well-known cryptographic hash function with a 128-bit resulting hash value. MD5 is widely used in security-related applications, and is also frequently used to check the integrity but MD5() and SHA1() are very CPU-intensive with slower checksumming if chunk-time is included.

 

PREVIOUS POST
NEXT POST

Share this post

Comments (2)

  • Matthew Boehm Reply

    “Both of these are distributed with Maatkit.”

    In fact, all 3 (FNV1A, FNV_64, MURMUR_HASH) are distributed with Percona Server. Most people just don’t bother installing them even though it is recommended as a post-install step when installing via yum.

    Also, pt-table-checksum doesn’t detect and prefer the other functions. It searches for them in this order: CRC32 FNV1A_64 FNV_64 MURMUR_HASH MD5 SHA1 and uses the first one it finds. So, unfortunately, even if you have the others installed, unless you specify one, pt-t-c will always use CRC32. I submitted a bug/FR on this a while back (2012!) (https://bugs.launchpad.net/percona-toolkit/+bug/1059732) and was overrulled on changing the preference.

    October 14, 2014 at 12:20 pm
  • Chuck Mire Reply

    I wrote a CRC32 program that will calculate the CRC32 value for either a file or a string. My program has a unique feature in that you can use either the normal seed value of EDB88320 Hex to compute the CRC32 value AND/OR you can use any other seed you want. IF you have a hash collision with the normal seed value, you will NOT have a hash collision on that same file or string when a different seed is used. So just keep track of more than one CRC32 value using a different seed.

    My original DOS version as well as a standalone Windows executable (with source code) is available here:

    http://qb45.org/files.php?action=search2&keywords=crc32&funcbtn=Search%21

    August 6, 2016 at 7:58 pm

Leave a Reply