How to avoid hash collisions when using MySQL’s CRC32 function

October 13, 2014
Author
Arunjith Aravindan
Share this Post:

Percona Toolkit’s  pt-table-checksum performs an online replication consistency check by executing checksum queries on the master, which produces different results on replicas that are inconsistent with the master – and the tool pt-table-sync synchronizes data efficiently between MySQL tables.

The tools by default use the CRC32. Other good choices include MD5 and SHA1. If you have installed the FNV_64 user-defined function, pt-table-sync will detect it and prefer to use it, because it is much faster than the built-ins. You can also use MURMUR_HASH if you’ve installed that user-defined function. Both of these are distributed with Maatkit. For details please see the tool’s documentation.

Below are test cases similar to what you might have encountered. By using the table checksum we can confirm that the two tables are identical and useful to verify a slave server is in sync with its master. The following test cases with pt-table-checksum and pt-table-sync will help you use the tools more accurately.

For example, in a master-slave setup we have a table with a primary key on column “a” and a unique key on column “b”. Here the master and slave tables are not in sync and the tables are having two identical values and two distinct values. The pt-table-checksum tool should be able to identify the difference between master and slave and the pt-table-sync in this case should sync the tables with two REPLACE queries.

Case 1:  Non-cryptographic Hash function (CRC32) and the Hash collision.

The tables in the source and target have two different columns and in general way of thinking the tools should identify the difference. But the below scenarios explain how the tools can be wrongly used and how to avoid them – and make things more consistent and reliable when using the tools in your production.

The tools by default use the CRC32 checksums and it is prone to hash collisions. In the below case the non-cryptographic function (CRC32) is not able to identify the two distinct values as the function generates the same value even we are having the distinct values in the tables.

Narrowed down to BIT_XOR:

Case 2: As the tools are not able to identify the difference, let us add a new row to the slave and check if the tools are able to identify the distinct values. So I am adding a new row (5,5) to the slave.

Well, apparently the tools are now able to identify the newly added row in the slave and the two other rows having the difference.

Case 3: Advantage of Cryptographic Hash functions (Ex: Secure MD5)

As such let us make the tables as in the case1 and ask the tools to use the cryptographic (secure MD5) hash functions instead the usual non-cryptographic function. The default CRC32 function provides no security due to their simple mathematical structure and too prone to hash collisions but the MD5 provides better level of integrity. So let us try with the –function=md5 and see the result.

Narrowed down to BIT_XOR:

The MD5 did the trick and solved the problem. See the BIT_XOR result for the MD5 given above and the function is able to identify the distinct values in the tables and resulted with the different crc values. The MD5 (Message-Digest algorithm 5) is a well-known cryptographic hash function with a 128-bit resulting hash value. MD5 is widely used in security-related applications, and is also frequently used to check the integrity but MD5() and SHA1() are very CPU-intensive with slower checksumming if chunk-time is included.

 

0 0 votes
Article Rating
Subscribe
Notify of
guest

0 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments

Far
Enough.

Said no pioneer ever.
MySQL, PostgreSQL, InnoDB, MariaDB, MongoDB and Kubernetes are trademarks for their respective owners.
© 2026 Percona All Rights Reserved