Replication checksums in MySQL 5.6

MySQL 5.6 has an impressive list of improvements. Among them, replication checksums caught my attention as it seems that many people misunderstand the real added value of this new feature. I heard people think that with replication checksums, data integrity between the master and its replicas is now enforced. As we’ll see, it’s not that easy.

First, here are a few common reasons why data integrity may be broken (the list is not exhaustive):

  • writes executed on a replica instead of the master
  • non deterministic queries
  • bad use of replication filters
  • rollback of transactions mixing transactional and non-transactional tables

The real problem is that replication may keep on running without error, not letting you know that all servers no longer have the same data.

Will replication checksums detect such problems? Unfortunately, no, they won’t.

That being said, the reasons I mentioned above are not the only ones that can break data integrity. One fairly common issue is binary log or relay log corruption. Will replication checksums help you in this case? Yes, they will!

Replication checksums on replicas

Checksums on slaves are controlled by the slave_sql_verify_checksum variable. Let’s first disable them:

We’ll now simulate a relay log corruption by manually editing it before the event is executed on the slave. The easiest way is to stop the SQL thread:

Now let’s write something on the master:

and corrupt the relay log (I changed sakila to sakilb):

Let’s restart the SQL thread and see what we have on the slave:

sakila on the master, sakilb on the slave: data integrity is broken, but SHOW SLAVE STATUS doesn’t show any problem. This is bad, but expected.

We’ll now turn on replication checksums on the slave:

and check that the master is writing checksums for binary log events:

We also want to discard our manually-edited relay log file (as editing such a file manually can lead to further corruption):

And after repeating the same steps as above (stop sql thread, create sakila2 database on the master, corrupt the relay log, restart the sql thread), we now have a nice error message:

Great! Now we’re warned that something went wrong. It’s also much easier to fix the problem right now (either manually or with pt-table-checksum/pt-table-sync) than several days or several weeks later, as a single inconsistency may easily translate to massive corruption over time.

If the error message looks too cryptic for you, you can use the new --verify-binlog-checksum option of mysqlbinlog:

Replication checksums on the master

We can enable them with the following command (they’re disabled by default):

Now if we have corruption in the binary log, we will still be able to write on the master. One way to show the corruption is to use SHOW BINLOG EVENTS:

But on slaves, the corruption will be obvious as replication will stop with an error: