September 21, 2014

Replication checksums in MySQL 5.6

MySQL 5.6 has an impressive list of improvements. Among them, replication checksums caught my attention as it seems that many people misunderstand the real added value of this new feature. I heard people think that with replication checksums, data integrity between the master and its replicas is now enforced. As we’ll see, it’s not that easy.

First, here are a few common reasons why data integrity may be broken (the list is not exhaustive):

  • writes executed on a replica instead of the master
  • non deterministic queries
  • bad use of replication filters
  • rollback of transactions mixing transactional and non-transactional tables

The real problem is that replication may keep on running without error, not letting you know that all servers no longer have the same data.

Will replication checksums detect such problems? Unfortunately, no, they won’t.

That being said, the reasons I mentioned above are not the only ones that can break data integrity. One fairly common issue is binary log or relay log corruption. Will replication checksums help you in this case? Yes, they will!

Replication checksums on replicas

Checksums on slaves are controlled by the slave_sql_verify_checksum variable. Let’s first disable them:

We’ll now simulate a relay log corruption by manually editing it before the event is executed on the slave. The easiest way is to stop the SQL thread:

Now let’s write something on the master:

and corrupt the relay log (I changed sakila to sakilb):

Let’s restart the SQL thread and see what we have on the slave:

sakila on the master, sakilb on the slave: data integrity is broken, but SHOW SLAVE STATUS doesn’t show any problem. This is bad, but expected.

We’ll now turn on replication checksums on the slave:

and check that the master is writing checksums for binary log events:

We also want to discard our manually-edited relay log file (as editing such a file manually can lead to further corruption):

And after repeating the same steps as above (stop sql thread, create sakila2 database on the master, corrupt the relay log, restart the sql thread), we now have a nice error message:

Great! Now we’re warned that something went wrong. It’s also much easier to fix the problem right now (either manually or with pt-table-checksum/pt-table-sync) than several days or several weeks later, as a single inconsistency may easily translate to massive corruption over time.

If the error message looks too cryptic for you, you can use the new --verify-binlog-checksum option of mysqlbinlog:

Replication checksums on the master

We can enable them with the following command (they’re disabled by default):

Now if we have corruption in the binary log, we will still be able to write on the master. One way to show the corruption is to use SHOW BINLOG EVENTS:

But on slaves, the corruption will be obvious as replication will stop with an error:

Conclusion

You should not assume that replication checksums in MySQL 5.6 will ensure data integrity in all scenarios, but they do improve reliability of the whole replication process. This is a great step forward, thank you replication team at Oracle!

About Stephane Combaudon

Stéphane joined Percona in July 2012, after working as a MySQL DBA for leading French companies such as Dailymotion and France Telecom.

In real life, he lives in Paris with his wife and their twin daughters. When not in front of a computer or not spending time with his family, he likes playing chess and hiking.

Comments

  1. Nice feature, but it smells yet another way to slow down the SQL thread.

    Usually a corrupted relay log is quickly caught by the SQL thread, but I suppose if the right cosmic ray hit, it’d only permutate the data without corrupting the full record and this would definitely be a good way to catch that.

  2. Jay: Do we have any data about how much it solves the SQL Thread down ? Modern CPUs can compute crc32 in hardware at rate of gigabytes/sec so I would expect the overhead to be negligible though it would be good to examine sometime as a lot can depend on implementation.

    Stephane: Thanks for posting and to be frank I’m quite disappointed with error message which diagnoses checksum mismatch. I would really like to see the message to be split in several messages which clearly point to what is the issue instead of “one of 25 things could have went wrong”

  3. khan says:

    Just a side note, checksums can be implemented in 5.5 and earlier by using SSL replication as a workaround.

  4. Hello, Neat post. There’s а problem together with your site in internet explorer, could check this?
    IE nonetheless is the maretplace chief and a huge part
    of other ρwople will leave outt yοur fantasstic writіոg
    due to this problem.

  5. Kalyan says:

    replication failed. I/O thread is not working. I got the below error.

    2014-06-17 03:32:42 4694 [ERROR] Slave I/O: Replication event checksum verification failed while reading from network. Error_code: 1743
    2014-06-17 03:32:42 4694 [ERROR] Slave I/O: Relay log write failure: could not queue event from master, Error_code: 1595
    2014-06-17 03:32:42 4694 [Note] Slave I/O thread exiting, read up to log ‘*****-bin.00043′, position 142134355

    I used STOP SLAVE; START SLAVE;
    now the replication is working fine. can i assume that the replication is up to date? if not, what are the things i need to check?

Speak Your Mind

*