Downloads

Blog

Replication checksums in MySQL 5.6

February 15, 2013

Author

Stephane Combaudon

MySQL

Share this Post:

MySQL 5.6 has an impressive list of improvements. Among them, replication checksums caught my attention as it seems that many people misunderstand the real added value of this new feature. I heard people think that with replication checksums, data integrity between the master and its replicas is now enforced. As we’ll see, it’s not that easy.

First, here are a few common reasons why data integrity may be broken (the list is not exhaustive):

- writes executed on a replica instead of the master

- non deterministic queries

- bad use of replication filters

- rollback of transactions mixing transactional and non-transactional tables

The real problem is that replication may keep on running without error, not letting you know that all servers no longer have the same data.

Will replication checksums detect such problems? Unfortunately, no, they won’t.

That being said, the reasons I mentioned above are not the only ones that can break data integrity. One fairly common issue is binary log or relay log corruption. Will replication checksums help you in this case? Yes, they will!

Replication checksums on replicas

Checksums on slaves are controlled by the slave_sql_verify_checksum variable. Let’s first disable them:

slave1 [localhost] {msandbox} ((none)) > set global slave_sql_verify_checksum=OFF;
Query OK, 0 rows affected (0,00 sec)

1 2	slave1 [localhost] {msandbox} ((none)) > set global slave_sql_verify_checksum=OFF; Query OK, 0 rows affected (0,00 sec)

We’ll now simulate a relay log corruption by manually editing it before the event is executed on the slave. The easiest way is to stop the SQL thread:

slave1 [localhost] {msandbox} ((none)) > stop slave sql_thread;
Query OK, 0 rows affected (0,00 sec)

1 2	slave1 [localhost] {msandbox} ((none)) > stop slave sql_thread; Query OK, 0 rows affected (0,00 sec)

Now let’s write something on the master:

master [localhost] {msandbox} ((none)) > create database sakila;
Query OK, 1 row affected (0,00 sec)

1 2	master [localhost] {msandbox} ((none)) > create database sakila; Query OK, 1 row affected (0,00 sec)

and corrupt the relay log (I changed sakila to sakilb):

[...]
@^@^@^F^Cstd^D!^@!^@^H^@^L^Asakila^@sakila^@create database sakilb5      Ý

1 2	[...] @^@^@^F^Cstd^D!^@!^@^H^@^L^Asakila^@sakila^@create database sakilb5 Ý

Let’s restart the SQL thread and see what we have on the slave:

slave1 [localhost] {msandbox} ((none)) > start slave sql_thread;
Query OK, 0 rows affected (0,01 sec)

slave1 [localhost] {msandbox} ((none)) > show databases like 'sakil%';
+-------------------+
| Database (sakil%) |
+-------------------+
| sakilb            |
+-------------------+

slave1 [localhost] {msandbox} ((none)) > start slave sql_thread;

Query OK, 0 rows affected (0,01 sec)

slave1 [localhost] {msandbox} ((none)) > show databases like 'sakil%';

+-------------------+

| Database (sakil%) |

+-------------------+

| sakilb |

+-------------------+

sakila on the master, sakilb on the slave: data integrity is broken, but SHOW SLAVE STATUS doesn’t show any problem. This is bad, but expected.

We’ll now turn on replication checksums on the slave:

slave1 [localhost] {msandbox} ((none)) > set global slave_sql_verify_checksum=ON;
Query OK, 0 rows affected (0,00 sec)

1 2	slave1 [localhost] {msandbox} ((none)) > set global slave_sql_verify_checksum=ON; Query OK, 0 rows affected (0,00 sec)

and check that the master is writing checksums for binary log events:

master [localhost] {msandbox} ((none)) > show global variables like 'binlog_checksum';
+-----------------+-------+
| Variable_name   | Value |
+-----------------+-------+
| binlog_checksum | CRC32 |
+-----------------+-------+

master [localhost] {msandbox} ((none)) > show global variables like 'binlog_checksum';

+-----------------+-------+

| Variable_name | Value |

+-----------------+-------+

| binlog_checksum | CRC32 |

+-----------------+-------+

We also want to discard our manually-edited relay log file (as editing such a file manually can lead to further corruption):

slave1 [localhost] {msandbox} ((none)) > stop slave;
Query OK, 0 rows affected (0,01 sec)

# Use Relay_Master_Log_File and Exec_Master_Log_Pos from SHOW SLAVE STATUS
slave1 [localhost] {msandbox} ((none)) > change master to master_log_file='mysql-bin.000001', master_log_pos=2690;
Query OK, 0 rows affected (0,06 sec)

slave1 [localhost] {msandbox} ((none)) > start slave;
Query OK, 0 rows affected (0,01 sec)

slave1 [localhost] {msandbox} ((none)) > stop slave;

Query OK, 0 rows affected (0,01 sec)

# Use Relay_Master_Log_File and Exec_Master_Log_Pos from SHOW SLAVE STATUS

slave1 [localhost] {msandbox} ((none)) > change master to master_log_file='mysql-bin.000001', master_log_pos=2690;

Query OK, 0 rows affected (0,06 sec)

slave1 [localhost] {msandbox} ((none)) > start slave;

Query OK, 0 rows affected (0,01 sec)

And after repeating the same steps as above (stop sql thread, create sakila2 database on the master, corrupt the relay log, restart the sql thread), we now have a nice error message:

slave1 [localhost] {msandbox} ((none)) > show slave statusG
               [...]
 Slave_IO_Running: Yes
Slave_SQL_Running: No
	       [...]
       Last_Errno: 1594
       Last_Error: Relay log read failure: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave.

slave1 [localhost] {msandbox} ((none)) > show slave statusG

[...]

Slave_IO_Running: Yes

Slave_SQL_Running: No

[...]

Last_Errno: 1594

Last_Error: Relay log read failure: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave.

Great! Now we’re warned that something went wrong. It’s also much easier to fix the problem right now (either manually or with pt-table-checksum/pt-table-sync) than several days or several weeks later, as a single inconsistency may easily translate to massive corruption over time.

If the error message looks too cryptic for you, you can use the new --verify-binlog-checksum option of mysqlbinlog:

# mysqlbinlog --verify-binlog-checksum mysql_sandbox18676-relay-bin.000002
[...]
ERROR: Error in Log_event::read_log_event(): 'Event crc check failed! Most likely there is event corruption.', data_len: 103, event_type: 2
ERROR: Could not read entry at offset 283: Error in log format or read error.
[...]

# mysqlbinlog --verify-binlog-checksum mysql_sandbox18676-relay-bin.000002

[...]

ERROR: Error in Log_event::read_log_event(): 'Event crc check failed! Most likely there is event corruption.', data_len: 103, event_type: 2

ERROR: Could not read entry at offset 283: Error in log format or read error.

[...]

Replication checksums on the master

We can enable them with the following command (they’re disabled by default):

master [localhost] {msandbox} ((none)) > set global master_verify_checksum=ON;
Query OK, 0 rows affected (0,00 sec)

1 2	master [localhost] {msandbox} ((none)) > set global master_verify_checksum=ON; Query OK, 0 rows affected (0,00 sec)

Now if we have corruption in the binary log, we will still be able to write on the master. One way to show the corruption is to use SHOW BINLOG EVENTS:

master [localhost] {msandbox} ((none)) > show binlog events;
ERROR 1220 (HY000): Error when executing command SHOW BINLOG EVENTS: Wrong offset or I/O error

1 2	master [localhost] {msandbox} ((none)) > show binlog events; ERROR 1220 (HY000): Error when executing command SHOW BINLOG EVENTS: Wrong offset or I/O error

But on slaves, the corruption will be obvious as replication will stop with an error:

slave1 [localhost] {msandbox} ((none)) > show slave statusG
	       [...]
 Slave_IO_Running: No
Slave_SQL_Running: Yes
	       [...]
    Last_IO_Errno: 1236
    Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'event read from binlog did not pass crc check; the first event 'mysql-bin.000001' at 2793, the last event read from './mysql-bin.000001' at 2793, the last byte read from './mysql-bin.000001' at 2896.'

slave1 [localhost] {msandbox} ((none)) > show slave statusG

[...]

Slave_IO_Running: No

Slave_SQL_Running: Yes

[...]

Last_IO_Errno: 1236

Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'event read from binlog did not pass crc check; the first event 'mysql-bin.000001' at 2793, the last event read from './mysql-bin.000001' at 2793, the last byte read from './mysql-bin.000001' at 2896.'

Conclusion

You should not assume that replication checksums in MySQL 5.6 will ensure data integrity in all scenarios, but they do improve reliability of the whole replication process. This is a great step forward, thank you replication team at Oracle!

0 0 votes

Article Rating

5 Comments

Oldest

Newest Most Voted

Inline Feedbacks

View all comments

Jay Janssen

13 years ago

Nice feature, but it smells yet another way to slow down the SQL thread.

Usually a corrupted relay log is quickly caught by the SQL thread, but I suppose if the right cosmic ray hit, it’d only permutate the data without corrupting the full record and this would definitely be a good way to catch that.

Admin

Peter Zaitsev

13 years ago

Jay: Do we have any data about how much it solves the SQL Thread down ? Modern CPUs can compute crc32 in hardware at rate of gigabytes/sec so I would expect the overhead to be negligible though it would be good to examine sometime as a lot can depend on implementation.

Stephane: Thanks for posting and to be frank I’m quite disappointed with error message which diagnoses checksum mismatch. I would really like to see the message to be split in several messages which clearly point to what is the issue instead of “one of 25 things could have went wrong”

khan

13 years ago

Just a side note, checksums can be implemented in 5.5 and earlier by using SSL replication as a workaround.

cbs phoenix

12 years ago

Hello, Neat post. There’s а problem together with your site in internet explorer, could check this?
IE nonetheless is the maretplace chief and a huge part
of other ρwople will leave outt yοur fantasstic writіոg
due to this problem.

Kalyan

11 years ago

replication failed. I/O thread is not working. I got the below error.

2014-06-17 03:32:42 4694 [ERROR] Slave I/O: Replication event checksum verification failed while reading from network. Error_code: 1743
2014-06-17 03:32:42 4694 [ERROR] Slave I/O: Relay log write failure: could not queue event from master, Error_code: 1595
2014-06-17 03:32:42 4694 [Note] Slave I/O thread exiting, read up to log ‘*****-bin.00043’, position 142134355

I used STOP SLAVE; START SLAVE;
now the replication is working fine. can i assume that the replication is up to date? if not, what are the things i need to check?