September 18, 2014

max_allowed_packet and binary log corruption in MySQL

The combination of max_allowed_packet variable and replication in MySQL is a common source of headaches. In a nutshell, max_allowed_packet is the maximum size of a MySQL network protocol packet that the server can create or read. It has a default value of 1MB (<= 5.6.5) or 4MB (>= 5.6.6) and a maximum size of 1GB. This adds some constraints in our replication environment:

  • The master server shouldn’t write events to the binary log larger than max_allowed_packet
  • All the slaves in the replication chain should have the same max_allowed_packet as the master server

Sometimes, even following those two basic rules we can have problems.

For example, there are situations (also called bugs) where the master writes more data than the max_allowed_packet limit causing the slaves to stop working. In order to fix this Oracle created a new variable called slave_max_allowed_packet. This new configuration variable available from 5.1.64, 5.5.26 and 5.6.6 overrides the max_allowed_packet value for slave threads. Therefore, regardless of the max_allowed_packet value the slaves’ threads will have 1GB limit, the default value of slave_max_allowed_packet. Nice trick that works as expected.

Sometimes even with that workaround we can get the max_allowed_packet error in the slave servers. That means that there is a packet larger than 1GB, something that shouldn’t happen in a normal situation. Why? Usually it is caused by a binary log corruption. Let’s see the following example:

Slave stops working with the following message:

The important part is “got fatal error 1236 from master”. The master cannot read the event it wrote to the binary log seconds ago. To check the problem we can:

  • Use mysqlbinlog to read the binary log from the position it failed with –start-position.

This is an example taken from our Percona Forums:

Check the size of the event, 1953066613 bytes. Or the “Unknown event” messages. Something is clearly wrong there. Another usual thing to check is the server id that sometimes doesn’t correspond with the real value. In this example the person who posted the binary log event confirmed that the server id was wrong.

  • Check master’s error log.

Again, the event is bigger than expected. There is no way the master and slave can read/write it, so the solution is to skip that event in the slave and rotate the logs on the master. Then, use pt-table-checksum to check data consistency.

MySQL 5.6 includes replication checksums to avoid problems with log corruptions. You can read more about it in Stephan’s blog post.

Conclusion

Errors on slave servers about max_allowed_packet can be caused by very different reasons. Although binary log corruption is not a common one, it is something worth checking when you have run out of ideas.

About Miguel Angel Nieto

Miguel joined Percona in October 2011. He has worked as a System Administrator for a Free Software consultant and in the supporting area of the biggest hosting company in Spain. His current focus is improving MySQL and helping the community of Free Software to grow. Miguel's roles inside Percona are Senior Support Engineer and Manager of EMEA Support Team.

Comments

  1. sbester says:

    For one genuine case, see the description of Bug #17842137 in the 5.6.16 change history…. Another commonly seen cause is incorrect positions for CHANGE MASTER, placing the slave to read in the middle of some event.

Speak Your Mind

*