Seconds_Behind_Master fluctuating wildly? Check for events caught in a loop


Recently I was working with a customer where we noticed that Seconds_Behind_Master fluctuating from an expected value of 0 seconds behind to a fairly high six figure value.  The servers were configured in a master-master relationship and used 5 figure server_id values, and we had just migrated this cluster from one data centre to another by re-pointing masters.  Seeing large fluctuations in Seconds_Behind_Master can often be explained by long running queries being processed by the SQL_THREAD, however SHOW PROCESSLIST indicated that there were no long running replication events, and we had no other indication that the server was lagging  due to resource constraints — CPU, disk, and memory were under-utilized.

We then moved our investigation to manual review of the binary log where events appeared normal (5 digit server_id values) until every once in a while we would see a rash of server_id 21 events.. Wait, what?  I asked the customer about this server_id and was informed that this was in fact the old master from the original data centre.  We looked again at the dates on the server_id 21 events and sure enough they were from the time period when the cut-over between data centres took place.  Conclusion: We had a series of old events caught in a loop between the masters that due to log_slave_updates and the lack of a server_id 21 host meant that these events would never be terminated and would loop indefinitely.

So what was the fix?  We leveraged the CHANGE MASTER TO syntax to utilize IGNORE_SERVER_IDS setting:

This virtually immediately caused the server_id 21 events to be dropped and for Seconds_Behind_Master to stop fluctuating.

The MySQL manual has a short discussion regarding IGNORE_SERVER_IDS:

IGNORE_SERVER_IDS was added in MySQL 5.5. This option takes a comma-separated list of 0 or more server IDs. Events originating from the corresponding servers are ignored, with the exception of log rotation and deletion events, which are still recorded in the relay log.

In circular replication, the originating server normally acts as the terminator of its own events, so that they are not applied more than once. Thus, this option is useful in circular replication when one of the servers in the circle is removed. Suppose that you have a circular replication setup with 4 servers, having server IDs 1, 2, 3, and 4, and server 3 fails. When bridging the gap by starting replication from server 2 to server 4, you can includeIGNORE_SERVER_IDS = (3) in the CHANGE MASTER TO statement that you issue on server 4 to tell it to use server 2 as its master instead of server 3. Doing so causes it to ignore and not to propagate any statements that originated with the server that is no longer in use.

Hopefully this article helps someone else trying to explain Seconds_Behind_Master variations!


Leave a Reply

Your email address will not be published. Required fields are marked *