Seconds_Behind_Master fluctuating wildly? Check for events caught in a loop

Posted on:



Share Button

Recently I was working with a customer where we noticed that Seconds_Behind_Master fluctuating from an expected value of 0 seconds behind to a fairly high six figure value.  The servers were configured in a master-master relationship and used 5 figure server_id values, and we had just migrated this cluster from one data centre to another by re-pointing masters.  Seeing large fluctuations in Seconds_Behind_Master can often be explained by long running queries being processed by the SQL_THREAD, however SHOW PROCESSLIST indicated that there were no long running replication events, and we had no other indication that the server was lagging  due to resource constraints — CPU, disk, and memory were under-utilized.

We then moved our investigation to manual review of the binary log where events appeared normal (5 digit server_id values) until every once in a while we would see a rash of server_id 21 events.. Wait, what?  I asked the customer about this server_id and was informed that this was in fact the old master from the original data centre.  We looked again at the dates on the server_id 21 events and sure enough they were from the time period when the cut-over between data centres took place.  Conclusion: We had a series of old events caught in a loop between the masters that due to log_slave_updates and the lack of a server_id 21 host meant that these events would never be terminated and would loop indefinitely.

So what was the fix?  We leveraged the CHANGE MASTER TO syntax to utilize IGNORE_SERVER_IDS setting:

This virtually immediately caused the server_id 21 events to be dropped and for Seconds_Behind_Master to stop fluctuating.

The MySQL manual has a short discussion regarding IGNORE_SERVER_IDS:

IGNORE_SERVER_IDS was added in MySQL 5.5. This option takes a comma-separated list of 0 or more server IDs. Events originating from the corresponding servers are ignored, with the exception of log rotation and deletion events, which are still recorded in the relay log.

In circular replication, the originating server normally acts as the terminator of its own events, so that they are not applied more than once. Thus, this option is useful in circular replication when one of the servers in the circle is removed. Suppose that you have a circular replication setup with 4 servers, having server IDs 1, 2, 3, and 4, and server 3 fails. When bridging the gap by starting replication from server 2 to server 4, you can includeIGNORE_SERVER_IDS = (3) in the CHANGE MASTER TO statement that you issue on server 4 to tell it to use server 2 as its master instead of server 3. Doing so causes it to ignore and not to propagate any statements that originated with the server that is no longer in use.

Hopefully this article helps someone else trying to explain Seconds_Behind_Master variations!

Share Button

Michael Coburn

Michael joined Percona as a Consultant in 2012 after having worked with high volume stock photography websites and email service provider platforms. With a foundation in Systems Administration, Michael enjoys working with SAN technologies and high availability solutions. A Canadian, Michael currently lives in Costa Rica with his wife, two children, and two dogs.


Insight for DBAs, MySQL, Percona MySQL Consulting

  • Hi Michael,

    We are using percona version 5.6.15 for Master Slave Configuration.
    For past few weeks we are noticing increase in value for seconds_behind_master.

    We had changed this innodb-flush-log-at-trx-commit = 2
    in my.cnf & restarted mysql service still there is no change.

    As of now the seconds behind master is value is Seconds_Behind_Master: 42293

    Not sure how to bring it down.Could you please help me with this.

    By the way we are using 1000 iops for our DB machines with 15 GB Memory in Amazon .

  • Michael Coburn Post author

    Hi Ravi,

    The slave is only able to apply events/statements as quickly as the resources available are able to keep up — that means that you could be hitting a bottleneck of the capacity of your CPU or disks. You should check to see if you are exceeding any of these thresholds. You can also try setting sync_binlog=0 if that has not already been done. Otherwise, I encourage you to refer to our Forums for additional support as this blog isn’t set up to deal with specific, ongoing advice unlike our Forums: https://www.percona.com/forums/questions-discussions/mysql-and-percona-server . Best of luck!

Leave a Reply