Working day to day with Percona Remote DBA customers, we have been facing an issue from time to time when MySQL replication lag is flapping between 0 and XXXXX constantly – i.e. Seconds_Behind_Master is 0 for a few secs, then it’s like 6287 or 25341, again 0 and so on. I would like to note the 5 different scenarios and symptoms of this – some are obvious and some are not really.
1. Duplicate server-ids on two or more slaves.
Symptoms: MySQL error log on a slave shows the slave thread is connecting/disconnecting from a master constantly.
Solution: check whether all nodes in the replication have unique server-ids.
2. Dual-master setup, “log_slave_updates” enabled, server-ids changed.
Scenario: you stop MySQL on the first master, then you stop the second one. Afterwards, you perform some migration or maintenance. Suddenly, you realize that it would be better to change server-ids to match the end part of new IP addresses. You bring everything back online and noticed the strange behaviour like some portion of binlogs cycle between masters due to log_slave_updates enabled. But how come if new server-ids are different? It’s really could be true when the some data were written to the second master prior its shutdown when the first one was already off and both nodes started recognizing cycled data as not their own, thus applying them and passing down the replication as sender’s server-id does not match server-id written in the binlogs data itself prior shutdown. So we have got an infinite loop.
Solution: simply reset the slave position on one of the masters a new, so finally the “ghost” binlogs will stop cycling.
3. MySQL options “sync_relay_log”, “sync_relay_log_info”, “sync_master_info”.
Symptoms: according to the “SHOW SLAVE STATUS” output, one time the slave thread queues events showing some delay, another time waits for master showing 0 lag and so on. Considering the real master position, indeed there should be XXXXX delay. Another symptoms are IO saturation and high disk IOPS number but a disk is only busy a half as shown by pt-diskstats. In my case I was observed 1500 iops on master, 10x more – 15000 iops being 60% busy on slave. You may think it could be a row-based replication (binlog_format=ROW) and constant updates on a master but still why the delay is flapping and disk IO is high but not on 100% of capacity?
Possible reason: check whether you do not have enabled sync_relay_log, sync_relay_log_info, sync_master_info which make slave to sync data to disk on every event, e.g. sync_relay_log ensures one write to the relay log per statement if autocommit is enabled, and one write per transaction otherwise. Enabling those options makes your system slow unless the disk has a battery-backed cache, which makes synchronization very fast.
4. Network latency.
Symptoms: there is a poor network link between master and slave. With saturated link the slave IO_thread will have a hard time keeping up with master. SQL thread reports zero lag while it just waits for more events stored by IO_thread. Once IO_thread is able to write more into relay log, SQL_thread will report XXX seconds behind again.
Solution: check the network throughput.
5. Late committed transactions.
Scenario: if you open a transaction on master and then for some reason (maybe application bug or frontend problems) you commit it some time later, it will show it as replication latency on the slave. So for example, if you wait one hour before commit, it will show 3600 seconds of latency on the slave while that transaction is being executed.
Solution: use pt-heartbeat tool.