Managing Slave Lag with MySQL ReplicationPeter Zaitsev
The question I often get is how far MySQL may fall behind and how to keep replication from lagging.
The lag you will see will vary a lot from application to the application and from load to load. Plus what is the most important within same application the lag will likely have spikes – most of applications would have typical lag within few milliseconds while there will be rare cases when replication lags behind several seconds or even longer.
There are multiple reasons why application falls behind and why we see those lag spikes:
Slave Server Load – MySQL Replication goes in single thread so it is really vulnerable to the server load. If you get 100 active queries running on the slave overloading it, slave thread will most likely will not get CPU or Disk resources it needs in order to keep going. So if you want to keep replication lag under control you need to keep Slave load under control and avoid load spikes as load spikes will frequently cause replication lag spikes as well.
Locks – MySQL Replication SQL Thread executes queries same as any other thread and it has to grab the locks needed for query execution – either row level locks for Innodb or table locks for MyISAM and any DDL statements. If you happen to have queries which set a lot of locks on the Slave expect lag spike. For example MyISAM slaves used for long reporting queries can have queries running for hours which stalls replication progress for this time. As MySQL Replication goes in single thread it is enough for one statement to get stuck to have all replication stalled.
Long Queries – MySQL Replication executes queries one after another on the slave, so if you have query which takes certain time on the master you should expect Slave lag spike for at least this amount of time (assuming it takes the same time Master and Slave to run the query). If you have a query which runs 1 minute on the Master it will likely run 1 Minute on the slave stalling all newer updates propagation for the time it runs.
This one is actually the most fun to deal with as there are many techniques and workarounds. First you normally start with query chopping making sure all your update queries are short. If you need to update 1.000.000 rows you do not do it in the same query but get 1000 small and short queries instead.
This however does not work when you need to ALTER TABLE, though running it on Master is frequently not an option either. So this query is often run directly on the Slave(s) when master switched to one of them (especially easy with Master-Master setup) and when query repeated on the second one. Doing this on Master-Master setup do not forget to disable binary logging with SET SQL_BIN_LOG=0 before you run ALTER TABLE query.
Replication Overload Many tutorials out where will give you impression problem with replication starts only when write load is high enough so Master is almost fully busy handling the writes. This is not the case – because MySQL Replication runs in single thread (two threads, but only one of them executes the query) it is not able to use the CPU or the Disks on the slaves as efficient on the Master and the more powerful the Master is the smaller portion of full write capacity you will be able to pass through the replication. If you happen to have IO bound load and have 20 hard drives on the master you may be able to replicate only 10% of the master capacity to the slave.
So watch your replication and see if it getting overloaded. Typically you will see it as increased lag spikes as well as lag which increases and stays high for significant amount of time.
Idea: MySQL Should add “Replication Load Average” metric which would tell what time replication thread was busy processing events compared to being idle. This would help a lot to see when you’re close to the limit.
To manage replication lag and to understand what is loading your replication it is also helpful to examine execution time of queries being replicated. Unfortunately MySQL Slow Query Log does not log replication queries. Though this is one of few slow query logging improvements you can get with our Slow Query Log patch.
You may also interested to know how to measure replication lag – MySQL Toolkit has a great tool for real latency measurement. Do not trust Seconds_Behind_Master too much, even though it is close to the true most of the time there are number of cases when it will be way off from reality.
And the final advice – do not assume very short lag time when planning your application. Having application which can adapt to lag time rather than break is very good idea. Especially it will be handy when you will be reaching replication capacity and will need to buy time to fix things, during which replication lag can be higher than normally.