One common theme in the questions our MySQL Support customers ask is MySQL Replication Lag. The story is typically along the lines everything is same as before and for some unknown reason the slave is started to lag and not catching up any more. I always smile at “nothing has changed” claim as it usually wrong, and one should claim I’m not aware of any change. Digging deeper we would find some change, though often subtle as computers are state machines and with no external and internal changes they behave the same way. First let me start pointing out common causes of replication lag
Hardware Faults Hardware faults may cause reduced performance causing replication to lag. Failed hard drive getting RAID in degraded mode could be one common example. Sometimes it is not fault per say but operating mode change may cause the problem. Many RAID controllers have to periodically go through battery learning cycle which switches RAID cache in write-through mode which can cause serve performance degradation.
Configuration Changes Have you changed MySQL configuration ? OS Configuration or Hardware configuration or could have anyone changed it without your knowledge ? If yes this could cause problems with replication lag (and performance in general). Typical problems for replication include setting sync_binlog=1, enabling log_slave_updates, setting innodb_flush_log_at_trx_commit=1. Also note there is no such thing as safe change. Even such “it can only do good” change as increasing buffer pool size could cause performance problems.
MySQL Restart MySQL Restarts may include recovery time but what is the most important they often come with a need of warmup, because buffer pool content is lost on restart. Many people understanding this do not quite recognize how long it takes, and by the fact removing slave read traffic from non warmed up slave will often cause warmup to be longer, because it has to happen in the single thread. In the worse case scenario you might be looking at some 100 pages per second which is 1.6MB/sec which may take 24+hours to feel 200GB buffer pool. Automatic Buffer Pool Save/Restore available in Percona Server is great help in this case.
Changing Slave Role Changing Slave role, such as swapping active and passive masters in Master-Master setup is often cause of the problem for similar reason as MySQL Restart – the traffic on different boxes might be different and hence cache content.
Replication Traffic Changes Changing in the replication traffic (what updates are being replicated) are frequent cause of concern. Remember replication runs as a single thread and something which might not look like a lot of load for your powerful master may be enough to take replication overboard. Application changes and Traffic increases are frequent causes of problems but more subtle things such as “small” batch job being ran which does a lot of updates can also be the problem. It is nice to analyze your replication traffic regularly with mk-query-digest (Percona Server marks Slave Thread in slow query log so it is easy to filter out events) so you can see whenever there are any suspicious events in the replication traffic. An interesting thing to note – scripts can hurt replication traffic well after they are ended. I’ve seen cases when overly eager batch job doing updates from 20 threads on the master will create enough updates for a few hours, to make replication lag for over 24 hours until they all could be done with.
Slave Read Traffic Changes Slave read traffic impacts replication performance dramatically and even if nothing has changed to the replication traffic, changes to concurrent traffic on the slave can affect replication capacity a lot. Checking traffic for changes with mk-query-digest is a great way to deal with it. Note you need not only to check the types and frequencies of queries but also how much IO they are causing and how many rows they analyzes. Changes could be in the data which is being accessed by the queries or query execution plans.
Side load on IO subsystem Could there be any side load on IO subsystem ? This can take many forms. It can be you running backup through xtrabackup or LVM snapshot, RAID controller doing RAID validation, some centrally kicked of SAN backup, intensive activity on the other tenants sharing physical hardware with you in SAN, NAS or Cloud installations. It all can be the cause of the problems. In case the side load is going through the same operating system you should be able to see it by comparing IO MySQL is doing to IO going to devices from OS standpoint. If this is external and not recorded by operating system you often can see it as higher IO load (IO utilization, response time) while less work is being done in terms of requests per second. Should be easy to see on the graphs. If you have not yet installed something for graphing check out Cacti Tempates which have a lot of such goodness.
Data Size ChangesThe more data you have the more IO bound workload is likely to be and also it is possible queries become more complicated scanning more rows than before. Data growth is frequent cause of performance problems and slave lag among them. The relationship is also very workload dependent – some may see only modest slowdown when data doubles, while in other case 10-20% difference in data size can be the difference between slave never lags and slave never is able to catch up.
MySQL or OS version changesMySQL version changes (even minor upgrade) may introduce issues which causes replication lag. Same may happen with operating system and other system software involved. In could be even system which queries mysql. I’ve seen monitoring updates which would add excessively expensive and frequent query on information_schema which caused a lot of issues. Note there is a lot of complicated software which runs on devices themselves now, so watch for things like firmware updates bios updates etc, which even though not frequent may be cause of the issues.
Finally – do not be mistaken, it is not always just one issue which is responsible for everything. Quite often it is many things which come together to form the perfect storm – just a bit of data growth which just a bit different workload and when backup taken at the wrong time may be responsible for unexpected MySQL replication lag.