The problem of MySQL Replication unable to catch up is quite common in MySQL world and in fact I already wrote about it. There are many aspects of managing mysql replication lag such as using proper hardware and configuring it properly. In this post I will just look at couple of query design mistakes which result in low hanging fruit troubleshooting MySQL Replication Lag
First fact you absolutely need to remember is MySQL Replication is single threaded, which means if you have any long running write query it clogs replication stream and small and fast updates which go after it in MySQL binary log can’t proceed. It is either more than than just about queries – if you’re using explicit transactions all updates from the transactions are buffered together and when dumped to binary log as one big chunk which can’t be interleaved by any other query execution. So if you have transaction containing millions of simple updates instead of one large update to help MySQL replication lag it is not going to work.
This brings us to rule number one – if you care about replication latency you must not have any long running updates. Queries or transactions containing multiple update queries which add up to long time. I would keep the maximum query length at about 1/5th of the maximum replication lag you’re ready to tolerate. So if you want your replica to be no more than 1 minute behind keep the longest update query to 10 sec or so. This is of course rule of thumb depending on differences in master/slave configuration, their load and concurrency you may need to keep the ratio higher or allow a bit longer queries.
What should you do if you need to update a lot of rows ? Use Query Chopping – this can be running update/delete with LIMIT in the loop, controlling maximum amount of values per batch in multiple row insert statement or Fetching data you’re planning to update/delete and having multiple queries to delete it (see example below)
This brings us to yet another rule for smart replication – do not make Slave to do more work than it needs to do. It is crippled by having to do all of this in single thread already – do not make it even harder. If there is considerable effort needed to select rows for modification – spread it out and have separate select and update queries. In such case slave will only need to run UPDATE
UPDATE posts SET spam=1 WHERE body LIKE "%cheap rolex%";
This query will perform full table scan in MySQL 5.0 (even if there are no spam posts) which will load slave significantly. You can replace it with:
SELECT id FROM posts WHERE body LIKE "%cheap rolex%";
UPDATE posts SET spam=1 WHERE id IN (list of ids)
If there could be many ids matched on the first place you should also use query chopping and run update in chunks if application allows it.
In MySQL 5.1 with row level replication you will not have selection process running on SLAVE but it will not do the chopping for you.
In general this trick does not only work well for full table scan updates but in general for cases when there are much more rows examined than modified.
The next common mistake is using INSERT … SELECT – which is in similar to what I just described but can be much worse as SELECT may end up being extremely complicated query. It is best to avoid INSERT … SELECT going through replication in 5.0 for many reasons (locking, long query time, waste of execution on slave). Piping data through application is the best solution in many cases and is quite easy – it is trivial to write the function which will take SELECT query and the table to which store its result set and use in your application in all cases when you need this functionality.
Finally you should not overload your replication – Quite typically I see replication lagging when batch jobs are running. These can load master significantly during their run time and make it impossible for slave to run the same load through single thread. The solution in many cases is to simply space it out and slow down your batch job (such as adding sleep calls) to ensure there is enough breathing room for replication thread.
You can also have controlled execution of batch job – this is when they will check slave lag every so often and pause if it becomes too large. This is a bit more complicated approach but it saves you from running around and adjusting your sleep behavior to keep the progress fast enough and at the same time keep replication from lagging.
In many bad replication lags I’ve seen simply following these simple rules would avoid a lot of problems and often save massive hardware purchases or development efforts based on assumption MySQL replication can’t possibly keep up any more.