October 25, 2014

Managing Slave Lag with MySQL Replication

The question I often get is how far MySQL may fall behind and how to keep replication from lagging.

The lag you will see will vary a lot from application to the application and from load to load. Plus what is the most important within same application the lag will likely have spikes – most of applications would have typical lag within few milliseconds while there will be rare cases when replication lags behind several seconds or even longer.

There are multiple reasons why application falls behind and why we see those lag spikes:

Slave Server Load – MySQL Replication goes in single thread so it is really vulnerable to the server load. If you get 100 active queries running on the slave overloading it, slave thread will most likely will not get CPU or Disk resources it needs in order to keep going. So if you want to keep replication lag under control you need to keep Slave load under control and avoid load spikes as load spikes will frequently cause replication lag spikes as well.

Locks – MySQL Replication SQL Thread executes queries same as any other thread and it has to grab the locks needed for query execution – either row level locks for Innodb or table locks for MyISAM and any DDL statements. If you happen to have queries which set a lot of locks on the Slave expect lag spike. For example MyISAM slaves used for long reporting queries can have queries running for hours which stalls replication progress for this time. As MySQL Replication goes in single thread it is enough for one statement to get stuck to have all replication stalled.

Long Queries – MySQL Replication executes queries one after another on the slave, so if you have query which takes certain time on the master you should expect Slave lag spike for at least this amount of time (assuming it takes the same time Master and Slave to run the query). If you have a query which runs 1 minute on the Master it will likely run 1 Minute on the slave stalling all newer updates propagation for the time it runs.

This one is actually the most fun to deal with as there are many techniques and workarounds. First you normally start with query chopping making sure all your update queries are short. If you need to update 1.000.000 rows you do not do it in the same query but get 1000 small and short queries instead.

This however does not work when you need to ALTER TABLE, though running it on Master is frequently not an option either. So this query is often run directly on the Slave(s) when master switched to one of them (especially easy with Master-Master setup) and when query repeated on the second one. Doing this on Master-Master setup do not forget to disable binary logging with SET SQL_BIN_LOG=0 before you run ALTER TABLE query.

Replication Overload Many tutorials out where will give you impression problem with replication starts only when write load is high enough so Master is almost fully busy handling the writes. This is not the case – because MySQL Replication runs in single thread (two threads, but only one of them executes the query) it is not able to use the CPU or the Disks on the slaves as efficient on the Master and the more powerful the Master is the smaller portion of full write capacity you will be able to pass through the replication. If you happen to have IO bound load and have 20 hard drives on the master you may be able to replicate only 10% of the master capacity to the slave.

So watch your replication and see if it getting overloaded. Typically you will see it as increased lag spikes as well as lag which increases and stays high for significant amount of time.

Idea: MySQL Should add “Replication Load Average” metric which would tell what time replication thread was busy processing events compared to being idle. This would help a lot to see when you’re close to the limit.

To manage replication lag and to understand what is loading your replication it is also helpful to examine execution time of queries being replicated. Unfortunately MySQL Slow Query Log does not log replication queries. Though this is one of few slow query logging improvements you can get with our Slow Query Log patch.

You may also interested to know how to measure replication lag – MySQL Toolkit has a great tool for real latency measurement. Do not trust Seconds_Behind_Master too much, even though it is close to the true most of the time there are number of cases when it will be way off from reality.

And the final advice – do not assume very short lag time when planning your application. Having application which can adapt to lag time rather than break is very good idea. Especially it will be handy when you will be reaching replication capacity and will need to buy time to fix things, during which replication lag can be higher than normally.

About Peter Zaitsev

Peter managed the High Performance Group within MySQL until 2006, when he founded Percona. Peter has a Master's Degree in Computer Science and is an expert in database kernels, computer hardware, and application scaling.

Comments

  1. If we can get the ‘SHOW USER_STATISTICS’ functionality from the Google patch into MySQL, then you can compute replication load average using the Busy_time output for the system user and Uptime as reported by SHOW STATUS. Busy_time is the number of seconds that user had been running SQL statements.

    I had not thought of monitoring this before. It is a great idea.

  2. peter says:

    Thanks Mark,

    Hopefully you will be able to get your patch accepted soon.

  3. Xaprb says:

    Yes, it would be great to have at least parts of Google’s patches added. Right now (most of you readers know this, but for those who don’t) you have no idea how “close to the ceiling” you are with replication. You find out when the slave abruptly can’t keep up anymore.

  4. Michael says:

    I highly recommend using local or no_write_to_binlog instead of disabling binary logging before altering the table, it prevents headaches, and forgetfulness causing problems later on.

    Michael

  5. peter says:

    Michael,

    Indeed no_write_to_binlog or local can be better for alter table optimize table.

    To bad it is not supported as a flag for all commands as you may be willing to run without replication any kind of command.

  6. jenny says:

    The replication query execution information can easily be traced with Dtrace by inserting the dtrace probes into the mysql_update function, plus, it can report query excution less than 1 sec (ms or ns)

  7. Mark says:

    In a lot of cases it’s hard to find a way speed up replication, so you have to be creative within your web apps. Another big problem I found is trying to test/report replication speeds under heavy load.

    I came up with a way for situations where you need to INSERT then SELECT the same data instantly – this is by using master/slave circular replication design, (the master/slaves can also have read-only slaves)… e.g. INSERT into the master – return the masters IP Address, then make a connection to the same master with the returned IP Address and select that same data that just got inserted into the master/slave. This can work for certain situations but can be hard getting the right load balance to the masters.

    Good luck :)

  8. Robin says:

    Setting LOW_PRIORITY on UPDATE/DELETE/INSERT queries may also cause a replication lag. It works fine on the master, but when it is being replicated to the slaves, it will cause the replication thread to wait until there are no more readers on that table.

  9. Smita says:

    Can we replicate a MyISAM master table to Inoodb Slave?

    The reason behind this kind of config is to avoid long select statement locks to prevent Slave tables from Updating.

    We ran into a situation where the Slave was being queried against a large table and the query took too long to run which prevented any update on that table, hence breaking replication. If we had Innodb, we would not run into such a situation.

    Is it possible to have master MYISAM table to be replicated on Slave INNODb table?
    What are the cons of such a configuration?

    Please let meknow.

  10. vitamin-R says:
  11. madhuri says:

    It’s very useful and informative 2….
    Thanks a bunch! :)

    Regards.,
    Madhuri

  12. adeela says:

    hello

    Can anyone help me to find out why always may slave machine lacks behind .. i enabled query log .. since 3 days no queries are found on it. and their is no load on the server memory is also there. still my slave lacks behind. sometine it is increasing but i found no query in slow log. the lacking is decreasing very slowly. but i want the slave not to lack every time. what more things i can check to find the reason of this problem.

  13. adeela says:

    hello

    Please reply if anyone have the answer .. reply me

  14. MySQL says:

    I am running Master-Slave on MySQL.
    I have couple of queries which runs every 25 minutes and does inserts, and updates and replace for ~100,00000 rows at a time.
    And most of the time i am facing > 1000 sec lag…any help???

  15. mbayoumi says:

    Hello Peter,

    Thanks for the great article. I am facing a big issue with mysql replication.
    I have a master db server for a distributed app that opens almost 150 connection and executes very light queries (all are inserts on duplicate key update).

    I have setup 2 slaves from that. One is running perfectly fine which also holds a light app that runs quick queries (inserts mostly) but the other is being totally killed by this replication (this other holds an app that performs heavy queries on millions of records to perform digital pricing for books …etc) that i have to stop replication everytime we are running this heavy process which is of course totally not feasible and the gap keeps increasing as we stop/start replication.

    I need to understand why starting replication on this slave totally kills its performance ! I run top and notice CPU and disk usage but nothing speaks out (no spikes or outrageous processes running and the slave server looks absolutely healthy).

    Few things i have tried with admins that helped but just very little bit; was to get a speedy RAID volume and isolate all DBs which run the heavy queries on but we couldn’t achieve the desirable normal performance (i.e. when slave is stopped).
    I am being really puzzled and in desperate need of help with this !!

    Thanks
    mbayoumi

Speak Your Mind

*