September 2, 2014

Three ways to know when a MySQL slave is about to start lagging

The trouble with slave lag is that you often can’t see it coming. Especially if the slave’s load is pretty uniform, a slave that’s at 90% of its capacity to keep up with the master can be indistinguishable from one that’s at 5% of its capacity.

So how can you tell when your slave is nearing its capacity to keep up with the master? Here are three ways:

One: watch for spikes of lag. If you have Cacti (and these Cacti templates for MySQL) you can see this in the graphs. If the graphs start to get a little bumpy, you can assume that the iceberg is floating higher and higher in the water, so to speak. (Hopefully that’s not too strange a metaphor.) As the slave’s routine work gets closer and closer to its capacity, you’ll see these spikes get bigger and “wider”. The front-side of the spike will always be less than a 45-degree angle in ordinary operation[1] but the back-side, when the slave is catching up after lagging behind, will become a gentler and gentler slope.

Two: deliberately make a slave fall behind, then see how fast it can catch up. This is sort of related to Method One. The goal here is to explicitly see how steep the backside of that slope is. If you stop a slave for an hour, then start it again and it catches up in one hour, it is running at 1/2 of its capacity. (In case that’s confusing: if you stop it at noon and restart it at 1:00, and it’s caught up again at 2:00, it played all statements from 12:00 to 2:00 in 1 hour, so it went at 2x speed.)

Three: measure it more scientifically. Use our patched server, which gives you a USER_STATISTICS table.

You can compare the BUSY_TIME to one-half the CONNECTED_TIME (because there are two replication threads on the slave) to see how much of the time the slave thread was actively processing statements. If the slave threads are always running, you can just use the server’s uptime instead.

[1] There are cases where this isn’t true, especially if you’re monitoring Seconds_behind_master instead of using mk-heartbeat, which is immune to this anomaly.

About Baron Schwartz

Baron is the lead author of High Performance MySQL.
He is a former Percona employee.

Comments

  1. Rob Wultsch says:

    Any general comments about max advised load on slaves?

  2. Don’t load your slaves so much that they fall behind.

    Plan for the slaves to have a hard time keeping up with the master, even with the same or better hardware.

    Plan for slave lag and make your application cope with it.

  3. Assuming that you have no network problem between your nodes. What would you do next. I mean if your slave begins to lag behind your master

  4. Patrick Domack says:

    I have been using the mk-heartbeat type method for a few years now (using my own version). I found this excentual to actually being able to tell many issues with a slave itself. I have also wrapped the main critical mysql connected around checking the lag, and if it’s excessive, to switch to a different slave. It has worked well.

  5. frederic, you have to take some load off the server. We have a whole section in our book about how to help slaves keep up. It’s in the replication chapter. http://www.amazon.com/gp/product/0596101716?tag=perinc-20

  6. Sean says:

    Baron, any future plans to create cacti graphs to log the *_statistics tables and/or include the #mysql_system# user in the replication templates? :)

  7. peter says:

    Baron,

    I would note “Bumpy” Lag graphs can also start happening in case you have some long executing queries being passed to the slave which is bad thing on itself.

    I would recommend avoiding queries longer than 1/3rd of accepted lag in the replication. Ie if you want to be within 60sec you should ensure no queries going through replication take more than 20sec to execute.

    Speaking about slave load I’d like to see slave to have at least 5x capacity when it is idle and at least 3x when it is operating during normal load this means you can take some spikes without lagging, though it is very application dependent.

    Considering how much it takes to do roll forward recovery from backup is another important thing to keep in mind.

  8. Sean, describe what you’d like them to graph. Oh, and if you want commit rights to the SVN repository, just say yes :)

Speak Your Mind

*