September 15, 2014

Just how useful are binary logs for incremental backups?

We’ve written about replication slaves lagging behind masters before, but one of the other side effects of the binary log being serialized is that it also limits the effectiveness of using it for incremental backups. Let me make up some numbers for the purposes of this example:

  • We have two servers in a master-slave topology.
  • The database size is 100 GB (same tables on each).
  • The slave machine barely keeps up with the master (at 90% capacity during peak, 75% during off-peak).
  • The peak window is 12 hours; the off-peak window is 12 hours.

Provided that the backup method was raw data files, it shouldn’t take much more than 30 minutes to restore 100 GB (at 50 MB/s), but replaying one day of binary logs would take an additional ~20 hours ((12 * 0.9) + (12 * 0.75) = 19.8 hours, or 19h48m).

If you wanted to do something like set up a new slave from a 24-hour-old backup, and apply the binary logs continuously until it catches up, you will be waiting almost 5 days for that to happen (each day has 24h - 19h48m = 4h12m of “free” capacity, and 19h48m / 4h12m ≈ 4.7 days).
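
To put the same arithmetic in a form you can reuse, here is a quick sanity check (a sketch using awk – plug in your own peak/off-peak slave utilization and window lengths):

    # Estimate how long a 24-hour-old slave takes to catch up by replaying binlogs.
    awk 'BEGIN {
      replay = 12 * 0.90 + 12 * 0.75;   # hours of replay work generated per day
      spare  = 24 - replay;             # spare slave hours per day to work off the backlog
      printf "replay/day = %.1fh, spare/day = %.1fh, catch-up = %.1f days\n",
             replay, spare, replay / spare
    }'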

So what are the solutions?
If you are using all InnoDB tables, an XtraBackup incremental backup should be much faster than using binary logs.  You can understand Vadim’s excitement when he announced this feature a few months ago.
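
For reference, here is a minimal sketch of what that looks like on the command line (the paths are hypothetical and the exact options depend on your XtraBackup version, so check the manual):

    # Take a full base backup first
    xtrabackup --backup --target-dir=/backups/base

    # Later: an incremental backup containing only pages changed since the base
    xtrabackup --backup --target-dir=/backups/inc1 --incremental-basedir=/backups/base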

If you are using multiple storage engines, then your options are to either try a time-delayed slave (to keep a spare slave close to up to date), or hope that you don’t need to restore very often!  Eventually this problem should be lessened by a MySQL Server feature: parallel execution on slaves.  Let’s hope that it gets enough testers so that it makes it into a new release very quickly.
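
If you go the time-delayed slave route, one way to sketch it is with pt-slave-delay from Percona Toolkit (mk-slave-delay in older Maatkit releases) – the host and credentials below are made up, and option syntax may vary between versions:

    # Keep a dedicated slave roughly one hour behind the master
    pt-slave-delay --delay 1h --interval 15s h=backup-slave,u=repl,p=secret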

About Morgan Tocker

Morgan is a former Percona employee.
He was the Director of Training at Percona. He was formerly a Technical Instructor for MySQL and Sun Microsystems. He has also previously worked in the MySQL Support Team, and provided DRBD support.

Comments

  1. peter says:

    Morgan,

    I would mention that if you’re running your slave at 90% capacity during peak load, you’re looking for big trouble.

    I would highly recommend keeping the slave thread busy no more than 30-50% of the time – this makes sure you have at least some time to take action as load or data size grows. At 90% you can wake up one morning and find that things never catch up :)

  2. Daniellek says:

    Hardcore (but working) solution:
    Setup: Master+2 slaves (one slave is “production”, and the other is “backup”)

    mysql-backup stop
    tar czvf ….
    mysql-backup start

  3. Morgan Tocker says:

    @Peter – Yes, I chose a pretty harsh set of numbers for the purpose of illustration – but there are a lot of people in this category. If we follow your advice and repeat the math with 50% slave load (peak) and 30% slave load (off-peak), it doesn’t look as bad:

    To replay logs: (12 * 0.50) + (12 * 0.30) = 9h36m
    For the slave to come online: each day has 24h - 9h36m = 14h24m of “free” capacity, so it will be online in less than a day.

    @Daniellek – I think your solution is a good one. Always have a slave with no load that is up to date and ready to clone from. You still need to remember that it doesn’t count as a backup on its own. There’s a range of problems (e.g. accidental/malicious deletes) that you need to build a strategy to cope with.

  4. Just to mention that XtraBackup incremental backups do not back up MyISAM/ARCHIVE tables (yet), and so provide incremental backups only for InnoDB tables.

  5. I’m curious why the implementation notes in that workload preclude UPDATE. UPDATE can be logically decomposed into DELETE followed by INSERT. This doesn’t inhibit transaction serialization.

  6. Hi Morgan,

    Have you tried Tungsten Replicator? It has built-in time-delay replication. Also, it now has built-in backup/restore support (checked in; it will be in an official build shortly). I’m going to look at putting in support for invoking XtraBackup as it seems pretty good, especially the incremental capability.

    Finally, as Peter mentioned, running replication anywhere near full capacity is asking for trouble. One thing we are planning for later in the year in Tungsten is to split events into different “channels” for replication in parallel. Parallel replication seems to be particularly helpful when you have events that are truly slow, such as DDL, or for replication of data that are already sharded.
