September 2, 2014

Recovery beyond data restore

Quite frequently I see customers looking at MySQL recovery as the ability to restore data from a backup, which can be far from enough to bring the whole system back to an operating state, especially for complex systems.

Instead of looking just at the data restore process, you should look at the whole process required to bring the system back to a working state, including data consistency requirements and recovery times. This has to be considered for the different data loss scenarios that may happen.

Let us look at a simple example – a master with a 1TB database replicating to 50 servers in 5 different data centers via a single replication relay server in each. Forget about the single point of failure for a second and just think about what problems we may have to deal with.

First let’s look at the master. What may happen to it? It can have a soft crash, in which case it will be unavailable for some time but we can get all the data back… or sort of. In practice you have to be very careful – such as using sync_binlog=1 and innodb_flush_log_at_trx_commit=1 and using only InnoDB tables – to be OK in most cases. There are still some edge cases, such as modifying metadata stored in MyISAM tables, which can get the master out of sync with its slaves after a soft crash. Unless you hit one of these rare cases, the slaves should be able to continue once the master is back online.
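As a rough sketch of the settings mentioned above (option names as of MySQL 5.0/5.1; verify the details for your version), the my.cnf entries would look something like this:

    [mysqld]
    # sync the binary log to disk on every commit
    sync_binlog = 1
    # flush and sync the InnoDB log at every transaction commit
    innodb_flush_log_at_trx_commit = 1
    # make new tables InnoDB by default so they are crash-safe
    default-storage-engine = InnoDB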

Do you have to wait for the master to recover? This is where your data consistency requirements come into play. Remember replication is asynchronous, so whenever you switch to a slave after a master failure you may lose transactions. Google’s semi-synchronous replication patches can help with this… but they are not yet in stock MySQL. Yet another way is using DRBD to get a standby MySQL server, or at least synchronously replicated master binary logs. If you can’t afford to lose a single transaction, you simply can’t switch to the slave.
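The Google patches themselves are not shown here, but as a hedged point of reference, the semi-synchronous replication plugin that later shipped in stock MySQL 5.5 is enabled roughly like this (plugin file names vary by platform):

    -- on the master
    INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';
    SET GLOBAL rpl_semi_sync_master_enabled = 1;

    -- on each candidate slave
    INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';
    SET GLOBAL rpl_semi_sync_slave_enabled = 1;
    STOP SLAVE IO_THREAD; START SLAVE IO_THREAD;  -- restart the I/O thread to pick it up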

What if you can? The switch to a slave in this case is not that easy either – the slaves can all be at different positions relative to the master and you need to pick the most up-to-date one to promote. Plus you need to recompute the positions as they should be on the promoted slave, and that slave should have --log-slave-updates enabled so it has at least some copy of the master’s logs. In many cases I’ve seen people skip this and simply point the slaves at the starting position of the promoted master – this is dangerous because you risk all slaves becoming inconsistent with each other; plus, if a slave was seriously behind, you risk major inconsistency because its relay logs will be lost if you just re-point it. So at the very least you should wait for a slave to process all of its relay logs before re-pointing it.
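A hedged sketch of what that promotion procedure involves (host names, log file names and positions below are hypothetical):

    -- on each slave: stop fetching new events and let the SQL thread drain the relay logs
    STOP SLAVE IO_THREAD;
    SHOW SLAVE STATUS\G
    -- wait until Exec_Master_Log_Pos equals Read_Master_Log_Pos (and
    -- Relay_Master_Log_File equals Master_Log_File), then compare these
    -- coordinates across all slaves and promote the most up-to-date one

    -- on the remaining slaves, after mapping their data to a position in the
    -- promoted master's own binary log (it must run with --log-slave-updates):
    CHANGE MASTER TO
      MASTER_HOST     = 'promoted-slave.example.com',
      MASTER_LOG_FILE = 'mysql-bin.000042',
      MASTER_LOG_POS  = 107;
    START SLAVE;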

Interestingly enough, Google has a solution for us here as well – the “log mirroring” patches, which make sure slaves have a copy of the logs as they are on the master.

Now what do you do in case of a hard crash, when data is lost on the master? This means the master’s data is gone, for example due to a RAID or disk failure, though it can also be something like InnoDB corruption or a soft crash you can’t recover from promptly enough.

In this case you would most typically plan recovery by switching to a slave (as described above) or to a standby server via DRBD or SAN.

As you can see, we have not mentioned recovering from backup so far. It will be needed in the worst case of data loss – trashed data that reaches all slaves via replication. This can be caused by a user or application error, or by a security breach.

What choices do you have in this case? Your main options are a backup or a slave with delayed replication (which you could already have set up with mk-slave-delay from Maatkit).

A delayed slave is especially helpful if the application can operate with just the master, as in this case you can switch very quickly (just skip the bad statements and let the slave catch up).
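A hedged example of keeping such a slave one hour behind with Maatkit (check your mk-slave-delay version for the exact option syntax; the host name is hypothetical):

    # hold the slave's SQL thread 1 hour behind the master, re-checking every 15s
    mk-slave-delay --delay 1h --interval 15s h=delayed-slave.example.com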

The main challenge with such a failure is that you have many trashed copies to deal with. If just one or a few small tables are corrupted, you can reload them. One option is to reload them on the master (and they will be replicated down to all slaves); the faster approach, however (especially if you have many tiers of replication), is to bring all slaves to the same point in time and load the data locally with SQL_LOG_BIN=0 set for the session.
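For example, a rough sketch of stopping every slave at an agreed-upon master position and then reloading one table locally without writing it to the binary log (file names and positions are hypothetical):

    -- on each slave: stop at the same master coordinates
    STOP SLAVE;
    START SLAVE UNTIL MASTER_LOG_FILE = 'mysql-bin.000123', MASTER_LOG_POS = 98765;

    -- then, in the session doing the reload on every server (master and slaves):
    SET SESSION SQL_LOG_BIN = 0;
    SOURCE /backups/customers_table.sql;   -- mysql client command
    SET SESSION SQL_LOG_BIN = 1;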

If a large portion of the data is trashed, you may need to recover the full database on all slaves, which is best done with a binary copy for large data sets. Such a global recovery can also put very high stress on your network and backup storage and take a lot of time. It may also be extremely difficult to fetch a large backup in a timely fashion over a long-distance network, meaning it is best to have a local backup (and a delayed slave, if you use one) in each data center you have.

The complexity of recovery is another “liability” of a complex replication tree setup. In contrast, sharded master-master pairs (or a master with a few slaves) are much easier to deal with.

When recovering data with MySQL replication you always have to keep replication positions in mind. For example, if you recover the master you need to recover the slaves to a matching snapshot – either it has to be the very same state (which is hard to manage) or you need to know the position on the master to which the backup corresponds. This becomes more complex if you have a deep replication hierarchy, as a slave only knows its position on its own master, not on the “root” master.
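One hedged way to keep that mapping is to record the master position as part of the backup itself, for example with mysqldump:

    # --master-data=2 writes the master's binlog file and position into the dump as a comment;
    # --single-transaction gives a consistent snapshot for InnoDB tables
    mysqldump --single-transaction --master-data=2 --all-databases > full_backup.sql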

Note there are also some solutions based on the “Continuous Data Protection” class of backups, which can be very helpful for going back in time with your data. One of the vendors offering a solution for MySQL is R1Soft, though I have not had a chance to look at it in detail.

What about slave loss? Losing a slave is normally less of a problem. You can re-clone the slave from the master or from another slave, or restore it from backup. So this is just a question of having decent capacity planning (such as being able to shut off 2 slaves and still operate normally), having an LVM setup if you want to avoid shutting down a slave or the master to clone the data, and making sure the binary logs on the master go far enough back that you can restore from several backup generations and do point-in-time recovery.
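A rough sketch of the LVM-based cloning approach (volume and host names are hypothetical, and the details depend on your layout):

    # in mysql: briefly block writes and record the position
    #   FLUSH TABLES WITH READ LOCK;  SHOW MASTER STATUS;
    # take the snapshot, then UNLOCK TABLES in the same mysql session
    lvcreate --snapshot --size 10G --name mysql_snap /dev/vg0/mysql_data
    # mount the snapshot and copy the data to the new slave
    mount /dev/vg0/mysql_snap /mnt/mysql_snap
    rsync -a /mnt/mysql_snap/ new-slave.example.com:/var/lib/mysql/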

Timing recovery is also important. Especially for write-intensive environments it may take many days to catch up from a weekly backup by replaying binary logs, so make sure to time it properly.
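Catching up is typically done by replaying the binary logs from the position recorded with the backup; a hedged sketch (file names, position and cut-off time are hypothetical):

    # replay everything after the backup position, stopping before the bad event
    mysqlbinlog --start-position=98765 --stop-datetime="2014-09-01 12:00:00" \
        mysql-bin.000123 mysql-bin.000124 | mysql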

In real life, environments can be even more complicated – one may use partial replication, replication to a different storage engine, or extra tables beyond those being replicated, all of which has to be accounted for in the recovery process.

It is also worth noting that beyond these 3 main recovery scenarios there are a number of other cases you have to deal with (which can often be resolved by one of these 3 procedures, though shortcuts are sometimes available) – for example master or relay binary log corruption, the master or a slave running out of space, a slave crashing (and losing its position on the master), or replication breaking (or getting out of sync) due to MySQL bugs or incorrect use.
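One such shortcut – only safe when you understand exactly which event is being skipped – is the classic skip counter on a broken slave:

    STOP SLAVE;
    SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1;  -- skip one offending event
    START SLAVE;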

Interestingly enough, very few people have their data recovery practices ironed out well enough to answer how they would handle at least these 3 data loss cases for each of the servers they have deployed. Even fewer have gone beyond theory and tested the processes, or have regular testing in place.

About Peter Zaitsev

Peter managed the High Performance Group within MySQL until 2006, when he founded Percona. Peter has a Master's Degree in Computer Science and is an expert in database kernels, computer hardware, and application scaling.

Comments

  1. We have deprecated the log mirroring (mirror binlog) code. It won’t be in the next Google patch. We will soon share code that implements global transaction IDs so that SQL commands can be used to map between (master log file, master log pos) and global transaction IDs. This will allow slave failover to be done by running a few SQL commands with no loss of transactions.

  2. peter says:

    Thanks Mark,

    Good to know. Indeed the global transaction IDs are better, though I guess they were harder to implement. Are you going to support only master-slave replication, or master-master and tree replication with filtering and writing to intermediate nodes as well?

  3. KirkH says:

    A bit off topic but I just did some benchmarking of InnoDB inserts on a RAMDrive and my inserts per second went from 95 to about 9,600. From the InnoDB Wikipedia article: “For typical hard drives or arrays, this will impose a limit of about 200 update transactions per second.”

    I ran IOMeter on the RAM Drive and got 80,000 random 50% read / 50% write IOPS, which is more than a RAMSAN 300. I’m looking at moving my app to the RAM Drive.

  4. peter says:

    It is indeed off topic, but why would you be surprised? Using spindle-free storage can improve performance dramatically, though it is also quite expensive. At this point it is so expensive that in many cases you may be better off having a lot of memory (128GB/box is getting affordable) and keeping data in memory – which is going to be faster still. I expect this to change as Flash technologies mature and come down in price though :)

  5. KirkH says:

    Thanks for the reply Peter. I did a lot of research on Flash SSD and came to the conclusion that random writes are often worse on SSDs than on conventional hard drives. I’m also not seeing the point of ACID compliance if the transactions aren’t written to a physical disk and the InnoDB log files are apparently the bottleneck here based on my testing today.

    I’m assuming that somehow there are ways to stay ACID compliant and get more than a few hundred inserts per second but I’ve not yet figured that out. Thanks for the blog!

  6. peter says:

    Well, when you’re looking at inserts it is likely reads that will limit you, not writes (which happen in the background). The “dumb” use of flash will indeed cause the issues you’re describing; however, there are solutions out there already which use various techniques to make random writes faster. Also, you can use NVRAM the same way as in RAID controllers to do fast ACID operations and write buffering. Anyway, what I’m saying is that flash may not be quite there yet, but it will come.

  7. @Peter – it will work with intermediate nodes because one of the goals is to support hierarchical replication. It does not support master/master. The code at a slave assumes the tree has a root, not two or more roots. It is possible to extend this to support master/master and other advanced features. I will leave that to the MySQL team.

  8. peter says:

    Good to know Mark.

    Do you have an idea what version it is scheduled for inclusion in the main MySQL tree?

  9. @Peter – no, maybe MySQL product management can tell us?

  10. peter says:
  11. peter says:

    Mark, I just figured out I replied to the wrong post. This was targeted at the one where you asked about the MySQL Contributors agreement.

    Regarding asking MySQL Management about release timing… I think you’re better off guessing yourself for the time being :)
