 As part of Percona Remote DBA for MySQL service we recognize that reliable backups are one of the most important things we can bring to the table. In my experience handling emergencies, the single worst thing that can happen is finding out you don’t have backups available when some sort of data loss or catastrophic event occurs.
As part of Percona Remote DBA for MySQL service we recognize that reliable backups are one of the most important things we can bring to the table. In my experience handling emergencies, the single worst thing that can happen is finding out you don’t have backups available when some sort of data loss or catastrophic event occurs.
With our Remote DBA service we can take care of backups for you, what follows are some of the internals of our implementation.
What kind of outages can happen?
- Someone runs UPDATE or DELETE and forgets the where clause or filters weren’t quite right
- The application had a bug causing data to be removed or overwritten
- A table (or entire schema) was dropped accidentally
- Your InnoDB table was corrupt and mysql shuts down
- Your server or RAID controller crashes and all data is lost on that server
- A disk failed, and RAID array does not recover
- You run into a InnoDB corruption bug that propagates via replication (not common, but does happen)
- You lose your entire SAN and all your DB servers were located there. Let’s hope your backups are somewhere else!
- You lose a PSU or network switch in your datacenter and some or all of your servers go down in that location
- Your entire datacenter loses power and the generators do not start, which happens more often than you might think
What tools do we use in Remote DBA?
- Percona XtraBackup for MySQL for binary backups
- mydumper for logical backups
- mysqlbinlog 5.6
- Amazon S3
- monitoring for all the above
Philosophy on backups
- It is a good idea to schedule both logical and binary backups. They each have their use cases and add redundancy to your backups. If there is an issue with your backup, it’s likely not to affect the other tool.
- Store your backups on more than one server.
- In addition to local copies, store backups offsite. Look at the cost of S3 or S3+Glacier, it’s worth the peace of mind!
- Test your backups, and if you have a test environment, load them there periodically. You can also spin up an EC2 instance to load your backups onto. In addition, you can binlog rollforward 24 hours of binlogs as a good test.
- Store your binlogs off your primary server so you can perform point in time recovery.
- Store your binlogs offsite for disaster recovery scenarios.
- Run pt-table-checksum periodically (i.e. once a month) and make sure your servers data stays consistent. Checksumming is important, as backups are typically pulled off a slave and it’s vital that it has the same data.
How do we use these components to give our customers reliable backups?
Think about the 10 example outages listed above. Each MySQL backup tool has it’s strong points given the conditions.
Percona XtraBackup for MySQL for binary backups.
- It can restore an entire server very fast. Often the limiter of how fast this can be restored to another server, is how fast you can transfer data over your network. If you have 1GB network and you have 1TB of data, it could take awhile.
- It can compress the DB on the fly
- It can backup a MySQL database server at approximately the maximum rate the server allows, given it’s IO system
- It can typically execute a backup with little to no major impact on the server. For example in xtrabackup 2.0.5+, the time taken for “FLUSH TABLES WITH RAED LOCK” is normally under 1 second.
- If you have a lot of non-transactional tables (i.e. myisam), use –rsync option. This will rsync a copy of all the frm files and all the MYD/MYI files. It then does a second rsync while under a global lock. This means where you may have been locked for hours where you had many non-transactional tables, now you can be locked sub-second. Even with innodb only this can greatly cut down on the lock time by syncing the frm files.
- Enable –slave-info when backing up from a slave so you know what the position you are in the master’s bin logs
- –compress option, compresses on the fly using qpress under the hood.
- Setting up new slaves
- When we lose an entire server due to hardware failure, corruption, etc
- When the majority of data on the server was lost. e.g. there is one primary schema and that schema was dropped. Basically when restoring may take less time that trying to load a logical backup.
Restoring your data from a MySQL backup is another topic. Piecing together data after accidental data loss is one of Percona’s specialties, and there are many different techniques depending on the scenario. I will go through some of these in detail in a future blog post.
Mydumper for logical backups
- Very fast for logical backups – compared to mysqldump
- Consistent backups between myisam and innodb tables. Global read lock only held until myisam tables are dumped.
- We are researching into how we could further improve lock times here when non-transactional tables are
 used
 
- We are researching into how we could further improve lock times here when non-transactional tables are
- Almost no locking, if not using myisam tables
- Built in compression
- Each table is dumped to a separate file. This is very important to make restoring single tables easy. You can quickly restore a single table, instead of restoring your entire backup just to find a tiny table you need. This is actually the most common type of restore needed, so it’s important to make this operation as painless as possible.
- Compressed mydumper typically 3x-5x smaller vs compressed xtrabackup
- Typically we upload mydumper backups to s3 vs xtrabackup given the time needed to upload/download. Though it depends on the available bandwidth and should be factored into your restore time.
Problems:
- You can’t rely on mydumper to dump schema’s. It does not handle views/triggers/procedures etc. Run with –no-schemas, instead use mysqldump for the schemas and rely on mydumper for data only.
- You will have to compile it yourself as binary packages aren’t distributed
- Be careful with importing a dump from a server running in a different timezone. We have a fix here.
- loop through each DB
- write out ALTER DATABASE DEFAULT CHARACTER SET <charset> to the schema file, putting in the current charset
- mysqldump … -d -R –skip-triggers, out to the schema file
- create a schema-post file that has the triggers # mysqldump … -d -t
 
- Load the schema file
- Run myloader –threads=x
- Load the schema-post file
- run with –kill-long-queries to avoid nasty problems with “FLUSH TABLES WITH READ LOCK”
- –compress, compresses tables per file and should typically be enabled by default. The time needed to uncompress is not a limiting factor on restore time when done inline.
- Restoring a single file
- Restoring a single schema or rolling forward a single schema to a point in time
- Restoring data while automatically replicating out to all slaves
mysqlbinlog 5.6
Last year Percona IT director Tamas Kozak had a great blog post that showed how mysqlbinlog in 5.6 could be used. With mysqlbinlog 5.6, you can now pull binary logs in real time to another server using “mysqlbinlog … –read-from-remote-server –raw –stop-never”
- Useful to mirror the binlogs on the master to a second server.
- Allows you to roll forward backups even after losing the master
- Very useful for disaster recovery.
- You can have your backups in S3 and mysqlbinlog –stop-never running on a small ec2 instance. This can allow for a very low cost disaster recovery plan to ensure you will not lose data even in the worst case scenarios.
- Takes very little resources to run, can run about anywhere with disk space and writes out binlog files sequentially.
- Ensure it stays running, restart it if it appears to be hanging
- Verify the file is the same on master and slave
- Re-transfer files that are partially transferred
- Compress the files after successful transfer
Amazon S3 for MySQL
- s3cmd – we have been using the version from github, Mostly for multi-part upload support. This prevents us from having to split files up before uploading to S3.
- There is released alpha version of this version here
- You can now set bucket lifecycle properties so data over X days is archived to Glacier and data over Y days is removed. This is very convenient feature and allows you to cost effectively store long term backups with little additional work
- –add-header=x-amz-server-side-encryption:AES256 to use the server side encryption feature which helps with some types of compliance. We also have the capability to encrypt all files with gpg prior to upload via a separate script
- use_https = True, especially if your data is not encrypted before transfer
Monitoring
- Monitoring is the most important piece to tie all of these process together. We employ nsca nagios alerts for all of the backup processes.
- freshness_threshold should be set so if your nsca hasn’t checked in within a certain period it will alert you. For example if you backup the MySQL database once a day a good threshold could be 36 hours.
- For our mysqlbinlog processes, we have it sending nsca alerts every 30 seconds and have it alert when nothing has been received for 15 minutes -> 1 hour
- If backups throw an error and are aborted, we send a critical alert immediately to be investigated
- The number one cause of backup alerts are due to problems with “FLUSH TABLE WITH READ LOCK”. Namely when a select is blocking the flush from completing and queuing all requests behind it. Our current solution to deal with this issue is we have a guardian process that runs during a backup. It then kills any process that causes a stall of the flush. We are also researching into other ways that could improve this in the future.
Other details on Percona Remote DBA for MySQL backup systems for future posts
- Detailed strategies for different types of restores
- Strategies on retention dailies weeklies, long term backups
- Decompressing Percona XtraBackup for MySQL in parallel using all your resources available
- Downloading from s3 in parallel
- Parallel encryption/description
- Hardlinking of backups. Given both our mydumper and xtrbackup are seperated by file, for files that don’t change they can be easily hardlinked to typically save 20-80% of space locally
 
 
 
 
						 
						 
						 
						 
						 
						
There is one very important aspect that is not covered. Restoring from logical backups of ainnodb example set
Of databases could take a very long amount of time.
Due to indexes usualky.
Dumping and saving is fast.
Restoring is very slow.
And when the bottleneck is the disk i/o on the restore you
Can’t do paralel stuff.
What is your opinion on this ?
Should a backup be held by dumping and restoring on a temporary backup server, just in case a crash occurs on the original
One ?
Any remarks on backups on a XtraDB cluster setup?
Xtrabackup is great – having slightly slow backups is no problem (because you mostly don’t care how long it takes, so long as it’s less than a day!), but the fast restore times are great when you’re in a bad place. It’s all very well having xtrabackup do compression, but in order to check the backup (prior to restore), you need to apply logs, and you can’t do that on a compressed copy. It would be very useful to have the apply logs step to happen before the backup is compressed as otherwise you have to compress, uncompress, apply logs, recompress, which is very wasteful. I’m keen on doing this as in the past xtrabackup has had bugs at just that point that meant that backups would not restore. Is it possible for xtrabackup to reorder that sequence so it applies logs before compressing?
I am entry level SQL.
But this is very informative.
Thank you.
Mike, I agree logical backups can take a long time to load. Some factors that affect speed of restore include innodb_log_file_size, innodb_buffer_pool_size, innodb insert buffer (default to 50% of buffer pool), and of course available disk IO. There are also several other innodb setting that affect the number of background threads doing IO etc. With SSD’s in particular using the multi-threaded capabilities of myloader can result in much faster loads. With conventional drives it may not speed things up nearly as much. I recommend testing restores using different number of restore threads and measure time taken. Compare this to the time taken to doing a binary restore so you know when it’s appropriate to begin one type vs the other. Remember the primary purpose of logical restores are when retrieving a subset of data such as a single table or a single schema which is only a fraction of the size of the overall data. Measure the time taken to know which method will work best given the circumstances.
Dimine, the main thing with XtraDB cluster is the time taken with “FLUSH TABLES WITH READ LOCK”, it’s important that the time the lock is held is reduced as much as possible. For XtraBackup the main factors that will reduce this time are:
* using the –rsync option to sync frm files and any non innodb tables
* use XtraBackup 2.0.5+
For both XtraBackup and logical backups it’s important to make sure no long running queries are executing at the time when the “FLUSH TABLES WITH READ LOCK” begins. For example you may want to remove it from the list of servers receiving traffic during the backup window.
Mike, one other note is you can greatly speed up restores by using –innodb-optimize-keys option of mysqldump in Percona Server. Basically you create the table minus secondary keys to load the data, then add the secondary keys at the end. Though this option is currently incompatible when using foreign keys https://bugs.launchpad.net/percona-server/+bug/1081016
Marcus, it’s not possible for xtrbackup to apply the logs before compressing. Some options are:
* don’t use compression, assuming you have the available disk space, then compress the copies older than 1 day later. Obviously this incurs a lot more disk IO.
* periodically test your xtrabackup’s, including the apply phase (not daily) and alert on the status in your monitoring system.
* use logical backups for redundancy incase there are problems with apply phase. At least you have other options though slower if you need to restore the entire server.
Thanks for this information …