Where the open source community meets: Secure your spot for Percona Live Amsterdam! - Register

Downloads

Blog

How would you compress your MySQL Backup

June 6, 2008

Author

Peter Zaitsev

Insight for DBAs

Share this Post:

Backing up MySQL Database most people compress them – which can make a good sense in terms of backup and recovery speed as well as space needed or be a serious bottleneck depending on circumstances and approach used.

First I should mention this question mainly arises for medium and large size databases – for databases below 100GB in size compression performance is usually not the problem (though backup impact on server performance may well be).

We also assume backup is done on physical level here (cold backup, slave backup, innodb hot backup or snapshot backup) as this is only way practical at this point for databases of decent size.

Two important compression questions you need to decide for backup is where to do compression (on the source or target server if you backup over network) and which compression software to use.

Compression on source server is most typical approach and it is great, though it takes extra CPU resources on the source server in additional to IO resources which may not be available, especially for CPU bound MySQL Load. The benefit in this case is less space requirement if you’re keeping the local copy as well as less network bandwidth requirements in case you’re backing up to network storage.

Compression on the destination server offloads source server (though it may run our of CPU itself, if it is target for multiple backups, plus there are higher network bandwidth requirements to transfer uncompressed backup.

What is about compression tool ? The classical tool used for backup compression is gzip – it exists almost everywhere, it is stable and relatively fast.

In many cases however it is not fast enough and becomes the bottleneck for all the backup process.

Recently I did a little benchmark compressing 1GB binlog file with GZIP (compression done from OS cache and redirected to /dev/null so we only measure compression speed). On the test box with Intel(R) Core(TM)2 Duo CPU E4500 @ 2.20GHz CPU. GZIP would compress this file in
48 seconds (with default options) resulting in 260MB compressed file. This gives us compression speed of about 21MB/sec – clearly much less than even single SATA hard drive can read sequentially. This file when will take about 10 seconds to decompress, meaning source file will be read at 26MB/sec to do decompression – this is again much less than hard drive sequential read performance, though the fact this gives us about 100MB/sec of uncompressed data writing is more of the issue.

Such performance also means if your goal is faster local network transfer default GZIP compression will not speed things up on the standard point to point 1Gbit network connection.

If we try gzip -1 to get fastest compression we get the same file compressed to 320MB in 27 seconds. This gives us 37MB/sec which is a lot better but still not quite enough. Also note the serious leap in compressed file size. Though in this example we used MySQL binary log file which often contains plenty of similar events, which could be the reason for so large size difference based on compression ratio. The decompression takes about same 10 seconds which gives about 32MB/sec of archive read speed and same 100MB/sec of uncompressed data.

Do we have any faster alternatives to GZIP ? There are actually quite a few but I like LZO which I was playing with since later 1990’s and which is rather active project. There is also GZIP like command like compressor using LZO library called LZOP which makes it easy drop in replacement.
I got LZOP binary which was built against LZO 1.0, more resent version 2.0 promises further performance improvements especially on 64bit systems.

With LZO default compression file compressed in 10.5 seconds and resulted in 390MB compressed file, this gives us 97MB/sec compression speed which is good enough to compress all data you can read from single drive. The file decompresses in 3.7 seconds which gives 105MB/sec read speed from archive media and 276MB/sec write speed to the hard drive – this means restoring from backup compressed with LZO will often be as fast or faster as from not compressed one.

With LZO there is also “-1” option for even faster compression which had rather interesting results. The file compressed in 10.0 seconds (102MB/sec) and was 385MB in size – so this lower compression rate actually compressed this a bit better while being about 5% faster. The decompression speed was about the same. I’m sure the results may change based on the data being compressed but it looks like LZO uses relatively fast compression by default already.

With real server grade CPU deployment the performance should be even better, meaning you should get over +-100MB/second you can pass through 1Gbit ethernet, meaning you actually can use LZO compression for faster data transfer between the boxes (ie together with netcat)

Now as in my benchmarks there is also overhead of reading (from file cache) and piping to the /dev/null which are constant the true difference in compression speed is even larger, though as most of backup operations will need reading and writing anyway they come with this static overhead naturally added.

UPDATE: It looks like people are wondering how BZIP2 compares so I should check it before I delete this particular file. BZIP compression for this file took 298 seconds which is just 3.4 MB/sec though compressed file was just 174MB in size. Decompression took 78 sec which means compressed data was read at 2.2MB/sec and result was generated with 13 MB/sec.

For all archivers it is possible to use parallel compression to get better speed though this also means a higher load which can be the issue if you’re not using dedicated server for backups.

I should also note for mysqldump backup typically tools with better and slower compression make sense because it takes longer to dump and much longer to load to the database anyway so overral compression impact is less than for physical level backup.

0 0 votes

Article Rating

29 Comments

Oldest

Newest Most Voted

Petya

18 years ago

with lzop I lose 30% in backup size:
6G by lzop compared to 4G by gzip

trying lzop -9, but it works way too slow,
than lzop -3 (default)

George

18 years ago

I really do not have issues with the backup and compression as it really does not affect that much (20-30 minutes on a ~30 GB database and ~16 GB dump file size for the whole process done by a php script). The issue arises when restoring from that dump file. In any case I`ve encountered I`ve had problems with restoring InnoDB tables larger then 2KK rows. The restore takes as long as 3 hours on the same files and the downtime causes a serious bottleneck as there is always new data to be inserted or updated. In cases requiring restore from backup we have a slow server issue for up to 8 hours (should mention that we add data to the database on a minute-size time line so we have scripts running always).
There would be interesting to have a case study on a medium database restore with different server settings. I was planning to have this done at home but unfortunately the time is getting shorter now. I`m curious what would happen if server variables are changed and how should they be changed in order to obtain a faster restore. What should there be optimized to obtain this? I m positive the server has to be set differently then on normal runtime.
I will do such a study in the future hopefully.

Jan

18 years ago

I’ve been reading this website for some time now … and though I only dream about working with such a large amount of data, where every single piece of code affects the overall speed (therefore I can play a lot to tune the speed of things up), I find informations found here very satisfying, interesting and I always look forward to see the new “tales from the land of real databases” 🙂

thanks … great work.

ps: sorry for my offtopic

Steve

18 years ago

I guess we’re sacrificing space for speed here rather. I’d be interested if you could redo these benchmarks again with the LZO 2.0 instead of the old 1.0 version.

gslin

18 years ago

You may try pigz (parallel gzip), it compresses parallelly.

Mat

18 years ago

You may also try the LZOP patch to enable SMP 🙂
http://lemley.net/lzop_patches/lzop.html

Baron Schwartz

18 years ago

George, Peter is referring to file copies of the physical database files, not dumping with mysqldump. Your issues will only become worse as the databases grow, which is why Peter said a physical backup is the only option at larger sizes.

Mark Callaghan

18 years ago

But the behavior George describes is how online backup will support InnoDB (select from all tables to dump, insert to all tables to restore). Not sure if that will be popular.

Sheeri

18 years ago

We have clients with large databases (>100Gb) using the Innodb Hot Backup solution with no problems, and using the compression within the hot backup tool. Just as a data point.

Antonio

18 years ago

If you want the best possible compression ratio go with mysqldump + bzip. I can squeeze a 130gb database into about 6.5gb. Compression boils down to trading CPU computation for space savings. You can emysqldump first to minimize backup time, then compress. The secret is to use bzip2smp or some other parralelized bzip implementation. bzip2smp is fast enough for me that it is not the bottleneck (on my 8 core server).

I get almost 50% better compression of mysqldump files with bzip2smp and it runs twice as fast as gzip.

Mark Callaghan

18 years ago

ibbackup works great for me on datasets >= 100GB. I have a wish list and I can’t implement it myself as the source is closed. I want to avoid staging files on disk during backup (write to a socket). I know people who have done magic with LD_PRELOAD, but I prefer a simpler solution. I also want parallel IO (many threads) running for both backup and restore to copy in/out the large files.

Author

Peter Zaitsev

18 years ago

Petya,

Sure. LZO compression is lower but at least it is getting rid of all this “whitespace” you have in the database. You can use bzip2 if you need high compression though it is very slow.

Space is cheap, so my main point is often doing backups quickly and with limited impact on production as well as being able to restore them quickly.

Author

Peter Zaitsev

18 years ago

Steve,

I just happen to see latest lzop for my OS to come with lzo 1.x I can’t right now spend time on getting it work with LZO 2.0 and in any case you should see this as some ballpark figures – a lot will depend on exact server configuration, data being compressed etc.

Author

Peter Zaitsev

18 years ago

Mark,

Sure if Innodb/Oracle will not implement real backup plugin for MySQL Online Backup it will be pretty much worthless. Though I guess it is tough call for them as Innodb Hot Backup sales can be significant.

Author

Peter Zaitsev

18 years ago

Sheeri,

The question is how fast backup and restore can take place, in general any compression would work.

Author

Peter Zaitsev

18 years ago

Mark – you have probably enough experience in the Innodb code now to re implement Open Source backup version. It actually should not be that bad as it just have to keep log copied and selectively apply them on the data set copy. I remember Brad Fitzpatrick looked into implemented it but decided to keep Heikki and Innodb friends by not undercutting business from them 🙂

Mark Callaghan

18 years ago

Peter,
That would only happen if they abandoned the code. I am not a big fan of hot backup in general. I prefer that backups get taken from a slave, and at that point hot backup is not needed. There are so many options that I don’t understand why MySQL is building online backup (ibbackup, filesystem snapshots, take a backup from a slave).

Author

Peter Zaitsev

18 years ago

Right. If you backup from the slave cold backup is probably the most failsafe.

Backup from slave is good if you can ensure you’re not backing up slave which is out of sync w master for whatever reason.
I’ve seen many times things getting out of sync without noticing for a long while and backup in such case is dangerous.

Though putting backup load on the master is often not the option either.

Pat

18 years ago

Got a question about snapshot type backup:

If I snapshot a myisam file system that is in use, don’t I get a ‘corrupt’ file system? It’s entirely possible that I’ll snapshot the system with a variety of operations “half” done isn’t it e.g. I’ll have the 4 of the 7 data blocks for an update written out but none of the index updates, etc.

Likewise, if it takes 5 minutes to do the snapshot, won’t the blocks taken at the top of the snapshot match up wrong with the blocks taken at the bottom (which represent a database state 5 minutes later)?

I keep seeing snapshot solutions offered as a way to do hot backups, but I can’t figure out how to do it w/o either A) shutting down the database (cold backup), or B) introducing corruption.

What’s the trick?

Author

Peter Zaitsev

18 years ago

Pat,

If you’re using MyISAM you need to do:

FLUSH TABLES WITH READ LOCK;

UNLOCK TABLES

This means there will be short (or long) read only moment but there will be no MyISAM corruption.

BTW – tools like mylvmbackup already can do this for you and implement various tricks to make the read only time shorter.

Kyo

18 years ago

I consider bzip2 obsolete for all practical purposes, at least as far as my machines are involved. LZMA is so much faster (especially at decompressing) while often compressing even much better than bzip2 that the only use of bzip2 that I still have is for exchanging data with the outside world. If you are not too picky about the format (I am not, at least for archival purposes) and very demanding as far as speed is concerned, p7zip wonderfully exploits multiple cores for both compression and decompression – try it on a quad core machine. 😉 Other than that, GNU tar now also integrates LZMA, as far as I know. And SUSE, for example. decided to go with LZMA in RPM payload in the upcoming release of their distro. The packages are often much smaller and install somewhere from two to three times faster.

jeffatrackaid

18 years ago

We are usually pulling backups from slaves who’s main purpose is backups, so I’ve not done may benchmarks. But on some standalone boxes where I have to compress a large amount of data, I sometimes nice my process to a friendly level. Not sure if this really makes much of an impact, just a habit I developed. There is another interesting tool called 7zip (http://www.7-zip.org/). Windows and Linux friendly. I’ve not benchmarked it for speed but the native format does very well on HTML files — pushed 3GBs of HTML into just 30MB.

Mamatha

18 years ago

I wanted to know the compression ratio of LZO. IF I have a file of 12 MB , what is the size of the compressed file?

Author

Peter Zaitsev

18 years ago

Mamatha,

The compression ration depends on data a lot. For all compressors I tested I specified the result file size. Raw file was 1GB.

art

18 years ago

What about rar?

joel

18 years ago

I actually ran into this problem a few days ago. While taring a lvm snapshot one of my cores was pegged at 100%. After a bit of googleling I found the above mentioned pigz. Using it I cut my backup time to 1/4 of the original time. Backups with pigz are restorable with regular old gzip. There is also a parallel bzip2 out there, pbzip2, but I did not try it as it does not accept stdin.

Lasse Reinhold

18 years ago

I’m creating a high performance file archiver, qpress, which uses asynchronous I/O (reader and writer thread). On Windows it’s also non-buffered to prevent the cache of other server applications from being flushed. It’s using the QuickLZ 1.40 compression library and is intended for speed, not good compression ratio.

I’d be glad for any feedback on performance, bugs or ideas for new feature. Download a prototype at:

http://www.quicklz.com/qpress.zip

Serge Knystautas

17 years ago

I saw the LZO SMP reference, so thought I’d look up the GZIP equivalent…

http://lemley.net/mgzip.html

My servers are all multicores and I’m wondering with my next backup if I can’t accelerate GZIP significantly by spreading is across multiple cores and get closer to that 100MB/s SATA speed.