Disaster: LVM Performance in Snapshot Mode

In many cases I speculate how things should work based on what they do, and in a number of cases this leads me to form too good an impression of a technology, only to run into a completely unanticipated bug or performance bottleneck. This is exactly the case with LVM.

A number of customers have reported that LVM imposes a very high penalty when snapshots are enabled (let alone if you try to run a backup at the same time), so I decided to look into it.

I used the sysbench fileio test, as our concern in this case is general IO performance – it is not something MySQL related.

I tested things on RHEL5, on a RAID10 volume with 6 hard drives (BBU disabled), though the problem can be seen on a variety of other systems too (I just do not have comparable numbers for all of them).
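For the snapshot runs I created a snapshot of the benchmark volume before the test and removed it afterwards – roughly like this; the volume group, names and snapshot size here are just placeholders, not my exact setup:

# snapshot of the origin volume; the size only needs to cover the blocks
# changed while the snapshot exists
lvcreate --snapshot --size 2G --name benchsnap /dev/vg0/data

# ... run the benchmark against the origin volume /dev/vg0/data ...

# drop the snapshot to get back to normal performance
lvremove -f /dev/vg0/benchsnap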

O_DIRECT RUN

The performance without an LVM snapshot was 159 io/sec, which is quite expected for a single thread and no BBU. With an LVM snapshot enabled the performance was 25 io/sec, which is about 6 times lower!
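The run itself is a plain single-threaded sysbench random write test with O_DIRECT, something along these lines (file size and run length here are illustrative rather than the exact parameters):

sysbench --test=fileio --file-total-size=16G prepare
sysbench --test=fileio --file-total-size=16G --file-test-mode=rndwr \
  --file-extra-flags=direct --file-fsync-freq=0 \
  --num-threads=1 --max-time=60 --max-requests=0 run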

I honestly do not understand what LVM could be doing to make things this slow – the COW should require 1 read and 2 writes, or maybe 3 writes (if we assume metadata is updated each time). That would predict roughly 159/3 ≈ 53 io/sec, or ~40 io/sec with the metadata write – how could it ever reach 6 times?

It looks like it is time to dig further into LVM internals, and well… maybe I’m missing something here – I do not have good insight into what is really happening inside, just how it looks from the user’s side.

Interestingly enough, vmstat confirms the 1 read and 2 writes theory:

As you can see there are about twice as many writes as reads.

SMALL FILE RUN

Next I decided to check how things improve when writes come to the same place over and over again – my assumption was that the overhead would gradually go to zero as all pages become copied, so writes can just proceed normally.

With this run I got approximately 200 io/sec without the LVM snapshot enabled, while with the snapshot I got:

33.20 Requests/sec executed
46.08 Requests/sec executed
70.79 Requests/sec executed
123.68 Requests/sec executed
157.66 Requests/sec executed
163.50 Requests/sec executed

(All were 60 second runs)
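The small file run is the same random write test pointed at a single file that is much smaller than the volume, repeated back to back. A sketch of it – the 64MB size is just my rounding of the ~400MB / 6x figure mentioned below, and is illustrative:

sysbench --test=fileio --file-num=1 --file-total-size=64M prepare
for i in 1 2 3 4 5 6; do
  sysbench --test=fileio --file-num=1 --file-total-size=64M \
    --file-test-mode=rndwr --file-extra-flags=direct --file-fsync-freq=0 \
    --num-threads=1 --max-time=60 --max-requests=0 run
done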

As you see, the performance indeed improves, though significant overhead still remains. The progress is much slower than I would anticipate. Before the last run about 400MB in total had been written to the file (random writes), which is 6x the file size, and yet we still saw some 20% regression compared to the run with no snapshot.

NO O_DIRECT RUNS

As you might know, O_DIRECT often takes quite a special path in the Linux kernel, so I did a couple of other runs. The first run syncs after each request instead of using O_DIRECT:
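Roughly like this – the same test with the direct flag dropped and an fsync after every request (sizes again illustrative):

sysbench --test=fileio --file-total-size=16G --file-test-mode=rndwr \
  --file-fsync-all=on \
  --num-threads=1 --max-time=60 --max-requests=0 run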

This run gave 162 io/sec without the snapshot and 32 io/sec with the snapshot. The numbers are a bit better than with O_DIRECT, but the gap is still astonishing.

The final run I did emulates how InnoDB does buffer pool flushes – calling fsync every 100 writes rather than after each request:
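Again just a sketch – this is essentially the sysbench default of fsync every 100 requests, without O_DIRECT:

sysbench --test=fileio --file-total-size=16G --file-test-mode=rndwr \
  --file-fsync-freq=100 \
  --num-threads=1 --max-time=60 --max-requests=0 run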

This gets some 740 req/sec without the snapshot and 240 req/sec with the snapshot. In this case we get close to the expected 3x difference.

The numbers are much higher in this case because, even though we have one thread, the OS is able to submit multiple requests at the same time (and the drives can execute them) – I expect that if there were a BBU in this system we would see similar results for the other runs as well.

So creating an LVM snapshot can indeed cause tremendous overhead – in the benchmarks I’ve done it ranges from 3x to 6x. It is, however, worth noting this is the worst-case scenario – many workloads have writes going to the same locations over and over again (e.g. InnoDB circular logs), in which case the overhead is quickly reduced. Still, it takes some time, and I would expect any system doing writes to experience a “performance shock” when an LVM snapshot is created, with greatly reduced capacity which then improves as fewer and fewer pages actually need to be copied on write.

Because of this behavior you may consider not starting the backup instantly after the LVM snapshot is created, but allowing it to settle a bit before the further overhead of data copying is added.
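In script form the idea is simply to put a pause between taking the snapshot and starting the copy. All names, paths and the settle time below are placeholders to adjust for your system (the usual steps to get a consistent MySQL snapshot are omitted here):

lvcreate --snapshot --size 10G --name mysqlsnap /dev/vg0/mysql
sleep 600    # let the initial copy-on-write "shock" subside
mount -o ro /dev/vg0/mysqlsnap /mnt/backup
tar czf /backup/mysql-$(date +%F).tar.gz -C /mnt/backup .
umount /mnt/backup
lvremove -f /dev/vg0/mysqlsnap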

The question I had is: how can LVM backups work for so many users? The reality is that for many applications write speed is not so critical, so they can sustain this performance drop, in particular during slow times (which often have 2x-3x lower performance requirements).

So we’ll do some research around LVM and I hope to do more benchmarks – for example, I’m very curious how good the read speed from the snapshot is (in particular sequential file reads).

If you’ve done some LVM performance tests yourself, or will be repeating mine (parameters posted), please let me know.


Comments

  • Nils

    I think you can use a separate device as COW device, I’m thinking of a ram disk or small SSD, this should lower the write load on the main disk array.

    February 5, 2009 at 11:54 pm
  • peter

    Nils, actually it was a separate device in this case (though it was not an SSD or memory). I would expect the overhead to still be significant, because to do the copy you need to read the old data, so you’re looking at an extra read on the same block device for each write. It is especially bad because writes can often be buffered and happen in the background, while reads have to happen at once.

    February 6, 2009 at 12:36 am
  • Lenz Grimmer

    Thanks for the analysis, Peter! This seems to confirm the impression that I had from my experiences with mylvmbackup. Have you investigated whether the performance drops even more if multiple snapshots are active at the same time? I wonder why the Linux LVM snapshot implementation performs so badly in comparison to ZFS snapshots (which seem to add only very little overhead).

    LenZ

    February 6, 2009 at 3:34 am
  • Nils

    Lenz:
    ZFS uses copy-on-write transactions for every write, so it can simply retain the old block for the snapshot instead of discarding it, so there is less additional overhead involved.

    February 6, 2009 at 4:02 am
  • Baron Schwartz

    Peter, I think we should NOT assume the performance shock goes away after the COW space isn’t getting more writes… We can run some more benchmarks to see.

    1) Just make a snapshot on a volume that’s not getting any writes. Measure whether there is a read performance penalty now on either the original or the snapshot volume. If there is, it would look like a bug to me.

    2) Make a snapshot, write a bunch of stuff, then stop writing and repeat step 1). You would now expect a read penalty on the snapshot volume but not the original volume.

    February 6, 2009 at 7:16 am
  • peter

    Baron,

    I did some benchmarks – see the results for the small file – as the whole file got copied into the snapshot, the performance regression was reduced dramatically.

    February 6, 2009 at 8:37 am
  • http://scale-out-blog.blogspot.com/

    Hi Peter, thanks for this analysis. We are working on operating MySQL on Amazon and should be doing some performance analysis of Elastic Block Store (EBS), which–you guessed it–is enabled for snapshots. I’ll make a point of rerunning your sysbench cases. In this case our comparison would need to be EBS vs. local file systems, which so far as I know are not SAN-based and should provide a reasonable reference for comparison.

    February 6, 2009 at 9:53 am
  • http://scale-out-blog.blogspot.com/

    Oops, that last comment had my website instead of a name. Silly me. Robert

    February 6, 2009 at 9:54 am
  • peter

    Lenz, Nils

    ZFS works differently. LVM, as I understand it, does in-place updates, storing the previous version in the snapshot space, while ZFS generally writes data to new locations. It is very interesting to me how (if at all) ZFS is able to deal with the fragmentation which happens if you do not update in place (whether in snapshot mode or not).

    With LVM – I’m really curious if it is possible to have it operate in a more relaxed mode. Really – if we do not care about the snapshot surviving a crash (and we do not in the case of backups), it should be possible to make things much more optimal by buffering writes to the undo space and doing large sequential writes even for random IO.

    Also, with the rise of SSDs a whole other thing happens – random IO is not expensive any more, so writing to a new location on update would not slow down scans.

    Actually I would really like to see SSDs providing hardware snapshot capabilities natively – they keep the old data around for a while anyway and could simply provide access to it and allow purging to be paused.

    February 6, 2009 at 11:49 am
  • Dieter_be

    What do you mean by “BBU disabled”? No write-back caching?

    February 6, 2009 at 12:28 pm
  • peter

    Dieter,

    Right. No write-back caching on this one. Though results are similar with cache.

    February 6, 2009 at 1:12 pm
  • Matthew Kent

    Very interested in the read numbers as well. Please post 🙂

    February 6, 2009 at 1:24 pm
  • Perrin Harkins

    Can you suggest an alternative for consistent backups that works for both InnoDB and MyISAM tables?

    February 6, 2009 at 1:47 pm
  • Coway

    LVM is not as bad as you think. Neither is RAID 5.

    It is not a surprise to see the performance drop in Peter’s test. But you cannot judge something based on one aspect. This also applies to testing: test results vary a lot with the tools you choose. In brief, the test results from sysbench may be biased, or limited, if you do not compare them to results from other tools.

    I used bonnie, bonnie++, and iozone to test on our hardware (IBM x3650, or Dell PE). The test results are sometimes consistent across tools, and sometimes the opposite. So you need to make a thorough comparison and choose the right plan for your environment.

    The following is part of my test results:

    To generate a 16G file using 8G RAM in Bonnie++

    ./bonnie++ -r 8000 -s 16000 -u root

    Config         file size(MB)  Seq Output (Per Char / Block / Rewrite)  Seq Input (Per Char / Block)  Random Seeks (/sec)  Time to finish
    RAID 10, LVM   16000          83 / 173 / 120                           101 / 1670                    1629                 9'42"
    RAID 10        16000          88 / 178 / 167                           97 / 2777                     1630                 9'31"
    RAID 5         16000          88 / 232 / 124                           99 / 330                      1359                 10'28"
    RAID 5, LVM    16000          85 / 233 / 122                           99 / 350                      1359                 10'35"

    It shows LVM adds some overhead, but it really can be neglected. Even RAID 5 has better performance compared to RAID 10 (see sequential output block)!

    iozone gives even more statistics. It lists the numbers for Write, Read, Random Write, Random Read, Re-write, and Re-read. Let’s just pick Random Write and Random Read because of their importance:

    RAID 10 with LVM vs. RAID 10 only:

    Random Write:
    For small file (<16k) creation, not much difference; for large file creation, LVM causes the speed to drop from 1300000 to 900000.

    Random Read:
    For small file (<16k) creation, not much difference; for large file creation, LVM causes the speed to drop from 3000000 to 2000000.

    After a painful comparison and testing, I finally chose RAID 5 + LVM for the MySQL server.
    I chose RAID 5 instead of RAID 10 since I didn’t see much gain from RAID 10 in my situation and I want more disk space. LVM’s overhead doesn’t bother me too much, and I use LVM snapshots to back up 200G of databases.

    In production, RAID 5 + LVM (IBM x3650, database size 200G) gives me 7000 qps of inserts at peak, compared to 9000 qps of inserts from the previous RAID 10 without LVM (Dell PowerEdge, database size 600G). We scaled out the database so the new database is 1/3 of the original. You can argue they are different hardware with different database sizes and I am comparing apples to bananas. Anyway, we are happy with the results from RAID 5 + LVM so far.

    February 6, 2009 at 2:21 pm
  • peter

    Coway,

    sysbench is specifically implemented to emulate IO for MySQL/InnoDB, which is why I use it.

    Also, what did you compare here? Did you compare just a raw partition to LVM? In that case LVM really adds very little overhead. Or did you compare LVM to LVM with a snapshot enabled for the target partition?

    February 6, 2009 at 11:30 pm
  • Coway

    Peter,
    You are right. I just compared a raw partition to LVM. I mainly use snapshots for MySQL backups and I expect the backup time to be as short as possible (say less than 1 hour), so the slowness while a snapshot of the target partition is enabled is kept short. I will check the performance penalty while a snapshot is enabled.

    February 7, 2009 at 6:36 pm
  • peter

    Coway,

    The snapshot is the point. The backup time may be short, but if your write IO capacity drops to 1/5th of normal during this time, it just may not be enough to handle the load. It is fine if you have a night when the system is rather idle, but not all systems are like that.

    February 7, 2009 at 7:52 pm
  • John Laur

    I love it when people say “LVM is fine because I ran Bonnie!” — LVM is a complete dog with snapshots. This is one of those things where it is absolutely immediately obvious to anyone who has ever tried to use it for this purpose. I am seriously surprised how many distros simply enable LVM as a matter of course these days and how many tools are built on top of this poor performer. In my own experience there are only two storage solutions (it’s not exactly proper to call LVM or any of these other “filesystems”) that hold up when using snapshots currently – ZFS and WAFL.

    February 8, 2009 at 10:40 am
  • John Laur

    I should add “that I have tested” – I think Veritas and Oracle probably have filesystems that do a good job also, but I haven’t looked. EMC’s snapshots are garbage too, though they only give you about a 10-15% hit, so not nearly as bad as LVM.

    February 8, 2009 at 10:42 am
  • Kevin Burton

    This is our experience with LVM….

    It can be DOG slow when a snapshot is live. We’ve avoided this problem by first doing an InnoDB warmup when the DB starts up. When we need to clone a new slave, reading the snapshot and the COW operations can happen internally, as the warmed-up InnoDB buffer pool allows our DB to avoid having to do writes while catching up on replication.

    Though, this would be one benefit of having the InnoDB buffer pool use the linux page cache. Transferring the snapshot from the LVM image would come from the cache without having to touch the disk.

    … it probably goes without saying, but the performance improvement you saw was because the COW was being completed and no more blocks needed to be copied from the source disk.

    February 8, 2009 at 10:42 pm
  • peter

    Perrin,

    A filesystem-based snapshot may have lower overhead. You can also check out R1Soft Backup. Some people have had good success with it; others reported performance and data restore problems though.

    February 9, 2009 at 12:34 am
  • Baron Schwartz

    So based on these benchmarks, Peter do you think it’s worth revisiting the “InnoDB I/O freeze” patch idea? What I think would be good is to freeze datafile I/O and let the log writes continue so InnoDB’s operation isn’t stopped. So we ought to see if the patch can be extended to freeze either/or type of writes.