Disaster: LVM Performance in Snapshot Mode

February 6, 2009

Author

Peter Zaitsev

Benchmarks

Hardware and Storage

I often speculate about how things should work based on what they do, and in a number of cases this has led me to form too good an impression of a technology, only to run into a completely unanticipated bug or performance bottleneck. This is exactly the case with LVM.

A number of customers have reported that LVM imposes a very high penalty when snapshots are enabled (let alone if you try to run a backup at the same time), so I decided to look into it.

I used the sysbench fileio test, as our concern in this case is general IO performance; it is not something MySQL-specific.

I tested on RHEL5 with a RAID10 volume of 6 hard drives (BBU disabled), though the problem can be seen on a variety of other systems too (I just do not have comparable numbers for all of them).

O_DIRECT RUN

/tmp/sysbench --test=fileio --num-threads=1 --init-rng=on --max-time=60 --file-num=1 --file-total-size=8G --file-extra-flags=direct --file-test-mode=rndwr run

Without an LVM snapshot the performance was 159 io/sec, which is about what I would expect for a single thread with no BBU. With an LVM snapshot enabled the performance was 25 io/sec, which is about 6 times lower!

I honestly do not understand what LVM could be doing to make things so slow. COW should require 1 read and 2 writes, or maybe 3 writes (if we assume metadata is updated each time), but how could it ever reach 6 times?

It looks like it is time to dig further into LVM internals and, well… maybe I’m missing something here. I do not have good insight into what is really happening inside, just how it looks from the user’s side.

Interestingly enough, vmstat supports the 1 read and 2 writes theory:
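To make the expectation concrete, here is a back-of-the-envelope sketch in Python. It assumes (as a simplification for illustration, not a measured model of LVM) that each application write under a snapshot turns into a fixed number of equally expensive device IOs, so throughput simply divides by that number. The function name and constants are hypothetical; only the 159 and 25 io/sec figures come from the run above.

```python
# Naive copy-on-write cost model (an assumption for illustration only):
# every application write becomes `ios_per_write` device IOs of equal cost.
BASELINE_IOPS = 159.0   # measured without snapshot (O_DIRECT run)
OBSERVED_IOPS = 25.0    # measured with snapshot enabled


def expected_iops(baseline, ios_per_write):
    """Throughput predicted by the naive model."""
    return baseline / ios_per_write


for label, ios in [("1 read + 2 writes", 3), ("1 read + 3 writes", 4)]:
    print(f"{label}: expected ~{expected_iops(BASELINE_IOPS, ios):.1f} io/sec")

print(f"observed: {OBSERVED_IOPS:.0f} io/sec, "
      f"slowdown {BASELINE_IOPS / OBSERVED_IOPS:.1f}x")
```

Under this model the snapshot run should land somewhere around 40 to 53 io/sec, so the observed 25 io/sec is well below what simple IO counting would predict.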

 0  1      0 24132256  73252 8248788    0    0   259   590 1271  531  0  0 92  8  0
 0  1      0 24135976  73284 8244964    0    0   413   938 1427  761  0  0 87 12  0
 0  1      0 24139572  73308 8241300    0    0   399   905 1412  736  0  0 87 12  0
 0  1      0 24143416  73352 8237396    0    0   409   927 1416  739  0  0 87 12  0

As you can see there are about twice as many writes as reads.
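The ratio can be checked directly from the rows above. This is a small sketch that assumes the standard vmstat column layout, in which the ninth and tenth fields are `bi` (blocks read in) and `bo` (blocks written out); the header line is not shown in the post, so that mapping is an assumption.

```python
# vmstat rows copied from the snapshot run above. Assumed layout:
# r b swpd free buff cache si so bi bo in cs us sy id wa st
VMSTAT_ROWS = """
 0  1      0 24132256  73252 8248788    0    0   259   590 1271  531  0  0 92  8  0
 0  1      0 24135976  73284 8244964    0    0   413   938 1427  761  0  0 87 12  0
 0  1      0 24139572  73308 8241300    0    0   399   905 1412  736  0  0 87 12  0
 0  1      0 24143416  73352 8237396    0    0   409   927 1416  739  0  0 87 12  0
"""


def write_read_ratios(text):
    """Return bo/bi (blocks written per block read) for each vmstat row."""
    ratios = []
    for line in text.strip().splitlines():
        fields = line.split()
        blocks_in, blocks_out = int(fields[8]), int(fields[9])
        ratios.append(blocks_out / blocks_in)
    return ratios


ratios = write_read_ratios(VMSTAT_ROWS)
for ratio in ratios:
    print(f"bo/bi = {ratio:.2f}")
```

Every row comes out a little above 2 writes per read, which is consistent with one read of the old data plus two writes per application write.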

SMALL FILE RUN

Next I decided to check how things improve when writes come over and over again to the same place. My assumption was that the overhead would gradually go to zero as all pages get copied, so that writes can proceed normally.

/tmp/sysbench --test=fileio --num-threads=1 --init-rng=on --max-time=60 --file-num=1 --file-total-size=64M --file-extra-flags=direct --file-test-mode=rndwr run

With this run I got approximately 200 io/sec without an LVM snapshot enabled, while with the snapshot I got:

33.20 Requests/sec executed
46.08 Requests/sec executed
70.79 Requests/sec executed
123.68 Requests/sec executed
157.66 Requests/sec executed
163.50 Requests/sec executed

(All were 60 second runs)

As you can see, the performance indeed improves, though significant overhead still remains. The progress is much slower than I would have anticipated. Before the last run about 400MB in total had been written to the file (random writes), which is about 6x the file size, and yet we still saw some 20% regression compared to the run with no snapshot.
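The 400MB figure, and why the remaining regression is surprising, can be reproduced with a little arithmetic. The sketch below assumes sysbench's default 16KB fileio block size (the commands above do not set it) and uniformly random writes; the snapshot chunk size is a separate parameter that this simple model ignores.

```python
FILE_SIZE = 64 * 1024 * 1024   # --file-total-size=64M
BLOCK_SIZE = 16 * 1024         # assumed: sysbench fileio default block size
RUN_SECONDS = 60               # --max-time=60
RATES_BEFORE_LAST_RUN = [33.20, 46.08, 70.79, 123.68, 157.66]  # req/sec

# Total requests and data written in the five runs before the last one.
requests = sum(RATES_BEFORE_LAST_RUN) * RUN_SECONDS
written_mib = requests * BLOCK_SIZE / 2**20
blocks_in_file = FILE_SIZE // BLOCK_SIZE

# With uniformly random writes, the chance that a given block was never
# hit after `requests` writes is (1 - 1/blocks)^requests.
untouched = (1 - 1 / blocks_in_file) ** requests

print(f"written before last run: {written_mib:.0f} MiB "
      f"({written_mib * 2**20 / FILE_SIZE:.1f}x the file size)")
print(f"expected share of blocks never written: {untouched:.2%}")
```

Under these assumptions only a fraction of a percent of the file should still be waiting for its first copy, so leftover copy-on-write work alone does not obviously explain a regression of roughly 20%.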

NO O_DIRECT RUNS

As you might know, O_DIRECT often takes quite a special path in the Linux kernel, so I did a couple of other runs. The first run syncs after each request instead of using O_DIRECT:

/tmp/sysbench --test=fileio --num-threads=1 --init-rng=on --max-time=60 --file-num=1 --file-total-size=8G --file-fsync-freq=1 --file-test-mode=rndwr run

This run gave 162 io/sec without snapshot and 32 io/sec with snapshot. The numbers are a bit better than with O_DIRECT but the gap is still astonishing.

The final run emulates how InnoDB would do buffer pool flushes, calling fsync every 100 writes rather than after each request:

/tmp/sysbench --test=fileio --num-threads=1 --init-rng=on --max-time=60 --file-num=1 --file-total-size=8G --max-requests=100000000 --file-fsync-freq=100 --file-test-mode=rndwr run

This gets some 740 req/sec without snapshot and 240 req/sec with snapshot. In this case we get close to the expected 3x difference.

The numbers are much higher in this case because, even though we have one thread, the OS is able to submit multiple requests at the same time (and the drives can execute them). I expect that if this system had a BBU we would see similar results for the other runs.

So creating an LVM snapshot can indeed cause tremendous overhead: in the benchmarks I’ve done it ranges from 3x to 6x. It is worth noting, however, that this is the worst-case scenario. Many workloads have writes going to the same locations over and over again (e.g. InnoDB circular logs), and in this case the overhead will quickly reduce. Still, it takes some time, and I would expect any system doing writes to experience a “performance shock” when an LVM snapshot is created, with greatly reduced capacity which then improves as fewer pages actually need to be copied on write.

Because of this behavior you may consider not starting backups immediately after the LVM snapshot is created, but allowing it to settle a bit before the further overhead of data copying is added.

The question I had is how LVM backups could work for so many users. The reality is that for many applications write speed is not so critical, so they can sustain this performance drop, in particular during slow times (which often have 2x-3x lower performance requirements).

So we’ll do some research around LVM and I hope to do more benchmarks. For example, I’m very curious how good read speed from the snapshot is (in particular sequential file reads).
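For reference, the slowdown factors for the three 8G runs can be tabulated from the numbers reported above. This is only a summary sketch; the run labels are mine.

```python
# (req/sec without snapshot, req/sec with snapshot) for each 8G run
RUNS = {
    "O_DIRECT": (159, 25),
    "fsync after each write": (162, 32),
    "fsync every 100 writes": (740, 240),
}


def slowdown(without_snapshot, with_snapshot):
    """How many times slower the run is once a snapshot exists."""
    return without_snapshot / with_snapshot


for name, (base, snap) in RUNS.items():
    print(f"{name:<24} {base:>4} -> {snap:>4} req/sec "
          f"({slowdown(base, snap):.1f}x slower)")
```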

If you’ve done some LVM performance tests yourself or will be repeating mine (parameters posted) please let me know.

55 Comments

Nils

17 years ago

I think you can use a separate device as COW device, I’m thinking of a ram disk or small SSD, this should lower the write load on the main disk array.

Author

Peter Zaitsev

17 years ago

Nils, actually it was a separate device in this case (though it was not an SSD or memory). I would expect the overhead to still be significant, because to do the copy you need to read the old data, so you’re looking at an extra read on the same block device for each write. It is especially bad because writes can often be buffered and happen in the background, while reads have to happen at once.

Lenz Grimmer

17 years ago

Thanks for the analysis, Peter! This seems to confirm the impression that I had from my experiences with mylvmbackup. Have you investigated whether the performance drops even more if multiple snapshots are active at the same time? I wonder why the Linux LVM snapshot implementation performs so badly in comparison to ZFS snapshots (which seem to add only very little overhead).

LenZ

Nils

17 years ago

LenZ:
ZFS uses copy-on-write transactions for every write, so it can simply retain the old block for the snapshot instead of discarding it, so there is less additional overhead involved.

Baron Schwartz

17 years ago

Peter, I think we should NOT assume the performance shock goes away after the COW space isn’t getting more writes… We can run some more benchmarks to see.

1) Just make a snapshot on a volume that’s not getting any writes. Measure whether there is a read performance penalty now on either the original or the snapshot volume. If there is, it would look like a bug to me.

2) Make a snapshot, write a bunch of stuff, then stop writing and repeat step 1). You would now expect a read penalty on the snapshot volume but not the original volume.

Author

Peter Zaitsev

17 years ago

Baron,

I did some benchmarks (see the results for the small file): once the whole file had been copied to the snapshot, the performance regression reduced dramatically.

http://scale-out-blog.blogspot.com/

17 years ago

Hi Peter, thanks for this analysis. We are working on operating MySQL on Amazon and should be doing some performance analysis of Elastic Block Store (EBS), which–you guessed it–is enabled for snapshots. I’ll make a point of rerunning your sysbench cases. In this case our comparison would need to be EBS vs. local file systems, which so far as I know are not SAN-based and should provide a reasonable reference for comparison.

http://scale-out-blog.blogspot.com/

17 years ago

Oops, that last comment had my website instead of a name. Silly me. Robert

Author

Peter Zaitsev

17 years ago

Lenz, Nils

ZFS works differently. LVM, as I understand it, does in-place updates, storing the previous version in the snapshot space, while ZFS generally writes data to new locations. I am very interested in how (if at all) ZFS is able to deal with the fragmentation which happens if you do not update in place (whether in snapshot mode or not).

With LVM, I’m really curious whether it is possible to have it operate in a more relaxed mode. Really, if we do not care about the snapshot surviving a crash (and we do not in the case of backups), it is possible to make things much more optimal by buffering writes to the undo space and doing large sequential writes even for random IO.

Also, with the rise of SSDs a whole other thing happens: random IO is not expensive any more, so writing to a new location on update would not slow down scans.

Actually I would really like to see SSDs provide hardware snapshot capabilities natively. They keep the old data for a while anyway, and could just provide access to it and allow purging to be paused.

Dieter_be

17 years ago

What do you mean by “BBU disabled”? No write-back caching?

Author

Peter Zaitsev

17 years ago

Dieter,

Right. No WriteBack caching on this one. Though results are similar with cache.

Matthew Kent

17 years ago

Very interested in the read numbers as well. Please post 🙂

Perrin Harkins

17 years ago

Can you suggest an alternative for consistent backups that works for both InnoDB and MyISAM tables?

Coway

17 years ago

LVM is not as bad as you think. Neither is RAID 5.

It is not a surprise to see the performance drop in Peter’s test. But you cannot judge something based on one aspect. This also applies to testing. Test results vary a lot with the tools you choose. In brief, the test results from ‘sysbench’ may be biased, or limited, without comparing them to results from other tools.

I used bonnie, bonnie++ and iozone to test on our hardware (IBM x3650 or Dell PE). Sometimes the test results are consistent across tools, and sometimes different tools give opposite results. So you need to make a thorough comparison and choose the right plan for your environment.

The following is part of my test results:

To generate a 16G file using 8G RAM in Bonnie++

./bonnie++ -r 8000 -s 16000 -u root

Config | File size (MB) | Seq. output: Per Char | Seq. output: Block | Seq. output: Rewrite | Seq. input: Per Char | Seq. input: Block | Random seeks (/sec) | Time to finish
RAID 10, LVM | 16000 | 83 | 173 | 120 | 101 | 1670 | 1629 | 9’42”
RAID 10 | 16000 | 88 | 178 | 167 | 97 | 2777 | 1630 | 9’31”
RAID 5 | 16000 | 88 | 232 | 124 | 99 | 330 | 1359 | 10’28”
RAID 5, LVM | 16000 | 85 | 233 | 122 | 99 | 350 | 1359 | 10’35”

It shows that LVM adds some overhead, but it is really negligible. RAID 5 even has better performance than RAID 10 (see sequential output block)!!!

iozone gives even more statistics. It lists the numbers for Write, Read, Random Write, Random Read, Re-write and Re-read. Let’s just pick Random Write and Random Read because of their importance:

RAID 10 with LVM vs. RAID 10 only:

Random Write:
For small file (<16k) creation there is not much difference; for large file creation, LVM causes the speed to drop from 1300000 to 900000.

Random Read:
For small file (<16k) creation there is not much difference; for large file creation, LVM causes the speed to drop from 3000000 to 2000000.

After painful comparison and testing, I finally chose RAID 5 + LVM for the MySQL server.
I chose RAID 5 instead of RAID 10 since I didn’t see much gain from RAID 10 in my situation and I want more disk space. LVM’s overhead doesn’t bother me too much and I use LVM snapshots to back up 200G databases.

In production, RAID 5 + LVM (IBM x3650, database size 200G) gives me 7000 qps insert at peak, compared to 9000 qps insert from the previous RAID 10 without LVM (Dell PowerEdge, database size 600G). We scaled out the database so the new database is 1/3 of the original. You can argue that they are different hardware with different database sizes and that I am comparing apples to bananas. Anyway, we are happy with the results from RAID 5 + LVM so far.

Author

Peter Zaitsev

17 years ago

Coway,

SysBench is specifically implemented to emulate IO for MySQL/InnoDB, which is why I use it.

Also, what did you compare here? Did you compare a raw partition to LVM? In that case LVM really adds very little overhead. Or did you compare LVM to LVM with a snapshot enabled for the target partition?

Coway

17 years ago

Peter,
You are right. I just compared a raw partition to LVM. I mainly use snapshots for MySQL backup and I expect the backup time to be as short as possible (say less than 1 hour), so the period of slowness while a snapshot of the target partition is enabled can be kept short. I will check the performance penalty while a snapshot is enabled.

Author

Peter Zaitsev

17 years ago

Coway,

The snapshot is the point. The backup time may be short, but if your write IO capacity drops to 1/5th of normal during this time it just may not be enough to handle the load. It is fine if you have a night period when the system is rather idle, but not all systems are like that.

John Laur

17 years ago

I love it when people say “LVM is fine because I ran Bonnie!” — LVM is a complete dog with snapshots. This is one of those things where it is absolutely immediately obvious to anyone who has ever tried to use it for this purpose. I am seriously surprised how many distros simply enable LVM as a matter of course these days and how many tools are built on top of this poor performer. In my own experience there are only two storage solutions (it’s not exactly proper to call LVM or any of these other “filesystems”) that hold up when using snapshots currently – ZFS and WAFL.

John Laur

17 years ago

I should add “that I have tested” — I think probably veritas and oracle have filesystems that do a good job also, but I haven’t looked. EMC’s snapshots are garbage too, though only give you about a 10-15% hit; so not nearly as bad as LVM

Kevin Burton

17 years ago

This is our experience with LVM….

It can be DOG slow when a snapshot is live. We’ve avoided this problem by first doing an InnoDB warmup when the DB starts up. When we need to clone a new slave reading the snapshot and COW operations can happen internally as the InnoDB buffer pool allows our DB to avoid having to do writes while catching up on replication.

Though, this would be one benefit of having the InnoDB buffer pool use the linux page cache. Transferring the snapshot from the LVM image would come from the cache without having to touch the disk.

… it probably goes without saying but the performance improvement you saw was because the COW was being completed and no more blocks were on the source disk.

Author

Peter Zaitsev

17 years ago

Perrin,

File system based snapshots may have lower overhead. You can also check out R1Soft Backup. Some people have good success with it; others have reported performance and data restore problems, though.

Baron Schwartz

17 years ago

So based on these benchmarks, Peter do you think it’s worth revisiting the “InnoDB I/O freeze” patch idea? What I think would be good is to freeze datafile I/O and let the log writes continue so InnoDB’s operation isn’t stopped. So we ought to see if the patch can be extended to freeze either/or type of writes.

Coway

17 years ago

I tested the R1Soft software seriously. It was horrible on backup speed. This happened one year ago; I am not sure whether R1Soft has improved it or not. In my testing, the backup speed over the network was merely 1 to 5 MB/s, much slower even than a simple mysqldump. Plus, the backup cannot be completely restored due to some bugs. R1Soft admitted the bottleneck and the problems. It is not a good fit for a large MySQL database. I haven’t kept track of R1Soft’s recent progress. I won’t choose it without seeing dramatic improvements.

Nils

17 years ago

As we are diverging away further from the LVM topic, how well does the official InnoDB backup tool perform? After all, it’s not that expensive.

Coway

17 years ago

The InnoDB backup tool is cool on one side: real online backup and fast, as fast as mysqldump without any blocking. It is still insufficient for backing up large MySQL databases with a high volume of transactions (say insert qps ~ 4000 or higher). The InnoDB backup tool asks you to apply the InnoDB log after the backup, before you restore the data. A real example is: I backed up 100G or so of data, the InnoDB log file was about 20G, and the time for applying the log file was 36 hours. The numbers from my memory may not be accurate, but the time used was real, and too long for my patience.

Kevin Burton

17 years ago

Hey Baron…..

Regarding concurrent InnoDB freeze (which I guess is a good description)…

One would need large log files. It is also possible to start a freeze near the END of the log files which would mean that the writes might have to block until the transaction finished (unless one were smart enough to pay attention to the point where it started the backup).

Baron Schwartz

17 years ago

“A real example is: I backed up 100G or so data, the innodb db log file was about 20G, and the time for applying log file was 36 hours.”

I wonder if this suffers from the same redo/rollback performance hit due to the use of a linked list. If so, it may be fixable.

Author

Peter Zaitsev

17 years ago

Kevin – In general Innodb Freeze approach is limited to some applications only. You need large logs… and also the amount of modified data set is limited by the buffer pool.

I think what needs to happen is just a proper Open Source backup tool for Innodb. There are a lot of things to be improved in innodb hot backup to make it more efficient – I bet even more than with Innodb itself as the community eyes never looked at the sources.

Author

Peter Zaitsev

17 years ago

Coway, it is a question of how much RAM you allow InnoDB hot backup to use. There is also the “bug” in InnoDB recovery which makes it very CPU inefficient. Also I think a lot can be optimized for applying very large log files… InnoDB uses the same process as during recovery, which is optimized for handling a relatively small number of redo records.

Kevin Burton

17 years ago

Peter….. I realize that the InnoDB freeze approach is limited to the size of the logs.

I also realize that LVM is limited because you have to allocate space before hand to handle the COW.

So both have essentially the same limitations.

The problem with InnoDB logs being that using large log files is going to cause recovery to take FOREVER…

If this problem is fixed then it can open up the door to using the InnoDB freeze approach.

Coway

17 years ago

Peter,

I played with the RAM setting when using InnoDB hot backup. Using more RAM made it slower. The default is the best on my test machine. I did the log applying on a standalone box. Generating huge log files during backup seems like a defect.

Author

Peter Zaitsev

17 years ago

Kevin,

Innodb recovery will still depend on log file size. We can make it more optimal but it is not going to change the general approach. With LVM yes you need to allocate space.

Baron Schwartz

17 years ago

With LVM however, you take the write penalty hit only once per block; if a block is written multiple times it’s not re-copied to the undo space. That isn’t true of InnoDB logs.
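The “only once per block” behaviour can be illustrated with a toy model. This sketch is not how LVM is implemented; it assumes that the first write to a chunk after the snapshot costs 3 IOs (read old data, write it to the COW area, write the new data) and every later write to that chunk costs 1 IO. The class name, chunk count and round sizes are hypothetical.

```python
import random


class ToySnapshotVolume:
    """Toy copy-on-write volume: a chunk is copied on its first write only."""

    def __init__(self, num_chunks):
        self.num_chunks = num_chunks
        self.copied = set()  # chunks already preserved in the COW area

    def write(self, chunk):
        """Return the number of device IOs this write costs in the toy model."""
        if chunk in self.copied:
            return 1
        self.copied.add(chunk)
        return 3


def simulate(num_chunks=4096, writes_per_round=4096, rounds=6, seed=1):
    """Issue uniformly random writes and report average IOs per write per round."""
    rng = random.Random(seed)
    volume = ToySnapshotVolume(num_chunks)
    avg_costs = []
    for _ in range(rounds):
        ios = sum(volume.write(rng.randrange(num_chunks))
                  for _ in range(writes_per_round))
        avg_costs.append(ios / writes_per_round)
    return avg_costs, len(volume.copied)


avg_costs, copied = simulate()
for number, cost in enumerate(avg_costs, 1):
    print(f"round {number}: {cost:.2f} IOs per write")
print(f"chunks copied: {copied} of 4096")
```

In this model the average cost per write starts well above 2 IOs and falls towards 1 as more chunks have already been copied, which is the shape of the improvement seen in the small file run.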

Peter Vajgel

17 years ago

There are basically two snapshot technologies: copy-on-write (COW) and the pointer-based block-map technologies of the log structured file systems (WAFL, ZFS). With log structured filesystems you get snapshots practically for free, but there are other disadvantages like fragmentation. The traditional filesystems use COW technology on various levels: in the volume manager (LVM, VXVM) or in the filesystem (VxFS). If you use LVM snapshots the choice of the filesystem can be quite important. I ran the “randomio” benchmark on an ext3 filesystem and compared it with xfs. The difference with an LVM snapshot is big. “randomio” opens its target with O_DIRECT and does random io with multiple threads. The write percentage controls the ratio of reads/writes done by each thread. My setup was 6 SAS 10K drives with 512MB write-back cache in the controller in RAID1+0 configuration. First the raw numbers: 100 threads, 20% writes, 4K iosize. Iops reported are iops per second.

[[email protected] /home/pv/work/src/randomio-1.3]# ./randomio /dev/mapper/VolGroup20-mysqldata 100 0.2 0 4096 60
total | read: latency (ms) | write: latency (ms)
iops | iops min avg max sdev | iops min avg max sdev
--------+-----------------------------------+----------------------------------
1793.9 | 1433.7 0.1 69.6 2853.9 82.4 | 360.2 0.1 0.1 4.2 0.1
1675.6 | 1340.0 0.1 74.6 1596.9 88.0 | 335.6 0.1 0.1 2.2 0.0

[[email protected] ~]# lvcreate --size 60g --snapshot --name snap /dev/VolGroup20/mysqldata
Logical volume "snap" created

1379.3 | 1104.6 0.1 68.7 1560.2 86.1 | 274.7 0.1 87.0 1555.4 178.1
1061.3 | 848.4 0.1 42.7 544.8 43.8 | 212.9 0.1 300.9 865.7 154.5
1082.1 | 863.4 0.1 43.2 572.5 45.5 | 218.7 0.1 286.7 950.5 145.4
1077.3 | 863.8 0.1 43.0 771.1 45.0 | 213.5 0.1 292.6 795.3 147.4

So that’s our base. Let’s try the same with xfs on a large file:

[[email protected] ~]# mount
/dev/mapper/VolGroup20-mysqldata on /data type xfs (rw,noatime,allocsize=1g)

[[email protected] ~]# ls -l /data
total 335544320
-rw-r--r-- 1 root root 343597383680 Sep 9 16:05 foo

[[email protected] /home/pv/work/src/randomio-1.3]# xfs_bmap -l /data/foo
/data/foo:
0: [0..173950911]: 320..173951231 173950912 blocks
1: [173950912..341722991]: 183500864..351272943 167772080 blocks
2: [341722992..509495071]: 367263808..535035887 167772080 blocks
3: [509495072..671088639]: 550502464..712096031 161593568 blocks

[[email protected] /home/pv/work/src/randomio-1.3]# ./randomio /data/foo 100 0.2 0 4096 60
total | read: latency (ms) | write: latency (ms)
iops | iops min avg max sdev | iops min avg max sdev
--------+-----------------------------------+----------------------------------
1887.2 | 1508.4 0.1 66.2 1694.0 77.1 | 378.8 0.1 0.1 2.9 0.1
1678.6 | 1341.6 0.1 74.4 1584.6 87.0 | 337.0 0.1 0.1 1.1 0.0
1670.9 | 1334.9 0.1 75.0 1567.0 88.0 | 336.0 0.1 0.1 1.0 0.0

[[email protected] ~]# lvcreate --size 60g --snapshot --name snap /dev/VolGroup20/mysqldata
Logical volume "snap" created

1592.2 | 1271.3 0.1 74.7 1401.9 89.3 | 320.8 0.1 14.7 1801.1 95.8
1073.0 | 859.6 0.1 43.5 682.1 44.5 | 213.4 0.1 294.2 796.2 150.9
1090.7 | 872.6 0.1 42.9 593.9 44.0 | 218.1 0.1 287.3 717.1 146.0

Good: xfs matches raw, with or without snapshot. Now let’s look at ext3 with the same size file:

[[email protected] /home/pv/work/src/randomio-1.3]# ./randomio /data/foo 100 0.2 0 4096 60
total | read: latency (ms) | write: latency (ms)
iops | iops min avg max sdev | iops min avg max sdev
--------+-----------------------------------+----------------------------------
149.2 | 119.3 4.8 668.1 1042.5 54.1 | 30.0 5.2 660.9 1048.3 64.6
164.8 | 131.8 1.8 608.0 740.7 43.6 | 32.9 512.3 603.9 734.5 36.5
181.4 | 145.6 8.0 552.3 687.0 41.5 | 35.8 419.2 547.1 671.7 36.6
197.6 | 157.7 6.8 507.8 1118.2 67.9 | 39.9 0.1 498.9 1078.5 67.3
224.3 | 180.3 4.5 448.2 647.6 45.3 | 43.9 0.1 439.5 623.6 47.1
256.9 | 204.2 5.1 390.9 521.1 42.9 | 52.7 0.2 383.4 505.7 43.1
295.4 | 236.6 3.0 340.5 474.3 44.4 | 58.9 0.1 330.5 461.9 46.4
350.3 | 279.4 5.0 287.5 419.8 43.2 | 70.9 0.1 277.6 401.6 43.8
421.5 | 336.8 1.6 239.7 414.8 44.0 | 84.7 0.1 228.2 394.2 45.6
534.2 | 426.8 3.1 190.0 348.7 41.2 | 107.4 0.1 176.1 310.1 40.5
687.7 | 550.7 2.1 148.4 349.6 39.1 | 137.0 0.1 133.4 282.8 38.9
941.7 | 752.9 1.9 111.2 366.5 43.3 | 188.7 0.1 86.5 243.8 41.3
1258.9 | 1005.9 1.6 87.0 603.9 54.9 | 253.0 0.1 49.3 388.4 45.7
1498.6 | 1200.7 0.1 77.3 1056.2 72.1 | 297.9 0.1 24.1 559.9 46.9
1645.6 | 1313.2 0.1 73.6 1323.9 83.3 | 332.4 0.1 10.0 941.0 39.7
1692.2 | 1354.6 0.1 73.0 1436.4 86.1 | 337.6 0.1 3.2 640.0 26.9
1721.1 | 1377.5 0.1 72.3 1921.1 85.1 | 343.6 0.1 1.2 568.4 18.9
1721.9 | 1379.7 0.1 72.4 1562.9 83.7 | 342.2 0.1 0.2 69.7 2.4
1705.9 | 1364.8 0.1 73.3 1559.8 85.1 | 341.1 0.1 0.1 0.7 0.0
1693.7 | 1357.2 0.1 73.6 1480.5 87.1 | 336.5 0.1 0.1 2.1 0.0
1696.2 | 1355.7 0.1 73.8 1520.0 86.4 | 340.4 0.1 0.1 1.5 0.0

[[email protected] ~]# lvcreate --size 60g --snapshot --name snap /dev/VolGroup20/mysqldata
Logical volume "snap" created

626.7 | 502.0 0.1 168.7 2581.4 192.8 | 124.7 0.1 121.5 2471.1 203.7
310.3 | 248.7 6.2 321.2 529.4 61.9 | 61.6 7.0 326.0 518.9 62.9
305.5 | 244.0 2.1 326.1 595.8 61.2 | 61.5 10.0 331.3 579.2 61.1
313.3 | 251.2 4.9 319.0 503.5 60.2 | 62.1 12.8 322.3 496.6 59.9
300.5 | 239.7 4.1 331.4 1207.4 81.3 | 60.7 10.0 335.7 1201.1 83.5
312.7 | 250.8 4.1 318.6 587.2 65.7 | 61.8 10.0 324.8 574.1 67.0
309.3 | 247.5 3.3 322.5 540.3 58.9 | 61.8 9.0 326.3 517.9 57.3

What’s going on? I’ve glanced over the code real quick and this is my speculation. There are 2 issues with ext3. The first one is that it is not an extent based filesystem. It is block based. What that means is that there are lots of indirect address blocks addressing your data blocks and they don’t have to be all cached if they are not accessed frequently enough. So a read (or a write) might require a disk read in order to map your logical offset to the actual physical offset in the file. That means extra io’s before you even get to read/write your actual data. xfs (and vxfs and zfs) are extent based filesystems and the block maps of large files can fit into an inode itself if the filesystem is not severely fragmented so mapping your offset is very fast (you can preallocate space in vxfs and in xfs – I use a mount option allocsize=1g).
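To give a feel for how much mapping metadata a block-mapped file of this size carries, here is a rough calculation. It assumes a 4KB ext3 block size (not stated above) and uses the classic ext3 layout of 12 direct pointers in the inode and 32-bit block numbers; only the leaf-level indirect blocks are counted.

```python
FILE_SIZE = 343597383680   # bytes, the size of /data/foo shown by ls -l above
BLOCK_SIZE = 4096          # assumed ext3 block size
POINTER_SIZE = 4           # ext3 block numbers are 32-bit
DIRECT_POINTERS = 12       # block pointers stored in the inode itself

data_blocks = FILE_SIZE // BLOCK_SIZE
pointers_per_block = BLOCK_SIZE // POINTER_SIZE

# Leaf-level indirect blocks only; the handful of double and triple
# indirect blocks that point at them are ignored here.
blocks_needing_indirection = data_blocks - DIRECT_POINTERS
leaf_indirect_blocks = -(-blocks_needing_indirection // pointers_per_block)  # ceiling division
mapping_mib = leaf_indirect_blocks * BLOCK_SIZE / 2**20

print(f"data blocks:          {data_blocks}")
print(f"leaf indirect blocks: {leaf_indirect_blocks}")
print(f"mapping metadata:     {mapping_mib:.0f} MiB")
```

Under these assumptions the file needs on the order of 320 MiB of indirect blocks spread across the disk, compared with the four extents that `xfs_bmap` reports for the same file on xfs.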

The second problem is the level of parallelism a filesystem allows when it comes to competing reads/writes. I believe that ext3 is taking an exclusive rwlock on each write and doesn’t release it through the whole io. On the other hand xfs (unless you are growing the file) takes and holds the lock in a shared mode if the file is opened in O_DIRECT mode. This can have big consequences.

Look at the slow start of ext3 – it took nearly 15 minutes to come to the raw speeds. Why is that? Look at the write latencies – at the beginning they are big – which we would not expect with write-back cache. I believe that they are so big because the indirect address blocks were not in the cache. So what should be a very fast write under normal circumstances (write-back cache) turns into a read and a write while holding the exclusive rwlock – blocking all the readers at the same time. Eventually (in 15 mins or so) all of the indirect address blocks are cached and ext3 resumes the “raw” performance.

But once the snapshot is created it plummets again. It’s the same reason – a write turned into a COW read and a write will hold the exclusive rwlock for much longer than if it didn’t have to do a COW operation and so it chokes readers once again. Bigger write ratio has even a more drastic effect on ext3 + snapshot performance.

“randomio” might not be a good representation of what MySQL is doing, so we are currently testing xfs with MySQL. But if my speculations are valid then I would expect a smaller degradation with xfs.

Author

Peter Zaitsev

17 years ago

Peter,

Thanks for posting results.

Kostas Georgiou

17 years ago

What was the chunksize used for the snapshot? A small one (4k?) probably makes more sense for a random IO pattern.

Brian Sneddon

17 years ago

If you bypass lvcreate and just configure the snapshot in the device-mapper manually you can designate the device as persistent or not. If it’s not persistent then it won’t write out metadata to the disk… if that’s where the bottleneck is in the code then that may reveal it. It may be possible to do it through lvcreate, but I’ve just never tried it myself.

Author

Peter Zaitsev

17 years ago

Kostas,

In this case it was the default. If you can suggest some parameters which should help performance, let me know and we will try them.

Author

Peter Zaitsev

17 years ago

Brian,

Yeah I will try it again when I have the chance…. also will see with XFS.

The problem is that, based on the discussion, we’re still looking at 3x the IOs (one read and 2 writes), which is already too much.

Rilson Nascimento

17 years ago

Hi Peter,

Have you tried Zumastor snapshot and replication tools?

Zumastor is an enterprise storage server for Linux. It keeps all snapshots for a particular volume in a common snapshot store, and shares blocks the way one would expect. Thus making a change to one block in a file in the original volume only uses one block in the snapshot store no matter how many snapshots you have (verbatim from Zumastor how-to).

LVM snapshots design has the surprising property that every block you change on the original volume consumes one block for each snapshot. The resulting speed and space penalty makes the use of more than 1 or 2 snapshots at a time impractical (verbatim from Zumastor how-to page as well).

Author

Peter Zaitsev

17 years ago

Rilson,

In this case we’re just using one snapshot (which is enough for database backup) and it does not work.

I have not tried Zumastor – if you can rerun my benchmarks and post results that would be quite interesting.

17 years ago

@Peter Vajgel:
I am wondering how Ext4 would compete with XFS: Extents and multiblock + delayed allocation.
http://kernelnewbies.org/Ext4

Timothy Denike

16 years ago

I found similar benchmark results (6:1) under a single-threaded test, but the random write results approached 2:1 as I increased thread concurrency to 10. (Presumably because the read on the COW scales with concurrency, while writes do not.) Seems to me the penalty could be the single-threaded read. Is this a valid test to profile InnoDB load, or are all InnoDB writes going to be serialized through a single thread?

Anon Coward

16 years ago

Wow, thanks for the thread folks, I’ve found it educational.

As a Solaris guy running Veritas vxfs for the past decade and now ZFS, I have recently begun to find customers running mysql on linux, with and without LVM. It looks like LVM has a bit of catching up to do!

Andreas

15 years ago

Hi all,
I just stumbled over this article while searching for a solution to a very, very bad impact on InnoDB performance after taking a snapshot.
Interestingly, we get an acceptable (expected) penalty as soon as the snapshot is created. BUT: as soon as we start to read the InnoDB tablespace file from the snapshot, performance decreases dramatically.

Does anybody have a hint on how to debug such a situation? I don’t have an explanation for such a bad impact.

Best regards
Andreas

Author

Peter Zaitsev

15 years ago

Andreas,

Yes… The performance when you copy from the snapshot often just drops to the floor. You can try a different IO scheduler, but in general I do not know of a perfect solution.
Consider Xtrabackup; we use it in many cases to avoid the penalty of an LVM snapshot.

Mirrors

15 years ago

Peter,
I tested the mysql-bench suite, including test-create, test-insert and test-select, but only test-create suffered from the LVM snapshot, and it was about 10 times slower. It seems that select/insert operations are not seriously affected. Is that right?

Author

Peter Zaitsev

15 years ago

Mirrors,

mysql-bench is a very poor tool for performance estimation for the majority of workloads. It uses a very small data set, so most of the tests will not be IO bound. test-create calls fsync() for every table created, which is likely why it is affected. Test with your real workload, or a similar one, on a data set of your scale to get meaningful results.

Brett

15 years ago

I know many of you here are talking about Databases setups etc, but on more generally loaded systems, webservers, email, etc would breaking down the size of Logical volumes relieve some of the problem? If you took a snapshot of an extremely large and active Logical Volume then the amount of IO changes would obviously be large and hence overall performance would drop during snapshots. But if you broke down your drives into many smaller Logical Volumes then the performance hit per snapshot (per logical volume) would be less overall. I am talking about more general purpose setups, not pure database systems. This is what I am currently looking at now as a possible way to relieve the snapshot “blues” for our servers. Rather than having one large logical volume I am considering breaking them down into many smaller logical volumes. Do people think this could be effective in reducing the performance hit of snapshots?

LenZ

15 years ago

Brett: Sure, if you move the hot spots of activity to separate volumes, there will be less data that has to be copied around while the snapshot exists. It also helps to reduce the size required to keep a snapshot active. But it also means that you won’t be able to create a consistent/atomic snapshot across all your data. If that’s not a concern, it sounds like an option to look into.