ext4 vs xfs on SSD

March 15, 2012

Author

Vadim Tkachenko

Benchmarks

MySQL

ext4 is the de facto standard filesystem on many modern Linux systems, so I get a lot of questions about whether it is a good choice for SSDs or whether something else (e.g. xfs) should be used.
Traditionally our recommendation has been xfs, which comes down to a known problem in ext3, where IO gets serialized per inode in O_DIRECT mode (see, for example, Domas’s post).

However, the results of my recent benchmarks suggest that this should be revisited.
While I am still running experiments, I would like to share the early results I have.

I used a STEC 200GB SLC SATA SSD (my thanks to STEC for providing the drives).

What I see is that ext4 still has a problem with O_DIRECT. Here are the results for the “single file” O_DIRECT case (sysbench fileio, 16 KiB block size, random write workload):

- ext4 1 thread: 87 MiB/sec

- ext4 4 threads: 74 MiB/sec

- xfs 4 threads: 97 MiB/sec

The drop in throughput with 4 threads on ext4 is a sign that there are still contention issues.

It was pointed out to me that ext4 has a dioread_nolock option, which supposedly fixes that, but the option is not available on my CentOS 6.2, so I could not test it.
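
On a kernel where ext4 does accept dioread_nolock, it may be worth confirming that the option is really in effect before benchmarking. A minimal sketch, assuming the /mnt/stec mount point used in the test script; has_mount_opt is a hypothetical helper name, /dev/sdX is a placeholder device, and whether the kernel lists the option in /proc/mounts depends on the kernel version:

```shell
#!/bin/sh
# Hypothetical helper: succeed if mount point $1 lists option $2
# in a /proc/mounts-style file ($3, defaults to /proc/mounts).
has_mount_opt() {
    mnt=$1; opt=$2; file=${3:-/proc/mounts}
    awk -v m="$mnt" -v o="$opt" '
        # field 2 is the mount point, field 4 the comma-separated options
        $2 == m { n = split($4, a, ","); for (i = 1; i <= n; i++) if (a[i] == o) found = 1 }
        END { exit found ? 0 : 1 }' "$file"
}

# The option would be requested at mount time, for example (as root):
#   mount -o dioread_nolock /dev/sdX /mnt/stec
if has_mount_opt /mnt/stec dioread_nolock; then
    echo "dioread_nolock is set on /mnt/stec"
else
    echo "dioread_nolock is not set on /mnt/stec"
fi
```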

At this point we may decide that xfs is still preferable, but there is one more point to consider.

Starting with MySQL 5.1 + InnoDB-plugin, and later MySQL 5.5 (or equally Percona Server 5.1 and 5.5), InnoDB uses “asynchronous” IO on Linux.

Let’s test “async” mode in sysbench. Now we get:

- ext4 4 threads: 120 MiB/sec

- xfs 4 threads: 97 MiB/sec

This corresponds to the results I see running MySQL benchmarks (to be published later) on ext4 vs xfs.

Actually, the number of threads does not affect the result significantly. This relates to another question I was asked, namely: “If MySQL 5.5 uses async IO, is innodb_write_io_threads still important?”, and it seems it is not. In my tests it does not affect the final result. I would still use a value of 2 or 4 to avoid scheduling overhead from a single thread, but it does not seem critical.

In conclusion, ext4 looks like a good option, providing roughly 20% better throughput in “async” mode (120 vs 97 MiB/sec). I am still going to run more benchmarks to get a better picture.
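
The relative differences follow directly from the throughput figures listed above. A small sketch of the arithmetic (pct_change is a hypothetical helper name, not part of sysbench):

```shell
#!/bin/sh
# Percentage change from a baseline value ($1) to a new value ($2),
# printed with one decimal place.
pct_change() {
    awk -v base="$1" -v new="$2" 'BEGIN { printf "%.1f\n", (new - base) / base * 100 }'
}

pct_change 87 74     # ext4 sync, 1 thread -> 4 threads: -14.9
pct_change 74 97     # sync, 4 threads, ext4 -> xfs:      31.1
pct_change 97 120    # async, 4 threads, xfs -> ext4:     23.7
```

The async gap works out to about 24%, in line with the roughly 20% figure quoted in the conclusion.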

The script for tests:

# Single test file of ${size} GiB on the SSD mount point
for size in 100
do
    cd /mnt/stec || exit 1
    sysbench --test=fileio --file-num=1 --file-total-size=${size}G prepare
    # Flush dirty pages and drop the page cache before the run (requires root)
    sync
    echo 3 > /proc/sys/vm/drop_caches

    for numthreads in 4
    do
        # One hour of 16 KiB random writes with O_DIRECT, synchronous IO mode
        sysbench --test=fileio --file-total-size=${size}G --file-test-mode=rndwr --max-time=3600 --max-requests=0 --num-threads=$numthreads --rand-init=on --file-num=1 --file-extra-flags=direct --file-fsync-freq=0 --file-io-mode=sync --file-block-size=16384 --report-interval=10 run | tee -a run$size.thr$numthreads.txt
    done
done
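
The script above covers the synchronous case only. For the “async” numbers, a variant along the following lines could be used. This is a sketch under the assumption that the async runs differ from the sync ones only in --file-io-mode (the post does not list the exact async command line); sysbench_cmd is a hypothetical helper, and the per-mode output file name is likewise made up for illustration. By default it only prints the commands:

```shell
#!/bin/sh
# Hypothetical helper: build the sysbench command line for one fileio run.
# $1 = IO mode (sync or async), $2 = thread count, $3 = file size in GiB.
sysbench_cmd() {
    mode=$1; threads=$2; size=$3
    echo "sysbench --test=fileio --file-total-size=${size}G" \
         "--file-test-mode=rndwr --max-time=3600 --max-requests=0" \
         "--num-threads=${threads} --rand-init=on --file-num=1" \
         "--file-extra-flags=direct --file-fsync-freq=0" \
         "--file-io-mode=${mode} --file-block-size=16384" \
         "--report-interval=10 run"
}

# Print the commands by default; set RUN=1 to execute them
# (assumes sysbench is installed and the prepared test file is in the
# current directory).
for mode in sync async
do
    cmd=$(sysbench_cmd "$mode" 4 100)
    if [ "${RUN:-0}" = "1" ]; then
        $cmd | tee -a "run100.thr4.$mode.txt"
    else
        echo "$cmd"
    fi
done
```

Keeping every other flag identical between the two modes means any difference in throughput can be attributed to the IO mode rather than to the workload.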

28 Comments

Rolf

14 years ago

Could you provide a comparison of this to rawfs? Also, I was under the impression that a journaled FS was a bad idea on SSD.

Aurimas Mikalauskas

14 years ago

Vadim, was that done with write-back cache enabled ?

rj03hou

14 years ago

If we do a FileIO test for MySQL, do you think we need to test many files,
such as --file-num=32?

Author

Vadim

14 years ago

Aurimas,

Yes, the cards have write cache enabled.

Author

Vadim

14 years ago

rj03hou,

You can use 32 files; it does not show a big difference in “async” IO.

My point in using a single file in this benchmark was to find out whether serialization is still a problem in ext4.

joe

14 years ago

What would be interesting is to see your data beyond 4 threads. The presentation http://www.youtube.com/watch?v=FegjLbCnoBw shows EXT4 did really well until scaling beyond 4 threads.

Author

Vadim

14 years ago

joe,

in “async” mode increasing beyond 4 threads does not really affect the final result.

Mike

14 years ago

Are you using innodb file-per-table? ibdata getting huge would drive me nuts

14 years ago

A few days ago I ran a very similar test: the sysbench OLTP RW benchmark on spinning (non-SSD) disks. The file systems on this “inherited” machine default to ext4, but I had Domas’ claims in mind that xfs is superior. So I recreated the benchmark fs as xfs and repeated the sysbench run.

Up to 8 threads xfs was a few percent faster (~10% on average).
At 16 threads it was a draw (2036 tps vs. 2070 tps).
At 32 threads ext4 was 28% faster (2345 tps vs. 1829 tps).
At 64 threads ext4 was even 47% faster (2362 tps vs. 1601 tps).
At higher concurrency ext4 lost its bite, but was still consistently better than xfs.

I did not look deeper into this, but used the fs modules as they come with openSuSE and with default mount options.

Mike Schueler

14 years ago

Useful results for when we move to SSDs..

Were sync_binlog=1 and innodb_flush_at_trx_commit=1 ?

Author

Vadim Tkachenko

14 years ago

Mike Schueler,

For these results I did not use MySQL, it was just sysbench fileio.

Mike Schueler

14 years ago

Oh, of course, hah. Well, I’m eagerly awaiting your MySQL results.

Dave Chinner

14 years ago

Hi Vadim,

You’ve just found a known problem that we are working to fix. This is a regression resulting from cleaning up the IO path in the XFS code – it’s put a lot more pressure on an exclusive lock by removing other bottlenecks. Hence when we hit contention on it, it degrades more quickly than it used to. This was brought to my attention recently:

http://oss.sgi.com/archives/xfs/2012-01/msg00325.html

and this was the prototype patch I wrote to fix the overwrite DIO performance problem:

http://oss.sgi.com/archives/xfs/2012-02/msg00219.html

Now that the prerequisite fixes have been merged into 3.4, I can move forward with this fix.

I ran the sysbench tests to check this was the problem – I don’t have a SSD available right now, so I ran the tests on a 2GB ramdisk with a 1.8Gb file. Unpatched results – the numbers are throughput /IOPS, with throughput being in GB/s

          sync throughput         async throughput
threads   XFS         ext4        XFS         ext4
1         1.90/124k   1.41/92k    1.72/112k   1.41/92k
2         1.01/64k    1.65/108k   0.97/62k    1.65/108k
4         0.27/17k    1.55/102k   0.21/13k    1.55/102k
8         0.13/8k     1.45/95k    0.15/9k     1.45/95k
16        0.12/7k     1.45/95k    0.12/7k     1.45/95k

It’s pretty clear from these results that lock contention is killing XFS as the thread count grows. ext4 performance shows that it uses exclusive locking as well, but it is not degrading like XFS is due to different lock types being used. With the above patch forward ported to 3.4-pre-RC1, the XFS results are:

          sync throughput         async throughput
threads   vanilla     patched     vanilla     patched
1         1.90/124k   1.83/120k   1.72/112k   1.69/111k
2         1.01/64k    2.85/185k   0.97/62k    2.57/168k
4         0.27/17k    3.68/241k   0.21/13k    3.41/223k
8         0.13/8k     4.42/290k   0.15/9k     4.16/273k
16        0.12/7k     4.95/325k   0.12/7k     4.86/319k

Throughput scales with thread count – each thread runs at 100% CPU utilisation, and XFS gets up to 3x as much throughput as ext4 does. Other testing I’ve done on this machine with this patch has given close to a million 4k overwrite IOPS to a single file when completely CPU bound…

So, basically, XFS is still the filesystem you want for direct IO – but, as with any filesystem, bugs do creep in as we improve stuff. In future, you might want to report such problems to the XFS list, rather than just blogging about it and assuming that it’s expected behaviour. Someone who thought this was a suspect result pointed me at your blog; I would have never found it otherwise.

Cheers,

Dave.

Author

Vadim Tkachenko

14 years ago

Dave,

Thank you for explaining this, good to see you have a fix for it.
The main question for me is whether the fix will be available in RedHat 5 or 6 distributions.

This is the main platform on which users are running MySQL databases.

Under these conditions, the point of this post was to provide information on which filesystem provides better throughput.

I will be waiting for fix in RedHat, until that I will consider ext4 as good alternative.

Dave Chinner

14 years ago

> I will be waiting for fix in RedHat, until that I will consider ext4 as good alternative.

That’s a very short-sighted response. If you report the bug to RH, then a fix should become available at some point in the not-too-distant future, too (and perhaps you should pay for RHEL so you can ask for bug fixes as a proactive service to your customers). At that point, the normal rule of thumb (use XFS) will prevail, and all that will have resulted is that you have a bunch of unhappy people left using ext4 because they’ve already deployed it to production on the back of your recommendation…

Recommend ext4 for the right reasons – XFS having a performance regression isn’t one of them, because the moment XFS is fixed (and it will be) your recommendation is invalid. Your blog post will continue to be found by google for years to come, so it’s not just your current readers that you are providing bad advice to….

Dave.

Jerry Westerby

14 years ago

Vadim,

Selecting something as important as an Operating System or major component like a File System on the basis of ONE fairly small issue is short-sighted in the extreme. Take a look at the bug list for Red Hat at any one point; the list is quite long. For that matter look at the bug list or patch list for MySQL/InnoDB at any one point in time. The list will be long, and to the uninitiated it will look frightening.

The same is true for commercial software, if the list is known. With commercial software and fat, annual license agreements, customers can try to pressure companies to move one patch ahead of another. I’ve worked in that environment, and can tell you that I’ve almost never seen a patch moved up that way unless something is completely broken.

There used to be, and in some places still is, a developer culture of common decency. That’s where things like the support of ZFS outside of Solaris come from… people do technical work for the pleasure of it, not for a pay check.

So in the first count, you are reacting to a patch issue in a way that is “fussy” in the extreme. The systems I put on-line have ZFS and we rely on it. I can just imagine some manager with little technical skill throwing your post over the wall at me. Worse, someone electing to use ext3/4 because of your post.

In the second event I find your comments rather disingenuous. Do you pay license fees for your operating system?

Author

Vadim Tkachenko

14 years ago

Jerry,

First, thank you for a disagreement, I do not often see discussion in comments, but this is what makes it interesting.

I am in open source business myself, so I understand your sentiments, however I can’t agree with your logic.
For me it sounds like: “This is a free food, eat what you are given and do not complain”.

Now about recommendations. What kind of criteria do you think I should use to recommend one product/filesystem or another?
For me, the number of bugs (how stable it is) is one of the most important factors.

We are in the consulting business as well, and customers are most interested in recommendations that they can use today, in production deployments. How stable the product will be in 2 years is a rather theoretical question, of interest for personal curiosity.

There is a big difference between when a problem is considered solved for developers and for ops.
For a developer: pushed to the source code tree, problem solved.
For ops: it has to be available in the official distribution, in the repository, and often it has to pass a QA cycle, which sometimes can be as long as 6 months.

Following your logic, it seems I should just go ahead and recommend btrfs to everyone, because it has such a great concept. Who cares that it shows 5x worse performance than ext4/xfs and crashes every so often? It is still a free product and all the problems will be fixed in 3-5 years, so everybody will be prepared by that time.

Benoit Sigoure

14 years ago

I’ve also had to benchmark ext4 vs XFS, on a RAID10 of spinning drives. Many DBAs like to assert that XFS is the way to go for MySQL, but I’m not sure how frequently they benchmark XFS vs ext4, and how much of their recommendation comes from the days of ext2/ext3.

So I decided to use sysbench and plot some graphs to compare the O_DIRECT performance of ext4 and XFS on the hardware we bought. I found that ext4 worked really well out of the box, while XFS required poorly documented knobs to be turned and still couldn’t beat ext4:

http://blog.tsunanet.net/2011/09/ext4-2x-faster-than-xfs.html

So we built all our MySQL DBs with ext4, and it’s been working great for the past 7 months. I recently had an interesting conversation with someone building a large Ceph cluster on top of XFS instead of btrfs, and his feedback was that some recent developments in the XFS world have greatly enhanced the metadata performance of XFS (especially with regards to metadata fragmentation), so maybe it’s time to do another benchmark.

What I found with XFS is that, to my great surprise, changing the number of allocation groups, setting the correct sunit/swidth for the RAID array, or using nobarrier, all have no statistically significant impact on performance, which seems to indicate that XFS had some bottleneck internally, maybe the lock contention issue that Dave was referring to above.

Disclaimer: I’m just someone who runs their own benchmarks, and am not religious about filesystems.

Author

Vadim Tkachenko

14 years ago

Benoit,

Thank you for sharing.
What software do you use to show benchmark results?

XFS may have fixed bugs and gained performance improvements, but again, I need to have them in my kernel to be able to test them.

Benoit Sigoure

14 years ago

I wrote a couple scripts to be able to visualize sysbench results, they’re available here: https://github.com/tsuna/sysbench-tools

I understand you need the fixes in your kernel, and many of Percona’s customers are in the same situation. That’s why we don’t use RHEL/CentOS, because they come with generally outdated software and it’s hard to use newer versions or upgrade.

At the time I ran the benchmark above, we were using Linux 2.6.32, as my blog says. We’re now on 2.6.38, and whenever is the next time our servers reboot, they’ll be on kernel 3.x (depending what ‘x’ we use then, we’re currently on 3.0).

But I agree with you that there’s a difference between when the developers consider the issue fixed, and when end users can actually get the fix, regardless of whether or not you’re on a slow track with RHEL or on a faster track with a distro that’s closer to upstream.

sandeep

14 years ago

quick question – the xfs_freeze feature is really killer when one wants to do consistent snapshots/backup of database directories, etc. Is there anything similar in ext4 or is the benchmark of ext4 +lvm (which I know allows this) on par with xfs_freeze (without lvm) ?

Author

Vadim Tkachenko

14 years ago

sandeep,

I am not aware of an alternative in ext4.

There is a patch for MySQL which can do something similar on any filesystem (it comes with Percona XtraDB Cluster).

Justin Rovang

14 years ago

Benoit, nice little script you’ve got there; using it!

prl77

13 years ago

Hey guys, great discussion, thank you for your contributions.
I’m currently building a new database server and am at the point of choosing a filesystem. I was going to go with XFS, but the above findings are concerning. I’m glad I came across this discussion.
Question for Dave Chinner or anyone else that might know – is there a released kernel that has this XFS fix included?

Author

VadimTk

13 years ago

prl77

From
https://bugzilla.redhat.com/show_bug.cgi?id=807503

I read that the fix will be in RedHat 6.4

Joe

13 years ago

I was just curious if you will test this again now that the xfs bug looks to be fixed in RHEL 6.4. I am assuming CentOS 6.4 will be released pretty soon.

Elias

11 years ago

How would one know if you are running a patched kernel or not?
I’m using Ubuntu Server 12.04 with their standard kernel 3.2.0.

danblack

11 years ago

From Dave’s patch I found it merged in 3.5+ kernels https://github.com/torvalds/linux/commit/507630b29f13a3d8689895618b12015308402e22

If you see the same sort of code in your kernel sources (apt-get source linux-image-3.2.0-4-amd64) I guess it’s fixed. linux_3.2.60-1+deb7u3 didn’t include it FWIW.