MySQL/ZFS Performance Update

As some of you likely know, I have a favorable view of ZFS, especially for MySQL on ZFS. As I wrote a few years ago, the argument for ZFS was less about performance than about its useful features, like data compression and snapshots. At the time, ZFS was significantly slower than xfs and ext4, except when the L2ARC was used.

Since then, however, ZFS on Linux has progressed a lot and I also learned how to better tune it. Also, I found out the sysbench benchmark I used at the time was not a fair choice since the dataset it generates compresses much less than a realistic one. For all these reasons, I believe that it is time to revisit the performance aspect of MySQL on ZFS.

ZFS Evolution

In 2018, I reported ZFS performance results based on version 0.6.5.6, the default version available in Ubuntu Xenial. The present post is using version 0.8.6-1 of ZFS, the default one available on Debian Buster. Between the two versions, there are in excess of 3600 commits adding a number of new features like support for trim operations and the addition of the efficient zstd compression algorithm.

ZFS 0.8.6-1 is not bleeding edge: more than 1700 commits have landed since, and after 0.8.6 the ZFS release number jumped to 2.0. The big addition included in the 2.0 release is native encryption.

Benchmark Tools

The classic sysbench MySQL database benchmarks use a dataset containing mostly random data. Such datasets don’t compress much, far less than most real-world datasets I have worked with. The compressibility of the dataset matters because the ZFS caches, the ARC and L2ARC, store compressed data. A better compression ratio essentially means more data is cached and fewer IO operations are needed.

A well-known tool to benchmark a transactional workload is TPCC. Furthermore, the dataset created by TPCC compresses rather well, making it more realistic in the context of this post. I used the sysbench TPCC implementation.

Test Environment

Since I am already familiar with AWS and Google cloud, I decided to try Azure for this project. I launched these two virtual machines:

tpcc:

  • Benchmark host
  • Standard D2ds_v4 instance
  • 2 vCPUs, 8 GB of RAM and 75 GB of temporary storage
  • Debian Buster

db:

  • Database host
  • Standard E4-2ds_v4 instance
  • 2 vCPUs, 32 GB of RAM and 150 GB of temporary storage
  • 256 GB SSD Premium (SSD Premium LRS P15 – 1100 IOPS (3500 burst), 125 MB/s)
  • Debian Buster
  • Percona server 8.0.22-13

Configuration

By default and unless specified, the ZFS filesystems are created with:
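A minimal sketch of such a setup, assuming a pool named bench on a single device; the device path, mountpoints, and recordsizes are assumptions, not the exact commands used:

```shell
# Hedged sketch: create the pool and the two filesystems described below.
# /dev/sdc, the mountpoints, and the recordsizes are assumptions.
zpool create bench /dev/sdc
zfs create -o mountpoint=/var/lib/mysql/data -o recordsize=16k \
    -o compression=lz4 -o logbias=throughput bench/data
zfs create -o mountpoint=/var/lib/mysql/log -o recordsize=128k \
    -o compression=lz4 -o logbias=throughput bench/log
zfs set atime=off bench
```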

There are two ZFS filesystems. bench/data is optimized for the InnoDB dataset while bench/log is tuned for the InnoDB log files. Both are compressed using lz4 and the logbias parameter is set to throughput which changes the way the ZIL is used. With ext4, the noatime option is used.

ZFS also has a number of kernel parameters; the ones set to non-default values are:
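As a hedged sketch, module options like these in /etc/modprobe.d/zfs.conf would match the description that follows; the delete-throttling parameter names and all of the values other than the 2GB ARC cap are assumptions:

```
# Cap the ARC at 2GB and reduce how aggressively ZFS frees deleted blocks.
# The two block-freeing parameters and their values are assumptions.
options zfs zfs_arc_max=2147483648
options zfs zfs_async_block_max_blocks=5000
options zfs zfs_delete_blocks=1000
```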

Essentially, the above settings limit the ARC size to 2GB and throttle down the aggressiveness of ZFS for deletes. Finally, the database configuration is slightly different between ZFS and ext4. There is a common section:
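A hedged sketch of what such a common section might contain; the paths, buffer pool size, and log file size are assumptions:

```
[mysqld]
# Common settings shared by the ext4 and ZFS runs (values are assumptions)
datadir=/var/lib/mysql/data
innodb_log_group_home_dir=/var/lib/mysql/log
innodb_buffer_pool_size=26G
innodb_log_file_size=1G
skip-log-bin
```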

and when ext4 is used:
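For ext4, the filesystem-specific fragment would plausibly be limited to the flush method; a sketch:

```
# ext4-specific fragment (sketch)
innodb_flush_method=O_DIRECT
```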

and when ZFS is used:
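A sketch of the ZFS-specific fragment, matching the explanation that follows; the IO thread counts are assumptions:

```
# ZFS-specific fragment (sketch; thread counts are assumptions)
innodb_flush_method=fsync
innodb_doublewrite=0
innodb_use_native_aio=0
innodb_read_io_threads=10
innodb_write_io_threads=10
```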

ZFS doesn’t support O_DIRECT, but the flag is simply ignored, with a message in the error log. I chose to explicitly set the flush method to fsync. The doublewrite buffer is not needed with ZFS, and I was under the impression that the Linux native asynchronous IO implementation was not well supported by ZFS, so I disabled it and increased the number of IO threads. We’ll revisit the asynchronous IO question in a future post.

Dataset

I use the following command to create the dataset:
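As a hedged sketch using sysbench-tpcc; the host, credentials, thread count, and table count are assumptions, while a scale around 200 is consistent with the ~200GB size mentioned below:

```shell
# Hedged sketch of dataset creation with sysbench-tpcc.
# Host, credentials, tables, threads, and scale are assumptions.
./tpcc.lua --mysql-host=db --mysql-user=tpcc --mysql-password=tpcc \
  --mysql-db=tpcc --db-driver=mysql --tables=10 --scale=200 \
  --threads=8 prepare
```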

The resulting dataset has a size of approximately 200GB. The dataset is much larger than the buffer pool so the database performance is essentially IO-bound.

Test Procedure

The execution of every benchmark was scripted and followed these steps:

  1. Stop MySQL
  2. Remove all datafiles
  3. Adjust the filesystem
  4. Copy the dataset
  5. Adjust the MySQL configuration
  6. Start MySQL
  7. Record the configuration
  8. Run the benchmark
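The steps above can be sketched as a shell script; the service name, paths, and location of the reference dataset copy are all assumptions:

```shell
#!/bin/sh
# Sketch of the benchmark procedure; service name, paths, and the
# location of the reference dataset copy are assumptions.
set -e
systemctl stop mysql                          # 1. Stop MySQL
rm -rf /var/lib/mysql/data/* /var/lib/mysql/log/*  # 2. Remove all datafiles
zfs set recordsize=16k bench/data             # 3. Adjust the filesystem
rsync -a /backup/tpcc/ /var/lib/mysql/        # 4. Copy the dataset
cp my.cnf.zfs /etc/mysql/my.cnf               # 5. Adjust the MySQL configuration
systemctl start mysql                         # 6. Start MySQL
cp /etc/mysql/my.cnf "results/config-$(date +%s).cnf"  # 7. Record the configuration
./run_tpcc.sh                                 # 8. Run the benchmark
```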

Results

For the benchmark, I used the following invocation:
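A hedged sketch of what that invocation would look like; the credentials and table/scale values are assumptions, while the thread count, duration, and report interval match the description below:

```shell
# Hedged sketch of the benchmark run; credentials, tables, and scale
# are assumptions.
./tpcc.lua --mysql-host=db --mysql-user=tpcc --mysql-password=tpcc \
  --mysql-db=tpcc --db-driver=mysql --tables=10 --scale=200 \
  --threads=16 --time=7200 --report-interval=10 run
```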

The TPCC benchmark uses 16 threads for a duration of 2 hours. The duration is sufficiently long to allow for a steady state and to exhaust the storage burst capacity. Sysbench returns the total number of TPCC transactions per second every 10s. This number includes not only the New Order transactions but also the other transaction types like payment, order status, etc. Be aware of that if you want to compare these results with other TPCC benchmarks.

In those conditions, the figure below presents the rates of TPCC transactions over time for ext4 and ZFS.

MySQL TPCC results for ext4 and ZFS

During the initial 15 minutes, the buffer pool warms up, but at some point the workload shifts from being IO-read bound to being IO-write and CPU bound. Then, at around 3000s, the SSD Premium burst capacity is exhausted and the workload becomes purely IO-bound. I was surprised enough by the results to rerun the benchmarks to make sure. The results for both ext4 and ZFS are qualitatively similar; any difference is within the margin of error. That essentially means that if you configure ZFS properly, it can be as IO efficient as ext4.

What is interesting is the amount of storage used. While the dataset on ext4 consumed 191GB, the lz4 compression of ZFS yielded a dataset of only 69GB. That’s a huge difference, a factor of 2.8, which could save a decent amount of money over time for large datasets.
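The factor quoted above is simply the ratio of the two on-disk sizes:

```python
# Compression factor: ext4 dataset size divided by the ZFS (lz4) size.
ext4_gb = 191
zfs_gb = 69
factor = ext4_gb / zfs_gb
print(f"{factor:.1f}")  # -> 2.8
```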

Conclusion

It appears that it was indeed a good time to revisit the performance of MySQL with ZFS. In a fairly realistic use case, ZFS is on par with ext4 in terms of performance while still providing the extra benefits of data compression, snapshots, etc. In the next post, I’ll examine the use of cloud ephemeral storage with ZFS and see how it can further improve performance.

Percona Distribution for MySQL is the most complete, stable, scalable, and secure open-source MySQL solution available, delivering enterprise-grade database environments for your most critical business applications… and it’s free to use!

Download Percona Distribution for MySQL Today

Comments
Praji Sankaran Venu

Would you comment on using MySQL compression on ext4 vs uncompressed tables in ZFS? I was using compressed tables on ext4 with an LVM of four 1 TB hard drives in EC2.

ghislain

hey what about 5.11 btrfs ?

Dan

btrfs improved a lot; usually only the most advanced features like raid5/6 are still broken, and others need more fine-tuning. Unless you need some advanced features, usage is stable and performance is good (usually a little slower, sometimes a lot slower, and occasionally even faster than ext4).
check https://btrfs.wiki.kernel.org/index.php/Status#Overview
notice that nodatacow may be required to avoid over-fragmentation of the DB.
i remember that some sqlite optimizations were made, dunno if they also apply to mysql/mariadb

Dan

Here is a recent benchmark with btrfs, xfs, ext4 and f2fs. It also includes some basic mariadb benchmarks, but notice that this site is known to use default settings; they don’t do any fine-tuning, so btrfs probably had CoW enabled for the mariadb tests.
https://www.phoronix.com/scan.php?page=news_item&px=Linux-5.14-File-Systems

Bruno Souza Cabral

Thanks for this comparison Yves. I have a very large database with very compressible text data. I had great success with MyRocks, but due to some internal tech debt I had to migrate this database to Group Replication. With InnoDB page compression + Percona compressible columns, it is 3x larger. I’m thinking about dropping page compression and testing ZFS with compression. Page compression feels like an inefficient hack with holes in my disk. Do you think this would work better (better compression and/or better performance)?

Hi Yves
Very interesting blog post thank you
It would have been interesting to see the results with local NVMe disks, where the IO performance is much higher, to see if ZFS still compares to ext4.
https://docs.microsoft.com/fr-fr/azure/virtual-machines/lsv2-series
What do you think?
Best regards
Camille

Ivan Baldo

Thanks for sharing!
It would be interesting to compare with ext4 and InnoDB compression.
Also, I don’t see benchmarks with the lazytime option, is it negligible?

Ivan Baldo

But what happens with files modification times?
Isn’t lazytime useful for that?
Thanks!

Domas Mituzas

People are using ext4 for databases? weird!

Alex Samorukov

Thank you, great article. Just curious, did you ever try to repeat the same test on FreeBSD? They now share the OpenZFS code, so the kernel would be the only difference. I can try to reproduce that…

Eric Robinson

Hi Yves, I repeated your test on local hardware and compared the results to Ext4 with InnoDB table compression. See the following graph.

https://www.dropbox.com/s/e6si0fm9i5a9d5o/zfs-vs-ext4.png?dl=0

I believe I followed your steps exactly. The test system is a Dell PowerEdge R640 with 48 cores, 1 TB RAM, and 6 x Dell 3.2 TB Enterprise NVME drives. On the ZFS test, all 6 drives are in one zpool but I created separate zfs filesystems for data and logs, per your instructions. For the Ext4 test, I created a single LVM RAID array with 6 stripes (4K block size), but the data and logs are on the same LV. Even so, Ext4 dramatically outperformed ZFS.

The compression ratio on ZFS was much better. The test database, which is 189 GB uncompressed, ended up being 92 GB with InnoDB table compression, but only 59 GB on ZFS.

Something to note is that IOWAIT averaged about 2% during the Ext4 test, but was around 0-0.5% on the ZFS test.
Another interesting finding is that performance was only slightly impacted when I used a much smaller buffer pool. With an 8 GB pool (versus 26 GB), average performance on ZFS only dropped by about 100 TPS. I have not tried that with Ext4 yet.

These results are disappointing because I really want to use ZFS, but now I am not sure I can justify it. It is possible that I missed something. You’re welcome to examine the test system if you are interested.

Eric Robinson

If the link does not work when clicking it, try pasting it into a browser manually. That worked for me.

Eric Robinson

I performed a default install of percona-server-server.x86_64 8.0.26-16.1.el8 and made no changes beyond the ones in your procedure. Regarding the ARC stats, I see:

c                          4    125549541504
cached_only_in_progress    4    0
c_max                      4    540636465152
c_min                      4    33789779072
size                            124531475032

> ZFS would start to show its value with a dataset size greater than the memory size.

Ultimately, that describes the situation I will have. The server will be home to 100+ separate instances of MySQL with customized configurations, none of which will enjoy the benefit of the server’s full resources. The databases will range in size from 10 GB to 400 GB, most around 20-25 GB. All instances will run with buffer pools that are as small as is feasible.

Eric Robinson

You mean the sysbench –scale parameter?

Eric Robinson

Clearly the glitch was with me, not the forum. 🙂

Eric Robinson

Hi Yves, I ran the sysbench test on ZFS at 2500 scale. Resulting graph is here:

https://www.dropbox.com/s/2mblcsk0hg6f0bb/zfs-at-scale.png?dl=0

The numbers are not terribly different from the test at 200 scale, about 70 TPS less on average, but I’m not sure what that means. By the way, did you catch my original statement that there will be 100+ separate instances of MySQL on the server, not just 100+ databases? Only a portion of the server’s total resources will be allocated to each instance. Obviously the conventional wisdom of 70% server RAM for the buffer pool is off the table.

Eric Robinson

Yves, here’s a question for you. We’re using ZFS zvols as the backing devices for DRBD volumes. The filesystems we create on top of DRBD could be either ZFS, EXT4, or XFS. I’m thinking ZFS on top of DRBD is not the best idea because ZFS normally wants to talk directly to the drives, so I guess DRBD would defeat that. That leaves EXT4 or XFS. If we go with one of those, which of your sample my.cnf configs would we use for testing?

Kamil

Thank you for publishing your results.

I think that to be more fair you should reduce innodb buffer by 8GB in case of ZFS (to use same amount of RAM).

I have performed similar tests with 36GB database:

  • 1GB innodb buffer + 4GB ARC on ZFS (about 5GB RAM total)
  • 5GB innodb buffer for ext4 (about 5GB RAM total)

I removed skip-log-bin in my configurations, because I need it, and my server has an 8k recordsize set in ZFS on the whole pool (I don’t remember why).

Results for ZFS are still much better than ext4 (ZFS is faster by 50%) and I have no issues with latency on ZFS at all; actually, it has lower latency than the setup with ext4.

And the compression on ZFS is outstanding. My database uses only ~50% of space on ZFS.
