ZFS on Linux and MySQL

I am currently working with a large customer and I am involved with servers located in two data centers, one with Solaris servers and the other with Linux servers. The Solaris side is cleverly set up using zones and ZFS, which provides a very low virtualization overhead. I learned quite a lot about these technologies while looking at this, thanks to Corey Mosher.

On the Linux side, we recently deployed a pair of servers for backup purposes: boxes with 64 300GB SAS drives, 3 RAID controllers and 192GB of RAM. These servers will each run a few slave instances of production database servers and will perform the backups.  The write load is not excessive, so a single server can easily handle the write load of all the MySQL instances.  The original idea was to configure them with RAID-10 + LVM, making sure to stripe the LVs where needed and to align the partitions correctly.

We got decent tpcc performance, nearly 37k NoTPM using 5.6.11 and xfs.  Then, since ZFS on Linux is available and there is in-house ZFS knowledge, we decided to reconfigure one of the servers and give ZFS a try.  So I trashed the RAID-10 arrays, configured JBODs and gave all those drives to ZFS (30 mirrors + spares + OS partition mirror), and I limited the ARC size to 4GB.  I don’t want to start a war, but the ZFS performance level was less than half that of xfs for the tpcc test, and that’s maybe just normal.  We didn’t try too hard to get better performance because we already had more than enough for our purpose and some ZFS features are just too useful for backups (most also apply to btrfs). Let’s review them.


Snapshots

ZFS does snapshots, like LVM, but since it is a copy-on-write filesystem, the snapshots are free: there is no performance penalty.  You can easily run a server with hundreds of snapshots.  With LVM, your IO performance drops to 33% after the first snapshot, so keeping a large number of snapshots running is simply not an option.  With ZFS you can easily have:

  • one snapshot per day for the last 30 days
  • one snapshot per hour for the last 2 days
  • one snapshot per 5min for the last 2 hours

and that will be perfectly fine.  Since starting a snapshot takes less than a second, you could even be more zealous.  That is pretty interesting for speeding up point-in-time recovery when your dataset is 700GB.  If you google a bit for “zfs snapshot script” you’ll find many scripts ready for the task.  Snapshots work best with InnoDB; with MyISAM you’ll have to start the snapshot while holding a “flush tables with read lock”, and the flush operation will take some time to complete.
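
The retention schedule above could be driven by a small rotation script along these lines. This is only a sketch under assumptions: the dataset name tank/mysql, the tier names and the snapshot naming scheme are hypothetical, and with DRY_RUN=1 (the default here) the script only prints the zfs commands instead of running them.

```shell
#!/usr/bin/env bash
# Tiered snapshot rotation sketch. DATASET and the tier names are
# hypothetical; DRY_RUN=1 prints the zfs commands instead of running them.
DATASET="${DATASET:-tank/mysql}"
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi
}

# Number of snapshots each tier keeps, following the schedule in the post.
tier_keep() {
    case "$1" in
        daily)  echo 30 ;;   # one per day for the last 30 days
        hourly) echo 48 ;;   # one per hour for the last 2 days
        5min)   echo 24 ;;   # one per 5 minutes for the last 2 hours
        *)      return 1 ;;
    esac
}

# Create <dataset>@<tier>-<timestamp>; the timestamp sorts chronologically.
take_snapshot() {
    local tier="$1" stamp="${2:-$(date +%Y%m%d-%H%M%S)}"
    run zfs snapshot "${DATASET}@${tier}-${stamp}"
}

# Read snapshot names on stdin (one per line) and print those of the given
# tier that are older than the newest "keep" snapshots.
expired_snapshots() {
    local tier="$1" keep
    keep="$(tier_keep "$tier")" || return 1
    grep "@${tier}-" | sort -r | tail -n "+$((keep + 1))"
}
```

A cron job per tier could call take_snapshot and then feed the output of `zfs list -H -t snapshot -o name -r tank/mysql` into expired_snapshots, passing each resulting name to `zfs destroy`. The three tiers together keep 30 + 48 + 24 = 102 snapshots.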


Compression

ZFS can compress data on the fly and it is surprisingly cheap.  In fact, the best tpcc results I got were with compression enabled.  I still have to explain this; maybe it is related to better use of the raid controller write cache.  Even the fairly slow gzip-1 mode works well.  The tpcc database, which contains a lot of random data that doesn’t compress well, showed a compression ratio of 1.70 with gzip-1.  Real data will compress much more.  That gives us much more disk space than we expected, so even more snapshots!
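
To get a feel for what a ratio of 1.70 means, here is a small arithmetic sketch. It assumes the ratio is logical size divided by on-disk size, and it applies the tpcc ratio to the 700GB dataset size mentioned earlier purely as an illustration; real data would compress differently. The function names are made up for this example.

```shell
#!/usr/bin/env bash
# physical_gb LOGICAL_GB RATIO: on-disk size in GB, rounded to a whole GB,
# assuming RATIO = logical size / on-disk size.
physical_gb() {
    awk -v l="$1" -v r="$2" 'BEGIN { printf "%.0f\n", l / r }'
}

# saved_pct RATIO: share of disk space saved, as a whole percentage.
saved_pct() {
    awk -v r="$1" 'BEGIN { printf "%.0f\n", (1 - 1 / r) * 100 }'
}

physical_gb 700 1.70   # prints 412
saved_pct 1.70         # prints 41
```

On a live system, compression is enabled per dataset with `zfs set compression=gzip-1 tank/mysql` (tank/mysql being a hypothetical dataset name) and the achieved ratio is reported by `zfs get compressratio tank/mysql`.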


Integrity

With ZFS each record on disk has a checksum.  If a cosmic ray flips a bit on a drive, instead of crashing InnoDB, the error will be caught by ZFS and the data will be read from the other drive in the mirror.
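
The principle can be illustrated with a toy model. This is not how ZFS is implemented: plain files stand in for the two sides of a mirror and the POSIX cksum CRC stands in for the ZFS block checksum. It only shows the idea of verifying a stored checksum on read and falling back to the other copy.

```shell
#!/usr/bin/env bash
# Toy model of a checksummed read from a mirror (not ZFS internals).
checksum_of() { cksum < "$1" | cut -d' ' -f1; }

# read_block EXPECTED COPY...: print the first copy whose checksum matches.
read_block() {
    local expected="$1" copy
    shift
    for copy in "$@"; do
        if [ "$(checksum_of "$copy")" = "$expected" ]; then
            cat "$copy"
            return 0
        fi
    done
    echo "all copies failed checksum verification" >&2
    return 1
}

dir="$(mktemp -d)"
printf 'InnoDB page' > "$dir/disk0"
cp "$dir/disk0" "$dir/disk1"
stored="$(checksum_of "$dir/disk0")"

# Simulate a flipped bit on disk0: 'I' (0x49) becomes 'H' (0x48).
printf 'H' | dd of="$dir/disk0" bs=1 count=1 conv=notrunc 2>/dev/null

read_block "$stored" "$dir/disk0" "$dir/disk1"   # prints: InnoDB page
rm -rf "$dir"
```

On a real pool, `zpool scrub` walks all the data and verifies it against the checksums, and `zpool status` reports any checksum errors found per device.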

Better availability and disk usage

On purpose, I allocated mirror pairs using drives from different controllers.  That way, if a controller dies, the storage will still be working.  Also, instead of having 1 or 2 spare drives per controller, I have 2 for the whole setup.  A small yet interesting saving.
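
A layout like this could be generated with a helper along the following lines. It is a sketch, not the commands actually used on these servers: the device names follow a hypothetical c<controller>d<disk> scheme, the pool name tank is made up, and the function only prints the `zpool create` command rather than running it.

```shell
#!/usr/bin/env bash
# build_zpool_cmd POOL [SPARE...]: read "devA devB" pairs on stdin and print
# a zpool create command with one mirror vdev per pair plus shared spares.
# Device names are assumed to look like c<controller>d<disk> (hypothetical).
build_zpool_cmd() {
    local pool="$1" a b cmd
    shift
    cmd="zpool create $pool"
    while read -r a b; do
        # Refuse a mirror whose two sides hang off the same controller.
        if [ "${a%%d*}" = "${b%%d*}" ]; then
            echo "pair $a $b sits on a single controller" >&2
            return 1
        fi
        cmd="$cmd mirror $a $b"
    done
    if [ "$#" -gt 0 ]; then cmd="$cmd spare $*"; fi
    echo "$cmd"
}

# Three mirrors spread over three controllers, with two pool-wide spares.
build_zpool_cmd tank c0d20 c1d20 <<'EOF'
c0d0 c1d0
c1d1 c2d1
c2d2 c0d2
EOF
```

The drive count of the real setup also works out: 30 mirrors use 60 drives, which with 2 spares and the 2-drive OS mirror accounts for all 64.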

All put together, ZFS on Linux is a very interesting solution for MySQL backup servers.  All backup solutions have an impact on performance; with ZFS, the impact is up front and the backups are almost free.

Comments (44)

  • JDempster

    We’ve been using ZFS on FreeBSD for backup and production for a few years now.

    Performance took a big hit to start with, mainly due to the random IO caused by COW (copy on write). SSDs solved the issue for us, plus we still benefit from the improved speed of having SSDs.

    SSDs are more costly than spinning media, but the speed difference easily makes up for it. Add to that the compression gain from ZFS and there is really no cost difference.

    ZFS provides built in checksumming and double write buffering, so make sure these are turned off in InnoDB.

    May 24, 2013 at 7:21 am
  • Ricardo Santos

    We have been using MySQL over ZFS since 2011 for more than 500.000 databases (about 120.000 users)

    May 24, 2013 at 7:24 am
  • Nils

    Isn’t ZFS on Linux going through FUSE? That should be quite slow.

    May 24, 2013 at 9:03 am
  • Yves Trudeau

    @JDempster: I could give the SSD to xfs also with Flashcache and the like. Don’t forget all these jbods have a raid controller write cache in front of them. I also did the Innodb tuning.

    @Nils: no fuse, direct kernel support. Look here: http://zfsonlinux.org/

    May 24, 2013 at 9:52 am
  • Gary E. Miller

    Wow, I pulled all the 300GB drives out of my MySQL servers years ago. I don’t need the space of a 2TB drive, but they are way faster than the old drives. So fast, that for linear writes, like httpd logs, they are as fast as older SSDs. But the newer SSDs are as big, and much faster than the 300GB HDDs, on MySQL type loads.

    May 24, 2013 at 3:48 pm
  • Valentine Gostev

    Nice post Yves!

    Had any chance to try using ZFS volumes via iSCSI? Also why no mention of using SSD drives for L2ARC?

    May 26, 2013 at 12:45 am
  • Janne Enberg

    Would definitely be interesting to see the performance on a proper ZFS implementation, e.g. OpenIndiana or even FreeBSD

    May 27, 2013 at 4:32 am
  • Raghavendra

    a) Any CoW-based filesystem – zfs, btrfs bring with them performance
    penalties associated with CoW. Now, they have their own
    optimizations for this (like larger btrfs metadata block size

    b) For #a, it would be nice if btrfs (from 3.9) and zfsonlinux (latest) are

    c) Regarding integrity, XFS from 3.10 is going to have metadata

    d) What compression algo does ZfsOnLinux use? Is gzip-1 by
    default? btrfs supports LZO etc. too I believe.

    e) Another latest entrant to filesystem is Tux3 which is showing impressive results in performance.

    May 27, 2013 at 11:52 am
  • Yves Trudeau

    @Janne: I don’t think the Linux port is that inferior to make a sizeable difference. Seems very consistent with Solaris at least.

    @Raghavendra: b) I used the latest zfsonlinux. c) A metadata checksum falls well short of a full data checksum. d) I compared lzjb (default), lz4, gzip-1 and gzip-6. I may have found the main performance bottleneck (benchmark in progress); if I am right, I’ll write another blog post about it.

    May 28, 2013 at 9:26 am
  • Nils

    From what I hear, XFS is never going to have data checksums. It’s a philosophical decision that this should be done in the application.

    May 28, 2013 at 9:38 am
  • Raghavendra

    Yes, since XFS uses transactional model, metadata checksums may
    be sufficient.

    With full data checksums there are going to be penalties – with
    both time (even with something like adler32) and space (to store
    it) for a 4k block size.

    lzjb seems interesting, hadn’t heard of it before.


    In a way that is true. Imagine running InnoDB over XFS, you will
    end with two sets of checksums – one that of InnoDB and other of
    XFS itself. In case of InnoDB, the block size – 16k – is much
    larger than 4k and so overhead may be lesser.

    May 28, 2013 at 10:52 am
  • Miklos Szel

    Nice article!

    Actually, I ran into performance issues while testing ZFS on Linux. Granted, that was some months ago, so maybe I should give it another try.

    “Additionally, it should be made clear that the ZFS on Linux implementation has not yet been optimized for performance. As the project matures we can expect performance to improve.”

    May 29, 2013 at 6:57 am
  • Kyle Hailey

    Nice to see this writeup on ZFS and MySQL.
    One question, why with 192GB of memory, did you limit the ARC to 4GB?
    Also wondering if any one is using ZFS and MySQL to thin clone databases meaning using a snapshot of a source database and provisioning clones from those snapshots.
    We are writing a book on Data Virtualization for Databases meaning using a single set of datafiles that support multiple copies of the source database. Would be interested if someone wanted to contribute on the MySQL side.
    Interesting that on Wikipedia Data Virtualization is talked about as data source aggregation but as someone pointed out aggregating data sources should be called data transparency meaning that the source is not seen for example aggregating Oracle, SQL Server and MySQL into one source that the user or application sees. On the other hand following the VMware paradigm where one machine supports multiple virtual machines data virtualization is where one set of data supports multiple consumers and appears that each has an exclusive copy of that source data. (see http://www.dbms2.com/2013/01/05/database-virtualization-data/)
    – Kyle Hailey

    May 31, 2013 at 5:27 pm
  • whatever


    The problem with your config is this: “limit ARC size to 4GB”

    ZFS needs a significant portion of ram to work with. I agree that limiting ARC to small amount of ram prevents double caching by the file system vs the InnoDB caching. However, limited ARC size also limits ZFS file system’s internal metadata caching(which l2arc and deduplication both need plenty of)

    You need to sacrifice a little ram for ZFS metadata caching to make ZFS fast. NexentaStor recommends minimum 1-2GB ram for every 1TB raw storage. So in essence, for a 64x300GB storage system, leaving 20GB for ZFS is worthwhile. (All ZFS systems should max ram slots with 16gb dual rank dimms, which are relatively cheap these days, that means 288GB for Xeon 5600 series servers, and 384GB for Xeon E5-2600 servers, so sacrificing 10% ram for ZFS metadata is worthwhile.)

    June 4, 2013 at 2:16 am
  • whatever

    Another thing: don’t use gzip compression. LZ4 is the best right now with benchmarks showing up to 50% more performance than even lzjb. With future LZ4 compression for L2ARC combined with L2ARC persistence, it is wise to cover the entire 1TB databases with multiple MLC L2ARCs like Samsung 840 Pro or Intel DC S3700 SSDs.

    BTW, Is percona ever going to support OmniOS?

    June 4, 2013 at 2:24 am
  • Yves Trudeau

    @whatever: I compared all compression algo with tpcc and found marginal difference so I was likely not hitting a bottleneck caused by compression. If I use SSD for L2ARC, I’ll need to do the same for xfs… I am redoing my tpcc test with primarycache=metadata, if giving more arc to metadata make sense, I should already see a difference right there. Will keep you posted. It is not an easy move for a MySQL dba to steal 20GB memory from the Innodb buffer pool to give it to the filesystem. I took a look at OmniOS but frankly it was the first time I read something about it. How different is it from Solaris in term of porting?

    @kyle: Yes, that would be interesting.

    June 7, 2013 at 9:35 am
  • whatever


    1. Yes, it is not easy going from Linux to illumos ZFS by giving up 10% total ram for metadata caching, but it is just the price you should be willing to pay for ZFS. ZFS LZ4 compression is the best thing since the invention of L2ARC and ZIL. All I am dreaming now is to have LZ4 compressed persistent L2ARC, which is coming soon.

    2. OmniOS rocks. It is the only stable illumos distro designed for server usage, and it can be bought with an optional tech support agreement. OpenIndiana is desktop oriented, and not quite stable, IMHO, and its lead developer just quit, so I consider that pretty much dead project.

    The only thing I don’t like about OmniOS is that the illumos base currently lacks a potent open source clustering solution. RSF-1 costs a lot. So OmniOS should only be used as MySQL read slaves behind a pair of Linux based MySQL Masters using Pacemaker. Maybe you guys can get Galera working on OmniOS to solve our lack of clustering problem on illumos?

    June 8, 2013 at 11:49 am
  • Grenville Whelan


    “RSF-1 costs a lot”

    We sell RSF-1 for ZFS-storage systems for USD $5,000 per 2-node cluster including first year’s support and maintenance.
    “Costs a lot” is subjective but there are a large amount of features available in RSF-1 that bring enterprise HA features to the party, including COMSTAR/ALUA failover support, multi-node cluster support, strong disk-fencing safety mechanisms (including STONITH/SMITH support). It is also available for Solaris / OpenIndiana / illumos flavours / Linux and FreeBSD.

    June 21, 2013 at 5:59 am
  • Lari Pulkkinen

    We have also some experience on MySQL running on ZFS, mainly with InnoDB tables. We did some benchmarking using the storage via NFS and iscsi, since at that point the only option to use ZFS on Linux was FUSE (or we didn’t know about that project then). Performance varies a lot depending on NFS/iscsi configuration but we made similar conclusions as you did, for example snapshotting abilities are awesome.

    We also had some SSDs for L2ARC & ZIL, but with NFS or iSCSI there were some difficulties with the configuration: data wasn’t always flowing into the ZIL, causing a huge write performance hit of course. Compared to a SAN with SSD drives (plain iSCSI), the performance of the ZFS setup was only about one third of the SSD SAN. This was measured using tpcc-mysql.

    June 24, 2013 at 6:11 am
  • nv

    Our team has been testing a ZFS snapshot backup solution and is also considering testing an XFS/ZVOL db server solution (for the production DB servers) – it seems too good to be true.. the power of ZFS snapshots and the storage capacity is dream like. However, I’m still not sure if it really is truly production ready though.

    Although it is being flaunted as “ready for widescale deployment” I’m still somewhat skeptical. Apart from the licensing issues, the current version is at 0.6.1 and I’m sure that there are still going to be some fairly major code changes in order to accommodate the missing features – is it not a bit risky to set up a filesystem for corporate data that is still “under construction”?

    July 1, 2013 at 3:12 am
  • Richard Yao

    For a database, you will likely want to setup a dedicated dataset with recordsize=16K (the record size used by innodb according to someone above), primarycache=metadata, secondarycache=metadata and compression=lz4. That should give you best performance.

    July 2, 2013 at 2:29 am
  • Yves Trudeau

    @nv: I can’t say the Linux ZFS port is ready for wide scale deployments; we have been using it intensively for only about 1 month. So far so good though, no issues. On the port page, they mention that performance is not optimal yet.

    @Richard: I used about the same settings except for secondarycache; I’ll try this. Surprisingly, the compression algorithm changed the tpcc results only insignificantly. I tried gzip-1, gzip-6, lzjb and lz4, and they performed quite similarly, topping out at ~50% of the xfs/lvm twin server. Likely I am hitting another bottleneck, maybe the txg_sync stuff. I’ll likely write other blog posts about these and I also have some lvm stuff coming.

    July 3, 2013 at 9:21 am
  • nv

    If it’s just a case of non-optimal performance – I guess the trade-off for features is more than worthwhile. Our main concern here is stability and robustness – especially when using this tech for a backup solution, performance isn’t a major stopper.

    Perhaps it might be worthwhile sticking to a Solaris based distro and compiling Percona server from source for a ZFS backup solution initially and move to zfsonlinux at a later stage when it is more mature…

    July 4, 2013 at 5:25 am
  • whatever

    @Grenville Whelan

    Everything is relative, Grenville. What I meant by lot is relative to Linux/DRBD/Pacemaker for a pair of Linux MySQL Masters.

    Sometimes the cluster solution providers just don’t get it. $5000 is little when you are a bank doing HA credit card transactions, who by the way, can print a gazillion bazillion dollars on the fly by giving BS Bernanke a call. If you are a startup or a web company < $1 million capitalization, your pair of Database servers probably cost $5000 total. Adding another $5k for HA? No thanks, Linux+DRBD+Pacemaker will be the solution. You will end up running XFS on linux on Intel SSDs instead of ZFS+RSF-1.

    All I want to say is, HA is not a feature but a requirement for even a lot of smaller companies. RSF-1 is the only HA solution I know for illumos. If RSF-1 is smart, it should reposition itself to be the defacto HA solution for the open source illumos. At $5K a pair, you just won't be.

    July 4, 2013 at 5:39 am
  • Nils

    The question I’d ask myself is, would I want to do business with a company that doesn’t have $5000 to spend?

    July 4, 2013 at 5:44 am
  • whatever

    The question is: why spend $5000+ an annual subscription* 15 times market PE for something that Pacemaker does for free? Yes, relative to Oracle Solaris Cluster, RSF-1 is a steal. But relative to Pacemaker?

    As much as I love ZFS on OmniOS, because of the lack of free potent HA solution for illumos, I am only using it as MySQL slaves behind a pair of Linux Masters.

    RSF-1 is great. All I am saying is that it is missing the point when Linux HA solutions are abundant. It is really dumb to say that one should not deal with a company that doesn’t want to spend $5000. Most web companies do have $5000 to spend, but interestingly, just not on a clustering solution because they have other more important stuff to spend it on. RSF-1 is great for ZFS storage servers with many spindles. I do hope that RSF-1 can work with Napp-it to have HA on Napp-it.

    July 4, 2013 at 6:04 am
  • Grenville Whelan

    I believe the reason companies spend $5,000 on RSF-1 is because they not only want enterprise grade features to support their critical storage infrastructure, but because they also want professional 24×7 support and know there is a company behind the technology that continue to build and innovate within the framework of a commercial contract and most importantly, somebody to beat up if anything goes wrong.

    Sure there are plenty of free / open source technologies that can also do the job, but the ongoing development and support (and bespoke client integration) rarely falls within a contractual framework and you can be left on your own or at the goodwill of contributors to help.

    Like most things in life, horses for courses, but the choices are available.

    July 4, 2013 at 6:14 am
  • whatever

    The only reason companies spend $5000+subscription on RSF-1 is because Oracle closed the free Solaris Cluster and started charging $3000 per processor(measured by cores). Before Oracle acquired Sun, who the hell would want to buy RSF-1 when Solaris 10 was free and Solaris Cluster was also free?

    RSF-1 got lucky because Oracle went nuts on the banks who depend on Solaris Cluster. My perspective is this: now that RSF-1 is the only Oracle-free solution for illumos, there are a lot of potential for you guys if you guys aren’t the type who also want an arm and leg for your solution because Oracle went for our throat?

    Food for thought really.

    July 4, 2013 at 6:26 am
  • Nils

    Of course you can get Pacemaker for free. But installing and maintaining it will most likely not be free, you will need someone who knows what he’s doing for that and that’s a kind of skill set that commands a premium. What you paid in Hardware probably pales in comparison to that.

    I don’t think the price tag is unreasonable, especially compared to what Oracle charges. I would be afraid, however, of the vendor lock-in. As you have seen with Oracle, they can really put the screws to you once they have hooked you. You might not like it because it’s out of your price range for that particular task, but that’s just, like, your opinion man 😉

    Have you looked into FreeBSD as an alternative? It has had support for ZFS for a few years now, it’s free and there are some HA solutions. https://wiki.freebsd.org/ZFS

    July 4, 2013 at 6:56 am
  • whatever

    First of all, I do see different use cases for Solaris Cluster, RSF-1 and Linux Pacemaker. You use Solaris11+Solaris Cluster for your bank transactions and bank applications, where you can just “print” or “QE” your way out of the price tag. You use RSF-1 for 100TB+ ZFS storage servers so $5000 for RSF-1 is a small % of total purchasing cost, and you use Linux Pacemaker for horizontal scaled web database servers when you want friction=0.

    I did indeed look into FreeBSD/HAST/uCarp/ZFS as an alternative. But settled on Linux/DRBD/Pacemaker for the Masters purely for maturity reasons. FreeBSD ZFS isn’t as mature, and pacemaker is a better Cluster Resource Manager. I don’t think FreeBSD/HAST/uCarp/ZFS has a Cluster Resource Manager. uCarp is only an IP resource failover agent. It is so messed up. If RSF-1 had a community edition(with zero subscription and no support or community support only), then my database tier would become entirely Illumos based.

    July 4, 2013 at 11:00 am
  • Richard Yao

    @Yves I encourage you to file an issue in the ZFSOnLinux issue tracker so that we can track it. There are some performance improvements planned for 0.6.2 and even more improvements planned for 0.6.3. The performance improvements that will be in 0.6.2 are already in HEAD:


    I expect us to port the following performance related commits in 0.6.3:


    In addition, there is the following change by the ZFSOnLinux project which is under review:


    It is probably somewhat obvious from this post that I participate in upstream ZFSOnLinux development, but for full disclosure, I am the Gentoo Linux ZFS maintainer. 🙂

    P.S. I realize that you are not seeing much difference in performance between various compression algorithms in your workload, but my expectation is that LZ4 will turn out to be the best when whatever bottleneck you are hitting is resolved.

    July 4, 2013 at 6:26 pm
  • Vadim Tkachenko


    I posted an issue about ZFS and O_DIRECT about 2 years ago, and there have been no updates.
    This gives me the impression that the ZFSonLinux developers are not really receptive to external bug reporters.

    July 6, 2013 at 7:38 pm
  • Richard Yao

    Vadim, O_DIRECT was designed for in-place filesystems to allow IO to bypass the filesystem layer and caching. A literal implementation of O_DIRECT in a copy-on-write filesystem like ZFS is not possible (because checksum and parity calculations must be done). It is possible to implement it by effectively ignoring the O_DIRECT flag, but I imagine that would defeat the purpose. I imagine the main reason the solution where O_DIRECT is ignored has not been implemented is that Linux uses a different code path for O_DIRECT, and time spent implementing the separate code path is time that could be spent on other bugs.

    Most of the development of ZFSOnLinux over the past two years has focused on making it ready for the first stable release. Adding O_DIRECT support did not help contribute to that, so it was a low priority. I imagine that adding O_DIRECT support would occur rather quickly if someone were to write a patch to add it that works. However, adding it would be misleading unless O_DIRECT is implemented in a way that provided some kind of tangible benefit over not using it.

    Lastly, ZFSOnLinux development is done by a few professional developers at LLNL and volunteers, such as myself, that happen to use it. LLNL uses ZFSOnLinux as an OSD for the Lustre filesystem on the Sequoia supercomputer while volunteers tend to use it on either servers or desktops. O_DIRECT is currently scheduled for a release rather far in the future because none of us have any need for O_DIRECT. It should be possible to configure your software to not use O_DIRECT, so doing it sooner does not seem like it should be a priority.

    July 7, 2013 at 8:56 am
  • javi

    Can anyone help?
    I’m running virtual machines on ZFS pool storage and I’m seeing heavy performance degradation.
    The disk images of these machines are stored on a storage array that runs Solaris.
    The pool is made up of 26 devices: 10 mirrors (600GB SAS disks) + 1 mirror log + 4 cache.
    The virtual machines run Linux and host Apache and MySQL services. I think the main problem is the small pool recordsize combined with the heavy MySQL load, but I’m not sure what size I should use.
    At first I thought the best option was to give the pool the same recordsize as the block size of the guests’ filesystems, 4K in this case (the current configuration, which shows the performance degradation), but some sites say this is not optimal practice and recommend a 64K or 128K recordsize for filesystems in the pool.
    I read that when a pool created with a small recordsize becomes fragmented, it is harder to find empty blocks in each metaslab than with a bigger recordsize.
    I’m a bit of a newbie with ZFS and I’m not sure what the optimal value for my purpose is.
    Is it a good idea to set the pool recordsize to 64K/128K for virtual machines that run 4K block size filesystems?

    Thanks in advance.

    October 28, 2013 at 5:26 pm
  • pondix

    Are you running snapshots or auto-snapshots? Did you set the ashift=12 param?

    October 28, 2013 at 5:32 pm
  • javi

    At first I used auto-snapshots, but then the first performance problems appeared, so I deleted the snapshots and disabled auto-snapshots.
    After that the performance improved, but some weeks later the problems came back.
    I know that the problem was the space occupied by the snapshots in the pool, not the fact of using snapshots.
    I’m having the problems now, and the only solution I have is to delete some GB of data to free some space and defragment the pool a bit.
    The first time, the problems happened at 72% pool usage, but now I still have them at around 60%.
    I’m not using the ashift=12 param at the moment.

    October 28, 2013 at 6:37 pm
  • pondix

    I’ve seen 15% increase in performance caused by ashift=12, but you have to rebuild your pool =(

    Try this – it should help with your memory management, allocate a fixed amount of RAM to your ARC:

    # monitor it like this (it’s the c_xxx values):
    cat /proc/spl/kstat/zfs/arcstats
    c 4 536870912
    c_min 4 67108864
    c_max 4 536870912
    size 4 536948488

    # manage it like this:
    vi /etc/modprobe.d/zfs.conf

    # add the following lines – I would allocate as much RAM as possible – say dedicate about 25% of memory or at least 2GB
    options zfs zfs_arc_min=536870912
    options zfs zfs_arc_max=2147483648

    # Then save the file (the module options take effect the next time the zfs module is loaded) and change these settings by running the following commands on your pool and datasets:
    zfs set compression=lzjb <pool>/<dataset> # setting it on both the pool and its datasets works great – especially for DBs!
    zfs set dedup=off <pool>/<dataset> # set it off on both your pool and its datasets – unless you have LOADS of RAM to throw at it

    Consider using a ZVOL and creating an XFS / EXT4 volume on it instead of using ZFS directly, or even allocating the disks directly to the VMs for better I/O.

    Remember, snapshots cause COW overhead and have increased memory requirements – monitor your ZFS statistics, even try graphing them to keep track of the FS health.

    October 28, 2013 at 7:10 pm
  • javi

    Hi pondix,

    At the moment I can’t rebuild the pool, the system is in production now…
    I want to move all the data off this pool and then rebuild it with more vdevs, and also modify the pool recordsize.
    Also I will consider the ashift param