Buy Percona ServicesBuy Now!

About ZFS Performance

 | May 15, 2018 |  Posted In: AWS, Hardware and Storage, MySQL, ZFS

PREVIOUS POST
NEXT POST

If you are a regular reader of this blog, you likely know I like the ZFS filesystem a lot. ZFS has many very interesting features, but I am a bit tired of hearing negative statements on ZFS performance. It feels a bit like people are telling me “Why do you use InnoDB? I have read that MyISAM is faster.” I found the comparison of InnoDB vs. MyISAM quite interesting, and I’ll use it in this post.

To have some data to support my post, I started an AWS i3.large instance with a 1000GB gp2 EBS volume. A gp2 volume of this size is interesting because it is above the burst IOPS level, so it offers a constant 3000 IOPS performance level.

I used sysbench to create a table of 10M rows and then, using export/import tablespace, I copied it 329 times. I ended up with 330 tables for a total size of about 850GB. The dataset generated by sysbench is not very compressible, so I used lz4 compression in ZFS. For the other ZFS settings, I used what can be found in my earlier ZFS posts but with the ARC size limited to 1GB. I then used that plain configuration for the first benchmarks. Here are the results with the sysbench point-select benchmark, a uniform distribution and eight threads. The InnoDB buffer pool was set to 2.5GB.

In both cases, the load is IO bound. The disk is doing exactly the allowed 3000 IOPS. The above graph appears to be a clear demonstration that XFS is much faster than ZFS, right? But is that really the case? The way the dataset has been created is extremely favorable to XFS since there is absolutely no file fragmentation. Once you have all the files opened, a read IOP is just a single fseek call to an offset and ZFS doesn’t need to access any intermediate inode. The above result is about as fair as saying MyISAM is faster than InnoDB based only on table scan performance results of unfragmented tables and default configuration. ZFS is much less affected by the file level fragmentation, especially for point access type.

More on ZFS metadata

ZFS stores the files in B-trees in a very similar fashion as InnoDB stores data. To access a piece of data in a B-tree, you need to access the top level page (often called root node) and then one block per level down to a leaf-node containing the data. With no cache, to read something from a three levels B-tree thus requires 3 IOPS.

Simple three levels B-tree

The extra IOPS performed by ZFS are needed to access those internal blocks in the B-trees of the files. These internal blocks are labeled as metadata. Essentially, in the above benchmark, the ARC is too small to contain all the internal blocks of the table files’ B-trees. If we continue the comparison with InnoDB, it would be like running with a buffer pool too small to contain the non-leaf pages. The test dataset I used has about 600MB of non-leaf pages, about 0.1% of the total size, which was well cached by the 3GB buffer pool. So only one InnoDB page, a leaf page, needed to be read per point-select statement.

To correctly set the ARC size to cache the metadata, you have two choices. First, you can guess values for the ARC size and experiment. Second, you can try to evaluate it by looking at the ZFS internal data. Let’s review these two approaches.

You’ll read/hear often the ratio 1GB of ARC for 1TB of data, which is about the same 0.1% ratio as for InnoDB. I wrote about that ratio a few times, having nothing better to propose. Actually, I found it depends a lot on the recordsize used. The 0.1% ratio implies a ZFS recordsize of 128KB. A ZFS filesystem with a recordsize of 128KB will use much less metadata than another one using a recordsize of 16KB because it has 8x fewer leaf pages. Fewer leaf pages require less B-tree internal nodes, hence less metadata. A filesystem with a recordsize of 128KB is excellent for sequential access as it maximizes compression and reduces the IOPS but it is poor for small random access operations like the ones MySQL/InnoDB does.

To determine the correct ARC size, you can slowly increase the ARC size and monitor the number of metadata cache-misses with the arcstat tool. Here’s an example:

You’ll want the ‘mm%’, the metadata missed percent, to reach 0. So when the ‘arcsz’ column is no longer growing and you still have high values for ‘mm%’, that means the ARC is too small. Increase the value of ‘zfs_arc_max’ and continue to monitor.

If the 1GB of ARC for 1TB of data ratio is good for large ZFS recordsize, it is likely too small for a recordsize of 16KB. Does 8x more leaf pages automatically require 8x more ARC space for the non-leaf pages? Although likely, let’s verify.

The second option we have is the zdb utility that comes with ZFS, which allows us to view many internal structures including the B-tree list of pages for a given file. The tool needs the inode of a file and the ZFS filesystem as inputs. Here’s an invocation for one of the tables of my dataset:

The last section of the above output is very interesting as it shows the B-tree pages. The ZFSB-tree of the file sbtest1.ibd has four levels. L3 is the root page, L2 is the first level (from the top) pages, L1 are the second level pages, and L0 are the leaf pages. The metadata is essentially L3 + L2 + L1. When you change the recordsize property of a ZFS filesystem, you affect only the size of the leaf pages.

The non-leaf page size is always 16KB (4000L) and they are always compressed on disk with lzop (If I read correctly). In the ARC, these pages are stored uncompressed so they use 16KB of memory each. The fanout of a ZFS B-tree, the largest possible ratio of a number of pages between levels, is 128. With the above output, we can easily calculate the required size of metadata we would need to cache all the non-leaf pages in the ARC.

So, each of the 330 tables of the dataset has 1160 non-leaf pages and 145447 leaf pages; a ratio very close to the prediction of 0.8%. For the complete dataset of 749GB, we would need the ARC to be, at a minimum, 6GB to fully cache all the metadata pages. Of course, there is some overhead to add. In my experiments, I found I needed to add about 15% for ARC overhead in order to have no metadata reads at all. The real minimum for the ARC size I should have used is almost 7GB.

Of course, an ARC of 7GB on a server with 15GB of Ram is not small. Is there a way to do otherwise? The first option we have is to use a larger InnoDB page size, as allowed by MySQL 5.7. Instead of the regular Innodb page size of 16KB, if you use a page size of 32KB with a matching ZFS recordsize, you will cut the ARC size requirement by half, to 0.4% of the uncompressed size.

Similarly, an Innodb page size of 64KB with similar ZFS recordsize would further reduce the ARC size requirement to 0.2%. That approach works best when the dataset is highly compressible. I’ll blog more about the use of larger InnoDB pages with ZFS in a near future. If the use of larger InnoDB page sizes is not a viable option for you, you still have the option of using the ZFS L2ARC feature to save on the required memory.

So, let’s proposed a new rule of thumb for the required ARC/L2ARC size for a a given dataset:

  • Recordsize of 128KB => 0.1% of the uncompressed dataset size
  • Recordsize of 64KB => 0.2% of the uncompressed dataset size
  • Recordsize of 32KB => 0.4% of the uncompressed dataset size
  • Recordsize of 16KB => 0.8% of the uncompressed dataset size

The ZFS revenge

In order to improve ZFS performance, I had 3 options:

  1. Increase the ARC size to 7GB
  2. Use a larger Innodb page size like 64KB
  3. Add a L2ARC

I was reluctant to grow the ARC to 7GB, which was nearly half the overall system memory. At best, the ZFS performance would only match XFS. A larger InnoDB page size would increase the CPU load for decompression on an instance with only two vCPUs; not great either. The last option, the L2ARC, was the most promising.

The choice of an i3.large instance type is not accidental. The instance has a 475GB ephemeral NVMe storage device. Let’s try to use this storage for the ZFS L2ARC. The warming of an L2ARC device is not exactly trivial. In my case, with a 1GB ARC, I used:

I then ran ‘cat /var/lib/mysql/data/sbtest/* > /dev/null’ to force filesystem reads and caches on all of the tables. A key setting here to allow the L2ARC to cache data is the zfs_arc_meta_limit. It needs to be slightly smaller than the zfs_arc_max in order to allow some data to be cache in the ARC. Remember that the L2ARC is fed by the LRU of the ARC. You need to cache data in the ARC in order to have data cached in the L2ARC. Using lz4 in ZFS on the sysbench dataset results in a compression ration of only 1.28x. A more realistic dataset would compress by more than 2x, if not 3x. Nevertheless, since the content of the L2ARC is compressed, the 475GB device caches nearly 600GB of the dataset. The figure below shows the sysbench results with the L2ARC enabled:

Now, the comparison is very different. ZFS completely outperforms XFS, 5000 qps for ZFS versus 3000 for XFS. The ZFS results could have been even higher but the two vCPUs of the instance were clearly the bottleneck. Properly configured, ZFS can be pretty fast. Of course, I could use flashcache or bcache with XFS and improve the XFS results but these technologies are way more exotic than the ZFS L2ARC. Also, only the L2ARC stores data in a compressed form, maximizing the use of the NVMe device. Compression also lowers the size requirement and cost for the gp2 disk.

ZFS is much more complex than XFS and EXT4 but, that also means it has more tunables/options. I used a simplistic setup and an unfair benchmark which initially led to poor ZFS results. With the same benchmark, very favorable to XFS, I added a ZFS L2ARC and that completely reversed the situation, more than tripling the ZFS results, now 66% above XFS.

Conclusion

We have seen in this post why the general perception is that ZFS under-performs compared to XFS or EXT4. The presence of B-trees for the files has a big impact on the amount of metadata ZFS needs to handle, especially when the recordsize is small. The metadata consists mostly of the non-leaf pages (or internal nodes) of the B-trees. When properly cached, the performance of ZFS is excellent. ZFS allows you to optimize the use of EBS volumes, both in term of IOPS and size when the instance has fast ephemeral storage devices. Using the ephemeral device of an i3.large instance for the ZFS L2ARC, ZFS outperformed XFS by 66%.

PREVIOUS POST
NEXT POST
Yves Trudeau

Yves is a Principal Architect at Percona, specializing in distributed technologies such as MySQL Cluster, Pacemaker and XtraDB cluster. He was previously a senior consultant for MySQL and Sun Microsystems. He holds a Ph.D. in Experimental Physics.

12 Comments

  • Why would you use a L2ARC if you could increase the innodb pool buffer size about the same size? This does not make any sense and you can not compare the cache results! The pool buffer should outperform the L2ARC if warmed up, too.

    • Of course it would be faster to have half a TB of buffer pool but that would be way more expansive. A r4.16xlarge has the RAM but it costs about 27 times more than an I3.large instance, that could be a bummer…

    • That’s the problem with doing this with MySQL right there. The ARC, page cache and buffer pool will compete for memory, worst case is having the same page buffered thrice.

      • Thanks for pointing that out Nils. A pool buffer is EXACTLY the same as the L2ARC of ZFS. This is absolutly nonse what you benchmarked.

        @Yves: Compare the same size of memories on both variants.

        16 GB RAM VM.

        Bench 1: 12 GB pool buffer
        vs
        Bench 2: 6 GB pool buffer + 6 gb zfs cache
        vs
        Bench 3: 6 GB pool buffer

        I expect in a production env:
        Bench 1 > Bench 2 > Bench 3

        In you test:
        Bench 2 > Bench 3 > Bench 1

        Because: You have a warmed up cache vs. a cold cache.

        • The dataset size is 850GB, that makes the buffer pool size almost irrelevant since a very tiny fraction of the dataset will be cache in it anyway. A buffer pool of 12GB represents 1.4% of the dataset while a buffer pool of 6GB, 0.7%. In both cases, around 99% of the queries will require a read from the EBS volume. Now, the L2ARC in on disk, on the NVMe ephemeral drive and it can store about 600GB of the dataset with compression. That’s quite different. Now, 70% of the dataset is cached so only 30% of the queries will require a read from the EBS volume.

          • If you compare a server with hdd vs a server with ssd (NVMe) … that is like comparing a gasoline vs a diesel.

            You can also create a read-cache with Devicemapper and ext4. What would be a “fair” fight 😉 I’ve used this on a Xen Infrastructure and was quite nice.

            https://rwmj.wordpress.com/2014/05/22/using-lvms-new-cache-feature/

  • Great post Yves! Am I right by saying that you gave an additional advantage to XFS by not compressing the tables with the native InnoDB compression ? I recall someone choosing ZFS due several features, being one of those: compression on the filesystem performing better than on the tables.

    • That’s a good point Daniel, InnoDB would hit XFS performance. Since the load was IO bound, I wonder if the CPU load would rise enough with InnoDB compression to move the bottleneck to the CPU.

  • Doing first a benchmark in favour of XFS and then doing the same for ZFS is not a plausible benchmark methology. You already stated thst you could use bcache for XFS, but it is exotic.
    So in reality your last benchmark was good for ZFS because the system was able to use an extra device for performance tuning, that is no realistic result.

    About all benchmarks done by percona to show ZFS are completely flawed by design. Its a pitty because many people would be really interested in COMPARABLE results for xfs/ext4 to zfs and this means trying to achieve the same, e.g. bcache for xfs.

    • Although done for a very special use case, a general comparison of xfs/ext4/zfs can be found here: http://ebert.homelinux.org/publication1.html
      (the conference contribution about it: http://ebert.homelinux.org/pdf2.html)
      However, it doesn’t not test random read/writes since that was not a use case of the system in question.

Leave a Reply