October 25, 2014

FlashCache: more benchmarks

Previously I covered a simple case with FlashCache, where the data fits into the cache partition; now I am trying to test what happens when the data is bigger than the cache.

But before the test setup, let me address a concern (which I also had). The Intel X25-M has a write cache which is not battery-backed, so there is a suspicion that you may lose data in the case of a power outage.
And in the case of FlashCache that would mean you can send your database to the trash, as there is no way to recover from that (only restore from backup).
I personally did a couple of power failure tests, and there is an article on this topic: http://www.anandtech.com/show/2614/10. I did not see any data loss in my tests, and the article says that the write cache “..isn’t used for user data because of the risk of data loss, instead it is used as memory by the Intel SATA/flash controller for deciding exactly where to write data..”. So I assume we should be safe running the Intel X25-M with the write cache enabled.

The second issue I faced, and which took quite some effort to sort out (thanks to Mohan Srinivasan, the maintainer of FlashCache, for his help with that), is 16KB alignment in XFS and FlashCache. Although the developers recommend a 16KB block size in XFS and a 16KB block size in FlashCache, I was not able to make it work; there are too many moving parts: XFS alignment by RAID stripe size, XFS aligned file allocation, the RAID partition having to be created aligned to a 16KB boundary, etc. As a result I had too many cache misses
and the results were not impressive, so I had to move to the standard-for-filesystems 4K blocks.
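
To see why the alignment matters, here is a rough back-of-the-envelope sketch in Python (my own illustration, not FlashCache code): a 16KB page read that does not start on a cache-block boundary spans more cache blocks than an aligned one, which inflates the number of lookups and misses.

    # Rough illustration (not FlashCache code) of why alignment matters:
    # an unaligned 16KB page read spans more cache blocks than an aligned one.
    PAGE_SIZE = 16 * 1024  # 16KB InnoDB page

    def blocks_touched(offset, cache_block, size=PAGE_SIZE):
        """Number of cache blocks a read of `size` bytes at `offset` spans."""
        first = offset // cache_block
        last = (offset + size - 1) // cache_block
        return last - first + 1

    # With 16KB FlashCache blocks an aligned page read maps to exactly one block,
    # but a read shifted by 4KB (e.g. an unaligned partition) touches two.
    print(blocks_touched(0, 16 * 1024))         # -> 1
    print(blocks_touched(4 * 1024, 16 * 1024))  # -> 2
    # With 4KB blocks the same reads touch 4 blocks either way, so alignment
    # is much less of a problem.
    print(blocks_touched(0, 4 * 1024))          # -> 4
    print(blocks_touched(4 * 1024, 4 * 1024))   # -> 4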

For the benchmark I used sysbench OLTP with 300 mln rows, which gives about 70GB of data. I allocated 35GB for the cache on the SSD, and I used 16GB and 10GB buffer pool sizes. I ran sysbench with the special distribution, not the uniform one; the reason is that I do not really expect a big improvement from purely uniform random hits, as in that case we have a 1/2 probability of going to disk for a read, and there is also overhead in FlashCache for flushing and replacing pages. With the special distribution, 75% of all requests hit 1% of the data, and the rest of the data is accessed in the remaining 25% of cases. I think it also describes a real workload better, as you do not expect 100% of your users to come to your website at the same time.
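
To make the access pattern concrete, here is a minimal sketch in Python of this simplified special distribution (an illustration only, not sysbench's actual implementation; the row count and percentages are the ones quoted above):

    # Simplified model (not sysbench code) of the 'special' distribution used here:
    # 75% of requests hit the hottest 1% of rows, the rest are spread uniformly.
    import random

    TABLE_ROWS = 300_000_000   # 300 mln rows, about 70GB of data
    HOT_FRACTION = 0.01        # the "hot" 1% of the data
    HOT_PROBABILITY = 0.75     # 75% of requests go to the hot set

    def next_row_id():
        """Pick a row id according to the simplified special distribution."""
        if random.random() < HOT_PROBABILITY:
            return random.randrange(int(TABLE_ROWS * HOT_FRACTION))  # hot request
        return random.randrange(TABLE_ROWS)                          # cold request

    # Quick check: roughly 25% * 99% ~ 24.8% of requests fall outside the hot set.
    sample = [next_row_id() for _ in range(100_000)]
    cold = sum(r >= TABLE_ROWS * HOT_FRACTION for r in sample)
    print(f"accesses outside the hot 1%: {cold / len(sample):.1%}")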

The full details are here:
http://www.percona.com/docs/wiki/benchmark:flashcache:sysbench:300mln_rows:start

And short results:

buffer_pool, GB | RAID10 | FlashCache 20%
16              | 468.76 | 985.47
10              | 328.34 | 704.36

As you can see, there is more than a 2x improvement, which is quite impressive.

Of course you do not have to believe benchmarks, so I encourage you to test it yourself (my full scripts and MySQL config files are on the Wiki page).
There are binaries for CentOS 5.5 and Ubuntu 10.04 if you are not so keen on compiling kernel modules:
http://percona.com/downloads/TESTING/FlashCache/centos-2.6.18-194.el5.tar.gz

http://percona.com/downloads/TESTING/FlashCache/ubuntu.10.04.kernel-2.6.32-21-server.tar.gz

About Vadim Tkachenko

Vadim leads Percona's development group, which produces Percona Cloud Tools, Percona Server, Percona XtraDB Cluster and Percona XtraBackup. He is an expert in solid-state storage, and has helped many hardware and software providers succeed in the MySQL market.

Comments

  1. Nils says:

    Is it even possible to use 16KB blocks with XFS? I remember the limit being <= PAGESIZE on Linux.

  2. peter says:

    Vadim,

    Great gains considering you could just use a low-cost X25-M.
    It would be good to see how it impacts something like TPC-C too.

    Regarding durability, the flash itself is not the only moving part – it is a good question whether all changes to the cache become immediately durable. Did you try shutting off power while under load with FlashCache and checking if the database recovers completely?

    Also, is there a write-through mode available? I believe for some workloads it will be enough to achieve performance gains, and it is a lot safer. It would also allow using a single SSD for the cache more or less safely, as with a writeback cache this is effectively the same as running on an SSD with no RAID.

  3. domas says:

    @Nils, it doesn’t matter how big the file system block is: if you have a contiguous 16k data access over a 4k block size FS, it will all be done as a single read. So you can use 16KB blocks on top; all you need is to have them aligned, so that you don’t cross boundaries too much.

  4. domas says:

    oh, and also, I think that ‘special’ is flawed, as the 1% will go to the buffer pool, not the flash cache.

    you need to test with distributions that have curved properties, as the middle part of the curve is what flashcache targets.

  5. Vadim says:

    domas,

    I agree that special is not fully representative; however, 25% of requests still access the full data range,
    and those are not all buffer pool hits. And from the numbers we can see that the cache is still effective even in this case.

  6. mohan says:

    Peter

    Flashcache does not support write through (yet). It is writeback only.

    I have an implementation of write through (outside of flashcache, so it would have to be a different module) that I need to dust off and test some more. I wanted to get the writeback cache out first, because more workloads will benefit from a writeback cache.

  7. Evan Jones says:

    The Intel X25-M is *not* durable with the write cache enabled, at least not in my tests. I have a brand new X25-M G2, and here is how I get data loss on ext4 with write barriers *enabled*:

    1. Create a new 64 MB file, filled with zeros.
    2. fsync()
    3. Begin overwriting 4kB blocks in this file, doing fdatasync() after each block.
    4. Pull the power out of the SSD (critical: when pulling power out of the server, the data seems to survive)
    5. Reboot and look at the file on disk.

    When I do this with the write cache enabled, I lose data. With the write cache disabled, it seems to work.
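
    In rough Python pseudocode, the write loop looks like this (a sketch of the procedure above, not the actual test program; the file path is just an example):

    import os

    PATH = "/mnt/ssd/testfile"      # example path on the SSD under test
    FILE_SIZE = 64 * 1024 * 1024    # step 1: 64 MB file filled with zeros
    BLOCK = 4 * 1024                # step 3: overwrite in 4kB blocks

    fd = os.open(PATH, os.O_CREAT | os.O_RDWR, 0o644)
    os.write(fd, b"\0" * FILE_SIZE)
    os.fsync(fd)                    # step 2: fsync() the zero-filled file

    for i in range(FILE_SIZE // BLOCK):
        os.lseek(fd, i * BLOCK, os.SEEK_SET)
        os.write(fd, bytes([i % 255 + 1]) * BLOCK)  # non-zero marker for this block
        os.fdatasync(fd)            # the block should be durable once this returns
        # step 4: pull the power somewhere in this loop
    os.close(fd)
    # step 5: after reboot, a zeroed block at or before the last acknowledged
    # fdatasync() means the drive lost writes it had already acknowledged.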

  8. Evan Jones says:

    Actually, now that I am repeating this experiment, I’m not losing data. Perhaps I have made a mistake. I will repeat this a few times and see what happens …

  9. Vadim says:

    Evan,

    Thanks,
    It would be very useful to have a definitive result on whether it is possible to lose data or not.
    If you have lost your data, you may contact Intel, as this is not what is expected, as I understand it.

  10. Evan Jones says:

    I’ve carefully repeated my experiments, and now I am more confident: My Intel X25-M G2 loses data, even with the write cache disabled, when writes are larger than 4 kB. I’m hoping to contact someone at Intel who might be able to say if this is expected or not. However, this means that if you really want your MySQL commits to be durable, you cannot use this disk, since InnoDB log writes are occasionally larger than 4 kB. For details, see the following article, which I will keep updated as I learn more:

    http://evanjones.ca/intel-ssd-durability.html

  11. Vadim says:

    Evan,

    Thank you for sharing the results, that’s scary.
    Do you see data loss even with the write cache disabled?

    Did you try XFS?

  12. Evan Jones says:

    Good suggestion. I repeated the experiment with XFS. It lost data on the SSD with the write cache disabled in 2 out of 5 attempts. Data was not lost on my two magnetic disks in 5 out of 5 attempts.

  13. Henrik says:

    @Evan: Did you (or others) do similar pull-the-ssd-power-plug tests with an X25-E? I’m curious to learn if it suffers the same issue you see with the X25-M G2.

  14. Henrik says:

    @Evan: Just noticed this from your other page on the subject (http://evanjones.ca/intel-ssd-durability.html). ” The good news is that I was able to test Intel’s “enterprise” SSD (X25-E), and it works as expected.” Just what I wanted to hear :)

  15. Evan Jones says:

    Yes, in my tests it appeared to work. However, I was testing them on an older spare server. I did some tests with the X25E recently on a newer server, and I *may* have seen data loss. I’m not totally sure, however, because the tests weren’t very careful. Unfortunately, I haven’t had time to repeat the tests. In other words: if you are truly paranoid, you may want to test it yourself. The software is available on the web site you linked above.
