October 22, 2014

SSD, XFS, LVM, fsync, write cache, barrier and lost transactions

We finally managed to get an Intel X25-E SSD drive into our lab, and I attached it to our Dell PowerEdge R900. The story of getting it running deserves a separate mention – along with the Intel X25-E I got a HighPoint 2300 controller, and CentOS 5.2 simply could not start with two RAID controllers (Perc/6i and HighPoint 2300). The problem was solved by installing Ubuntu 8.10, which is now running the whole system. Originally I wanted to publish some nice benchmarks where InnoDB on SSD outperforms RAID 10, but recently I ran into an issue which may make those earlier results inconsistent.

In short: using the Intel X25-E SSD with the write cache enabled (which is the default and the fastest mode) does not guarantee that all committed InnoDB transactions reach permanent storage.
I am having some déjà vu here, as Peter raised this five years ago ( http://lkml.org/lkml/2004/3/17/188 ) regarding regular IDE disks, and I did not expect this question to pop up again.

The long story:
I started by putting XFS on the SSD and running a very primitive test: INSERT INTO fs VALUES(0) into an auto-increment field of an InnoDB table.

The most interesting InnoDB parameters are innodb_flush_log_at_trx_commit=1 and innodb_flush_method=O_DIRECT (I also tried the default innodb_flush_method, with the same result). With innodb_flush_log_at_trx_commit=1 I expect to keep all committed transactions even in case of a system failure.
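The two settings above can be written as a my.cnf fragment. This is only a sketch – the rest of the parameter listing used for the test is not shown in the post:

```ini
[mysqld]
# flush (fsync) the InnoDB log at every transaction commit –
# required if committed transactions must survive a crash
innodb_flush_log_at_trx_commit = 1
# bypass the OS page cache for InnoDB data files
innodb_flush_method = O_DIRECT
```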

Running this test with the default XFS settings, I saw the SSD doing only 50 writes/s – something that forced me to check the results several times. Come on, it’s an SSD, we should get much more I/O than that. Investigation led me to the barrier/nobarrier mount options, and with mount -o nobarrier I got 5300 writes/s. A nice difference, and this is what we want from an SSD.
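In mount terms, the two modes compared above look like this (the device name and mount point are examples, not from the post):

```shell
# Default XFS mount – write barriers enabled (~50 writes/s in this test):
mount -t xfs /dev/sdb1 /mnt/ssd
# Barriers disabled – ~5300 writes/s here, but unsafe while the drive's
# volatile write cache is enabled:
mount -o remount,nobarrier /mnt/ssd
```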

Now, to test durability, I pull the power from the SSD and check how many transactions were really stored – and there is the second bummer: the last N committed transactions are missing.
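The bookkeeping for this check can be scripted: remember the last auto-increment id the insert loop acknowledged, then compare it with the highest id found in the table after the restart. A minimal sketch with made-up example values – in practice `survived` would come from something like `mysql -N -e 'SELECT MAX(id) FROM test.fs'` (the column name is an assumption):

```shell
# Last id echoed by the insert loop before the power pull (example value):
last_acked=5300
# Highest id actually present in the table after restart (example value):
survived=5270
# With innodb_flush_log_at_trx_commit=1 and honest durable writes,
# this difference should always be 0:
echo "lost transactions: $(( last_acked - survived ))"
```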

So now it is time to turn off the write cache on the SSD – all transactions are in place now, but the write speed is only 1200 writes/s, which is comparable with RAID 10.
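Turning the drive cache off is done with hdparm (Vadim posts the same command later in the comments); the device name is an example:

```shell
# Disable the drive's volatile write cache:
hdparm -W 0 /dev/sdb
# Verify the current cache setting:
hdparm -W /dev/sdb
```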

So, in conclusion: to guarantee durability with this SSD we have to disable the write cache, which can affect performance significantly (I have no numbers at hand, but it is to be tested).

What about LVM? Well, we often recommend LVM for backup purposes (even though recent results are bad, we have no good replacement yet), so I tried XFS on top of LVM. With the write cache ON and the default mount options (i.e. with barriers) I get 5250 writes/s – this is because LVM ignores write barriers (see http://dammit.lt/2008/11/03/xfs-write-barriers/ ) – but again, with the write cache enabled you may lose transactions.

So the final conclusions are:
1. The Intel X25-E SSD is NOT reliable in its default mode.
2. To get durability we need to disable the write cache (with a corresponding performance penalty; how large is still to be tested).
3. A possible solution could be to put the SSD behind a RAID controller with a battery-backed write cache, but I am not sure which ones are good – another area for research.
4. XFS without LVM applies the barrier option by default, which decreases write performance a lot.

About Vadim Tkachenko

Vadim leads Percona's development group, which produces Percona Cloud Tools, Percona Server, Percona XtraDB Cluster and Percona XtraBackup. He is an expert in solid-state storage, and has helped many hardware and software providers succeed in the MySQL market.

Comments

  1. peter says:

    It would be very interesting to check with Intel what their stand is on this.

    These are supposed to be “Enterprise” drives, so it is very strange that they come with such an unsafe option as the default. It would also be worth checking what actually floats on the SATA wire in this case – do these drives really ignore both the “do not cache” flag, which should be set for O_DSYNC/O_DIRECT writes, and the cache flush which should be passed with fsync?

  2. rocky says:

    What kind of tool did you use for testing writes? And what parameters?

  3. Vadim says:

    rocky,

    it’s a very simple PHP script:

    <?php

    mysql_connect("localhost:/tmp/mysql.sock", "root", "");
    mysql_query("use test");

    while (true) {
        mysql_query("INSERT INTO fs VALUES (0)");
        echo mysql_insert_id();
        echo "\n";
    }
    ?>

    for writes/s you can watch iostat -dx 5

  4. Phil says:

    Did you try aligning the blocks in the FS to the blocks of the SSD?

    http://thunk.org/tytso/blog/2009/02/20/aligning-filesystems-to-an-ssds-erase-block-size/

  5. Sean says:

    Vadim, do you have any tests related to read-oriented queries, i.e. 90%+ reads, using InnoDB and MyISAM?

    -sean

  6. Vadim says:

    Phil,

    No, I did not try that – do you think it will help with fsync?

  7. Vadim says:

    Sean,

    Are you asking about reads on SSD or in general?
    We have a lot of numbers, we just need to sort them out :)

  8. Sean says:

    Hi Vadim,

    Yes, I’m interested in reads on SSD for both InnoDB and MyISAM, though primarily MyISAM, given our environment houses static data (compressed tables). I’ve read the papers and had presentations from vendors, but have not had the opportunity to get some in-house. I’m curious to know how they perform in the described situation, but also the TCO – how much could be saved on server hardware. Given SSDs respond to reads within 0.2–0.4 ms, does a server need 32, 64, 128G of memory anymore?

    Sean

  9. FYI, starting in 2.6.29, LVM will start respecting write barriers. (finally!)

  10. Mark Callaghan says:

    @Sean — there is a great paper on where to spend money (RAM, Flash, disk) and it addresses your question — spend more on Flash and less on RAM. I have not seen anyone publish benchmark results based on the ideas in the paper — http://www.cs.cmu.edu/~damon2007/pdf/graefe07fiveminrule.pdf

  11. TS says:

    @Vadim:

    2. To have durability we need to disable write cache. (Basically reducing random write IOPS from 5000+ to 1200ish.)

    Is this issue XFS-specific, MySQL-specific, or is it hardware? Does it happen with ext3 under Linux? If it is a filesystem or MySQL issue, then it is not too bad. If it is a hardware issue, Intel has basically done false advertising. The enterprise market requires durable writes. If it is indeed a hardware issue, there is no reason people would pay the ridiculous price of the X25-E if they are forced to disable its write cache and accept 1200 random write IOPS instead of 5K+. I can foresee a drive recall, or at least a huge price reduction, followed by a PCB revision with a supercapacitor-backed design.

    This issue must be broadcast to the entire SQL database community. I am sure a lot of people are picking the X25-E for their DB servers right now, and they might be risking their data. Preferably, Intel should give an official word on this issue too.

  12. Vadim says:

    TS

    Currently I believe this is a hardware issue, not a filesystem or MySQL one. It seems the Intel X25-E does not have a battery to protect its write cache, and simply loses it on power failure. To claim this with 100% certainty we need to run some more experiments, but for now I would not put a database that requires 100% durability on it.

  13. Andi Kleen says:

    Sorry, but you realize that nobarrier is the likely cause of the data loss, right? With barriers, XFS fsync (but not necessarily ext3 fsync) would wait for a write barrier on the log commit, and thus also for the data. O_SYNC might be different, though.

    Basically you specified the “please go unsafe but faster” option and then complain that it is actually unsafe.

    I would recommend doing the power-off test without nobarrier but with the write cache on.

    -Andi

  14. Vadim says:

    Andi,

    I wrote that in the post. With barriers and the write cache on we get 50 writes/s, which I consider not just “slower” but a disaster I would not put on a production system.

  15. Andi Kleen says:

    It’s the cost of hitting the media. The unsafe mode you chose is always faster.

    BTW, LVM has been fixed in 2.6.29: if you only have a single backing disk it will pass barriers through. Still not for the multiple-disk case, which is questionable.

  16. Vadim says:

    Andi,

    This cost is way too big. In this case RAID 10 on 8 disks + BBU is cheaper and gives much better results.
    That simply means this SSD can’t be used as the medium for a high-performance durable database.

  17. Why not just add a UPS to your system?

  18. B says:

    @ Justmy2cents,

    I totally agree. Who isn’t running UPSes on their DB servers? Even an old one that could hold the system up for 30 seconds would be enough, as long as further transactions didn’t keep coming in during that time.

    B

  19. Mark Callaghan says:

    When will Amazon provide a UPS on EC2 servers?

  20. A UPS isn’t the whole solution. What happens when your power supply (the one inside the server) fails, for example? “Keep the power from going off” and “keep the server from crashing” are not the same thing as “I want this device not to lie when I ask for this data to be written to durable storage.”

  21. B says:

    True, but even lower-cost servers come with redundant power supplies as an option. I certainly wouldn’t specify a server for mission-critical database data without redundant PSUs.

    B

  22. Jignesh presented on a similar topic at the PostgreSQL conference this weekend. Slides are at http://blogs.sun.com/jkshah/entry/effects_of_flash_ssd_on

  23. Oh, and the broad consensus in the room was “these things are worth a lot less if they lie to the operating system about durability, and a UPS is not an acceptable workaround” :-)

  24. Somewhat after the fact I realized that this information relates closely to the SSD tests I did with PostgreSQL, reported here: . With a plain dd test I get about a 50% performance loss when turning the write cache off on the X25-E. That’s much better than the more-than-fourfold loss you are observing (if I parse your numbers right). I’m planning to do a pgbench test later, which might provide more insight.

  25. Peter [Eisentraut], I’ve been following your benchmark blog posts with a lot of interest, too :-)

  26. andor says:

    Hi Vadim,

    I know you published this post a long time back, but I have a really basic question: how do you turn off the write cache on the X25-E? I’ve searched everywhere for how to do it, but could only find studies showing the performance drop after doing it. I used sdparm in the following fashion:

    sudo sdparm -s WCE=0 -S /dev/sdc

    but get an error like this:

    /dev/sdc: ATA SSDSA2SH032G1GN 045C
    change_mode_page: mode page indicates it is not savable but
    '--save' option given (try without it)

    The error seems to suggest we cannot switch the X25-E’s cache, but we know that’s not the case.

    I believe I am asking this question on the wrong forum, but I thought you might have a ready solution for me.

    Thanks
    andor.

  27. Vadim says:

    Andor,

    hdparm -W 0 /dev/sdb

    works for me.

    Output:

    /dev/sdb:
    setting drive write-caching to 0 (off)
    write-caching = 0 (off)

  28. andor says:

    Vadim, that’s awesome – that did the trick for me! It says the cache is switched off. I was tinkering with the sdparm utility because the device file ‘sdc’ starts with ‘s’ (one of my friends suggested this; he said that since we connected the SSD to a SAS controller, we should be using sdparm – I have no idea what this SAS controller is). Thanks a lot for the input.

  29. Robert says:

    I know this post is quite old, but it would be great to know the firmware version of the Intel X25-E SSD you tested.

  30. Evan Jones says:

    Note that according to my tests, the Intel X25-M G2 loses data even when the write cache is disabled. I have not tested the X25-E, but I suspect this may also be an issue there? See the following for more info:

    http://evanjones.ca/intel-ssd-durability.html

  31. SAB says:

    Sorry for digging this post up, but did you try those tests with sysbench? I made a few tests with the write cache on and off, but I don’t have an SSD to compare the results against.
