SSD, XFS, LVM, fsync, write cache, barrier and lost transactions


We finally managed to get an Intel X25-E SSD drive into our lab, and I attached it to our Dell PowerEdge R900. The story of getting it running deserves a separate mention: along with the Intel X25-E I got a HighPoint 2300 controller, and CentOS 5.2 simply could not boot with two RAID controllers (Perc/6i and HighPoint 2300). The problem was solved by installing Ubuntu 8.10, which now runs the whole system. Originally I wanted to publish some nice benchmarks where InnoDB on SSD outperforms RAID 10, but recently I ran into an issue that may make those results inconsistent.

In short: using the Intel X25-E SSD with the write cache enabled (the default and the fastest mode) does not guarantee that all committed InnoDB transactions reach permanent storage.
I am having some déjà vu here, as Peter raised this five years ago (http://lkml.org/lkml/2004/3/17/188) for regular IDE disks, and I did not expect the question to pop up again.

The long story:
I started by putting XFS on the SSD and running a very primitive test: INSERT INTO fs VALUES (0) into an auto-increment column of an InnoDB table. The InnoDB parameters are:
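The exact configuration block did not survive in this copy of the post; a minimal my.cnf fragment containing the two settings discussed below would look like this (the section header and formatting are illustrative, only the two values are from the post):

```
[mysqld]
innodb_flush_log_at_trx_commit = 1        # flush and sync the log on every commit
innodb_flush_method            = O_DIRECT # bypass the OS page cache for data files
```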

The most interesting ones are innodb_flush_log_at_trx_commit=1 and innodb_flush_method=O_DIRECT (I also tried the default innodb_flush_method, with the same result). With innodb_flush_log_at_trx_commit=1 I expect to keep every committed transaction even in case of a system failure.

Running this test with default XFS settings, I saw the SSD doing 50 writes/s, which forced me to check the results several times – come on, it's an SSD, we should get much more IO than that. The investigation led me to the barrier/nobarrier mount options, and with -o nobarrier I got 5300 writes/s. A nice difference, and this is what we want from an SSD.
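For reference, the two mounts compared here would look like this (the device and mount point are illustrative, not from the post):

```
# default XFS mount: write barriers enabled (durable, but ~50 writes/s in this test)
mount -t xfs /dev/sdc1 /mnt/ssd

# barriers disabled: ~5300 writes/s, but unsafe with the drive's write cache on
mount -t xfs -o nobarrier /dev/sdc1 /mnt/ssd
```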

Now, to test durability, I pull the power from the SSD and check how many transactions were really stored – and there is the second surprise: the last N committed transactions are missing.

So now it is time to turn off the write cache on the SSD – all transactions are in place now, but write speed is only 1200 writes/s, which is comparable with RAID 10.
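On a SATA-attached drive the volatile write cache can be toggled with hdparm (the device name is illustrative; note that on many drives the setting does not survive a power cycle, so it has to be reapplied at boot):

```
hdparm -W 0 /dev/sdc   # turn the drive's write cache off
hdparm -W /dev/sdc     # query the current write-caching setting
```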

So, in conclusion: to guarantee durability with this SSD we have to disable the write cache, which can affect performance significantly (I have no results on hand yet; it remains to be tested).

What about LVM? Well, we often recommend LVM for backup purposes (even if recent results are bad, we have no good replacement yet), so I tried XFS on top of LVM. With the write cache ON and default mount options (i.e. with barriers) I got 5250 writes/s. This is because LVM ignores write barriers (see http://dammit.lt/2008/11/03/xfs-write-barriers/), but again, with the write cache enabled you may lose transactions.
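A quick way to check whether XFS actually got barriers on a given device is to look at the kernel log right after mounting; on kernels of this era XFS prints a warning when the underlying device (such as an LVM volume) rejects them (exact wording varies by kernel version):

```
dmesg | grep -i barrier
# e.g.: Filesystem "dm-0": Disabling barriers, not supported by the underlying device
```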

So, the final conclusions:
1. The Intel X25-E is NOT durable in its default mode.
2. To get durability we need to disable the write cache (with a corresponding performance penalty; how big, we still need to test).
3. A possible solution could be to put the SSD behind a RAID controller with a battery-backed write cache, but I am not sure which ones are good – another area for research.
4. XFS without LVM enables the barrier option by default, which decreases write performance a lot.


Comments

  1. says

    It would be very interesting to check with Intel what their stand is on this.

    These are supposed to be “Enterprise” drives, so it is very strange that they ship with such an unsafe option as the default. It would also be interesting to check what goes over the SATA wire in this case – does the drive really ignore both the “do not cache” flag, which should be set for O_DSYNC/O_DIRECT writes, and the cache flush that should be issued with fsync?

  2. Vadim says

    rocky,

    it’s a very simple PHP script:

    <?php
    // connect through the local socket
    mysql_connect("localhost:/tmp/mysql.sock", "root", "");
    mysql_query("use test");

    // insert into the auto-increment column in an endless loop
    while (true) {
        mysql_query("INSERT INTO fs VALUES (0)");
        echo mysql_insert_id();
        echo "\n";
    }
    ?>

    for writes/s you can watch iostat -dx 5
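    The test table itself is not shown in the thread; a definition consistent with the INSERT above would be something like this (the layout is assumed, not from the post):

```sql
CREATE TABLE fs (
  id INT NOT NULL AUTO_INCREMENT PRIMARY KEY
) ENGINE=InnoDB;
```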

  3. Sean says

    Vadim, do you have any tests related to read-oriented queries, i.e. 90%+ reads, using InnoDB and MyISAM?

    -sean

  4. Vadim says

    Sean,

    Are you asking about reads on SSD or in general?
    We have a lot of numbers, we just need to sort them out :)

  5. Sean says

    Hi Vadim,

    Yes, I’m interested in reads on SSD for both InnoDB and MyISAM, though primarily MyISAM, given that our environment houses static data (compressed tables). I’ve read the papers and had presentations from vendors, but have not had the opportunity to get some in-house. I’m curious how they perform in the described situation, but also about the TCO and how to save on server hardware. Given that SSDs respond to reads within 0.2–0.4 ms, does a server still need 32, 64, 128G of memory?

    Sean

  6. TS says

    @Vadim:

    2. To have durability we need to disable write cache. (Basically reducing random write IOPS from 5000+ to 1200ish.)

    Is this issue XFS-specific, MySQL-specific, or is it hardware? Does it happen with ext3 under Linux? If it is a filesystem or MySQL issue, then it is not too bad. If it is a hardware issue, Intel has basically done false advertising. The enterprise market requires durable write IOPS. If it is indeed a hardware issue, there is no reason to pay the ridiculous price of the X25-E if you are forced to disable its write cache and accept 1200 random write IOPS instead of 5K+. I can foresee a drive recall, or at least a huge price reduction, followed by a PCB revision with a supercapacitor-backed design.

    This issue must be broadcast to the entire SQL database community. I am sure a lot of people are picking the X25-E for their DB servers right now, and they might be risking their data. Preferably, Intel should give an official word on this issue too.

  7. Vadim says

    TS

    Currently I believe this is a hardware issue, not filesystem or MySQL. It seems the Intel X25-E has no battery to protect its write cache and simply loses it on power failure. To claim this with 100% certainty we need to run some more experiments, but for now I would not put a database that requires 100% durability on it.

  8. Andi Kleen says

    Sorry, but you realize that nobarrier is the likely cause of the data loss, right? With barriers, XFS fsync (but not necessarily ext3 fsync) would wait for a write barrier on the log commit, and thus also for the data. O_SYNC might be different though.

    Basically, you specified the “please go unsafe but faster” option and are then complaining that it is actually unsafe.

    I would recommend doing the power-off test without nobarrier but with the write cache on.

    -Andi

  9. Vadim says

    Andi,

    I wrote that in the post. With barriers and the write cache we get 50 writes/s, which I consider not just “slower” but a disaster I would not put on a production system.

  10. Andi Kleen says

    It’s the cost of hitting the media. An unsafe setup like the one you chose is always faster.

    BTW, LVM has been fixed in 2.6.29: if you have only a single backing disk it will pass barriers through. Still not for the multiple-disk case, which is questionable.

  11. Vadim says

    Andi,

    This cost is way too big. In this case, RAID 10 on 8 disks + BBU is cheaper and gives much better results.
    That simply means SSD can’t be used as the medium for high-performance durable databases.

  12. B says

    @ Justmy2cents,

    I totally agree. Who isn’t running UPSes on their DB servers? Even an old one that could hold the system up for 30 seconds would be enough, as long as further transactions didn’t keep coming in during that time.

    B

  13. says

    A UPS isn’t the whole solution. What happens when your power supply (the one inside the server) fails, for example? “Keep the power from going off” and “keep the server from crashing” are not the same thing as “I want this device not to lie when I ask for this data to be written to durable storage.”

  14. B says

    True, but even lower-cost servers come with redundant power supplies as an option. I certainly wouldn’t spec a server for mission-critical database data without redundant PSUs.

    B

  15. says

    Oh, and the broad consensus in the room was “these things are worth a lot less if they lie to the operating system about durability, and a UPS is not an acceptable workaround” :-)

  16. says

    Somewhat after the fact I realized that this information relates closely to the SSD tests I did with PostgreSQL, reported here: . With a plain dd test I see about a 50% performance loss when turning the write cache off on the X25-E. That’s much better than the more-than-fourfold loss you are observing (if I parse your numbers right). I’m planning to do a pgbench test later, which might provide more insight.

  17. andor says

    Hi Vadim,

    I know that you published this thread a long time back, but I have a really basic question: how do you turn off the write cache on the X25-E? I’ve searched everywhere for how to do it, but could only find studies showing the performance drop after doing it. I used sdparm in the following fashion:

    sudo sdparm -s WCE=0 -S /dev/sdc

    but get an error like this:

    /dev/sdc: ATA SSDSA2SH032G1GN 045C
    change_mode_page: mode page indicates it is not savable but
    '--save' option given (try without it)

    The error seems to suggest we cannot switch the X25-E’s cache... but we know that’s not the case.

    I am probably asking this question on the wrong forum, but I thought you might have a ready solution for me.

    Thanks
    andor.

  18. andor says

    Vadim, that’s awesome, that did the trick for me! It says the cache is switched off. I was tinkering with the sdparm utility because the device file ‘sdc’ starts with ‘s’ (one of my friends suggested this; he said that since we connected the SSD through a SAS controller, we should be using sdparm – I have no idea what this SAS controller is). Thanks a lot for the input.

  19. Robert says

    I know this posting is quite old, but it would be great to know the firmware version of the Intel X25-E SSD you tested.

  20. says

    Sorry for digging this post up, but did you try those tests with sysbench? I made a few tests with the write cache on and off, but I don’t have an SSD to compare the results against.
