SSD, XFS, LVM, fsync, write cache, barrier and lost transactions

March 3, 2009

Author

Vadim Tkachenko

Hardware and Storage

We finally managed to get an Intel X25-E SSD drive into our lab. I attached it to our Dell PowerEdge R900. The story of getting it running is worth a separate mention: along with the Intel X25-E I got a HighPoint 2300 controller, and CentOS 5.2 just could not start with two RAID controllers (Perc/6i and HighPoint 2300). The problem was solved by installing Ubuntu 8.10, which is what the system currently runs. Originally I wanted to publish some nice benchmarks where InnoDB on SSD outperforms RAID 10, but recently I ran into an issue which can make the previous results inconsistent.

In short, using the Intel X25-E SSD with write cache enabled (which is the default and the highest-performance mode) does not guarantee that all InnoDB transactions are stored on permanent storage.
I am having some déjà vu here, as Peter raised this 5 years ago (http://lkml.org/lkml/2004/3/17/188) regarding regular IDE disks, and I did not expect this question to pop up again.

The long story is:
I started by putting XFS on the SSD and running a very primitive test: INSERT INTO fs VALUES(0) into an auto-increment field of an InnoDB table. The InnoDB parameters are:

innodb_buffer_pool_size=3G
innodb_data_file_path=ibdata1:10M:autoextend
innodb_file_per_table=1
innodb_log_buffer_size=8M
innodb_log_files_in_group=2
innodb_log_file_size=256M
innodb_thread_concurrency=0
innodb_flush_log_at_trx_commit=1
innodb_flush_method             = O_DIRECT

The most interesting ones are actually innodb_flush_log_at_trx_commit=1 and innodb_flush_method = O_DIRECT (I also tried the default innodb_flush_method, with the same result). With innodb_flush_log_at_trx_commit=1 I expect to keep all committed transactions even in case of a system failure.

Running this test with the default XFS settings, I saw the SSD doing 50 writes/s, which was surprising enough that I checked the results several times. Come on, it's an SSD, we should get much more IO than that. Investigation led me to the barrier/nobarrier mount options, and after mounting with -o nobarrier I got 5300 writes/s. A nice difference, and this is what we want from an SSD.
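The write rates above are bounded by how quickly the storage stack acknowledges a synchronous flush. The following is a minimal sketch of that kind of measurement, not the PHP script used for this post: it appends small blocks to a temporary file and calls fsync after each one. The function name `fsync_rate` and its parameters are illustrative assumptions, and the number it reports depends on the filesystem, mount options and drive cache settings of the directory it is pointed at.

```python
import os
import tempfile
import time


def fsync_rate(directory, writes=200, block_size=512):
    """Append `writes` small blocks, calling fsync after each one.

    Returns the observed number of fsync-ed writes per second.
    """
    payload = b"x" * block_size
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        for _ in range(writes):
            os.write(fd, payload)
            # Ask the OS to flush this file's data to the device. Whether the
            # device then holds it in a volatile cache is outside our control.
            os.fsync(fd)
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)
    return writes / elapsed


if __name__ == "__main__":
    # Hypothetical usage: replace the directory with one on the filesystem under test.
    print(f"{fsync_rate(tempfile.gettempdir()):.0f} fsync-ed writes/s")
```

A high number here only says that flushes are acknowledged quickly. It says nothing about whether the data survives a power cut.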

Now, to test durability, I pull the power from the SSD and check how many transactions were really stored. And here is the second bummer: the last N committed transactions are missing.
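The check after the power cut boils down to comparing what the client was told had committed with what the table contains after recovery. A minimal sketch of that comparison, assuming the auto-increment ids start at 1 with no gaps and that the client recorded the last acknowledged id somewhere that survives the power cut (the function name and both assumptions are illustrative, not taken from the original test script):

```python
def lost_transactions(last_acked_id, ids_after_restart):
    """Return the ids that were acknowledged as committed but are missing after recovery.

    last_acked_id: highest auto-increment id for which the client saw a successful commit.
    ids_after_restart: ids actually present in the table after the server comes back.
    """
    survived = set(ids_after_restart)
    return [i for i in range(1, last_acked_id + 1) if i not in survived]


# Hypothetical example: the client saw 10 commits, but only rows 1..7 survived.
missing = lost_transactions(10, range(1, 8))
print(f"{len(missing)} committed transactions lost: {missing}")
```

An empty result means every acknowledged commit reached permanent storage. Any non-empty result is a durability violation under innodb_flush_log_at_trx_commit=1.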

So now it is time to turn off the write cache on the SSD. All transactions are in place now, but the write speed is only 1200 writes/s, which is comparable with RAID 10.

So in conclusion, to guarantee durability with the SSD we have to disable the write cache, which can affect performance results significantly (I have no results at hand, but it is to be tested).

What about LVM? Well, we often recommend using LVM for backup purposes (even though recent results are bad, we have no good replacement yet), so I tried XFS on top of LVM. With write cache ON and default mount options (i.e. with barriers) I get 5250 writes/s. This is because LVM ignores write barriers (see http://dammit.lt/2008/11/03/xfs-write-barriers/), but again, with write cache enabled you may lose transactions.

So, the final conclusions:
1. The Intel X25-E SSD is NOT reliable in its default mode
2. To have durability we need to disable the write cache (with a resulting performance penalty; how much is still to be tested)
3. A possible solution could be to put the SSD behind a RAID controller with battery-backed write cache, but I am not sure which ones are good; that is another area for research
4. XFS without LVM uses the barrier option by default, which decreases write performance a lot
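Putting the measured rates side by side makes the trade-off concrete. The figures below are the ones reported in this post; the labels are shorthand, and the mount options used for the cache-off run are not stated, so that entry is labeled only by the cache setting.

```python
# Write rates (writes/s) as reported in the post; labels are shorthand.
rates = {
    "XFS, barrier, write cache on": 50,
    "XFS, nobarrier, write cache on": 5300,
    "XFS on LVM, barrier, write cache on": 5250,
    "write cache off": 1200,  # mount options for this run are not stated
}

fastest = max(rates.values())

# Slowdown factor of each configuration relative to the fastest one.
slowdown = {name: fastest / rate for name, rate in rates.items()}

for name, factor in sorted(slowdown.items(), key=lambda item: item[1]):
    print(f"{name:38s} {rates[name]:5d} writes/s  {factor:6.1f}x slower than fastest")
```

Only the two slowest configurations kept all committed transactions in these tests, so the durable options cost roughly a factor of four (cache off) to a factor of a hundred (barriers with cache on) relative to the unsafe default.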

33 Comments

Admin

Peter Zaitsev

17 years ago

It would be very interesting to check with Intel what their stand is on this.

These are supposed to be “Enterprise” drives, so it is very strange that they come with such an unsafe option as the default. It would also be good to check what goes over the SATA wire in this case: do these drives really ignore both the “do not cache” flag, which should be set for O_DSYNC/O_DIRECT writes, and the cache flush, which should be passed with fsync?

rocky

17 years ago

What kind of tool did you use for testing writes? And which parameters?

Author

Vadim Tkachenko

17 years ago

rocky,

it’s very simple PHP scripts

for w/s you can see iostat -dx 5

Phil

17 years ago

Did you try aligning the blocks in the FS to the blocks of the SSD?

http://thunk.org/tytso/blog/2009/02/20/aligning-filesystems-to-an-ssds-erase-block-size/

Sean

17 years ago

Vadim, do you have any tests related to read oriented queries, ie 90%+ using innodb and myisam?

-sean

Author

Vadim Tkachenko

17 years ago

Phil,

No I did not try that – do you think it will help with fsync ?

Author

Vadim Tkachenko

17 years ago

Sean,

Are you asking about reads on SSD or in general ?
We have a lot of numbers, just need to sort it out 🙂

Sean

17 years ago

Hi Vadim,

Yes, I’m interested in reads on SSD for both innodb and myisam, though primarily myisam given our environment houses static data (compressed tables). I’ve read the papers and had presentations from vendors, but have not had the opportunity to get some in-house. I’m curious to know how they performed in the described situation, but also the TCO, how to save on server hardware. Given SSD’s respond for reads within .2-.4 ms, does a server need 32,64,128G of memory anymore?

Sean

Theodore Tso

17 years ago

FYI, starting in 2.6.29, LVM will start respecting write barriers. (finally!)

Mark Callaghan

17 years ago

@Sean: there is a great paper on where to spend money (RAM, Flash, disk) and it addresses your question: spend more on Flash and less on RAM. I have not seen anyone publish benchmark results based on the ideas in the paper: http://www.cs.cmu.edu/~damon2007/pdf/graefe07fiveminrule.pdf

17 years ago

@Vadim Tkachenko:

2. To have durability we need to disable write cache. (Basically reducing random write IOPS from 5000+ to 1200ish.)

Is this issue XFS specific or MySQL specific or is it hardware? Does it happen in EXT3 under Linux? If it is a file system issue or MySQL issue, then it is not too bad. If it is a hardware issue, basically Intel just did false advertising. The enterprise market requires write IOPS durability. If it is indeed a hardware issue, there is no reason why people would pay for the ridiculous price of the X25-E if you are forced to disable the write cache on it and accept 1200 random write IOPS compared to 5K+ random write IOPS. I can foresee a drive recall or at least a huge price reduction for the drive then followed by a PCB revision with supercapacitor backed PCB design.

This issue must be broadcast to the entire SQL database community. I am sure a lot of people are indeed picking the X25-E on the DB servers right now and they might be risking their data. Preferably, Intel must give an official word on this issue too.

Author

Vadim

17 years ago

Currently I believe this is a hardware issue, not a filesystem or MySQL one. It seems the Intel X25-E does not have a battery to back the write cache and just loses its contents on power failure. To claim this with 100% certainty we need to run more experiments, but for now I would not put a database that requires 100% durability on it.

Andi Kleen

17 years ago

Sorry, but you realize that nobarrier is the likely cause for the data loss, right? With barriers
XFS fsync (but not necessarily ext3 fsync) would wait for a write barrier on the log commit, and thus
also for the data. O_SYNC might be different though.

Basically you specified the “please go unsafe but faster” option and then complain that it is
actually unsafe.

I would recommend to do the power off test without nobarriers but write cache on.

-Andi

Author

Vadim Tkachenko

17 years ago

Andi,

I wrote that in the post. With barriers and write cache on we have 50 writes/s, which I consider not “just slower” but a disaster that I would not put on a production system.

Andi Kleen

17 years ago

It’s the cost of hitting the media. Unsafety like you chose is always faster.

BTW LVM has been fixed in 2.6.29: if you only have a single backing disk it will pass through barriers. Still not for
the multiple disk case which is questionable.

Author

Vadim Tkachenko

17 years ago

Andi,

This cost is way too big. In this case RAID 10 on 8 disks + BBU is cheaper and gives much better results.
That simply means SSD can’t be used as media for high performance durable databases.

justmy2cents

17 years ago

Why not just add a UPS to your system?

17 years ago

@Justmy2cents,

I totally agree. Who isn’t running UPS’s on their DB servers? Even an old one that could hold the system on for 30 seconds would be enough, as long as further transactions didn’t keep coming in during that time.

Mark Callaghan

17 years ago

When will Amazon provide a UPS on EC2 servers?

Baron Schwartz

17 years ago

A UPS isn’t the whole solution. What happens when your power supply (the one inside the server) fails, for example? “Keep the power from going off” and “keep the server from crashing” are not the same thing as “I want this device not to lie when I ask for this data to be written to durable storage.”

17 years ago

True, but even lower cost servers come with redundant power supplies as an option. I certainly wouldn’t specify a server for mission critical database data without redundant PSU’s.

Baron Schwartz

17 years ago

Jignesh presented on a similar topic at the PostgreSQL conference this weekend. Slides are at http://blogs.sun.com/jkshah/entry/effects_of_flash_ssd_on

Baron Schwartz

17 years ago

Oh, and the broad consensus in the room was “these things are worth a lot less if they lie to the operating system about durability, and a UPS is not an acceptable workaround” 🙂

http://petereisentraut.blogspot.com/

16 years ago

Somewhat after the fact I realized that this information relates closely with the SSD tests I did with PostgreSQL, reported here: . With a plain dd test I get about 50% performance loss when turning the write cache off on the X25-E. That’s much better than the more than fourfold loss you are observing (if I parse your numbers right). I’m planning to do a pgbench test later, which might provide more insight.

http://petereisentraut.blogspot.com/

16 years ago

I meant http://petereisentraut.blogspot.com/2009/07/solid-state-drive-benchmarks-and-write.html … funny markup …

Baron Schwartz

16 years ago

Peter [Eisentraut], I’ve been following your benchmark blog posts with a lot of interest, too 🙂

andor

16 years ago

Hi Vadim,

I know that you published your thread long time back.. but I have a really really basic question. How do you turn off write cache on X25-E? I’ve searched everywhere how to do it but could only find studies showing performance drop after doing it. I used sdparm in the following fashion:

sudo sdparm -s WCE=0 -S /dev/sdc

but get an error like this:

/dev/sdc: ATA SSDSA2SH032G1GN 045C
change_mode_page: mode page indicates it is not savable but
'--save' option given (try without it)

The error seems to suggest we cannot switch the X25-E’s cache.. but we know thats not the case.

I am asking this question on the wrong forum I believe, but I thought you might have a ready solution for me..

Thanks
andor.

Author

Vadim Tkachenko

16 years ago

Andor,

hdparm -W 0 /dev/sdb

works for me.

Output:

/dev/sdb:
setting drive write-caching to 0 (off)
write-caching = 0 (off)

andor

16 years ago

Vadim, that's awesome.. that did the trick for me! It says the cache is switched off .. I was tinkering around with the sdparm utility as I thought the device file ‘sdc’ starts with s. (One of my friends suggested this. He said that since we connected the SSD with a SAS controller, we should be using sdparm. I have no idea what this SAS controller is.) Thanks a lot for the input.

Robert

16 years ago

I know this posting is quite old, but it would be great to know the firmware of the Intel X25-E SSD you tested?

Evan Jones

15 years ago

Note that according to my tests, the Intel X25-M G2 loses data even when the write cache is disabled. I have not tested the X25-E, but I suspect this may also be an issue there? See the following for more info:

http://evanjones.ca/intel-ssd-durability.html

SAB

14 years ago

Sorry for digging this post up, but did you try those tests with Sysbench? I made a few tests with write cache on and off but I don’t have an SSD to compare the results.

Angel Genchev

10 years ago

It would be interesting to see how Samsung’s “Professional” SSD performs here: the Samsung 850 Pro with 3D V-NAND, which as of 2014-2015 is among the best SSDs.
