September 3, 2014

Inexpensive SSDs for Database Workloads

nexpensive SSDs for Database WorkloadsThe cost of SSDs has been dropping rapidly, and at the time of this writing, 2.5-drives have reached the 1TB capacity mark.  You can actually get inexpensive drives for as little as 60 cents per GB. Even inexpensive SSDs can perform tens of thousands of IOPs and come with 1.5M – 2M hous MTBF and a 5-year warranty: check out the Intel SC S3500 specs as an example. There is however one important factor you need to take into account when considering  SSDs as opposed to conventional hard drives – Write Endurance.

Many of us have heard about SSDs having limits in terms of how many writes SSDs can handle, many however assume this is what is already accounted for in the warranty period and so if the hard drives claim to have sequential write speed of 450MB/sec and a warranty of 5 years we expect it to be able to sustain this level of writes for the whole period of time…. Others may look at the computations with 50GB or 100GB per day, such as here, giving us an even better feel of 30-60 years of life time.

In addition to warranty you should be looking at “Write Endurance Rating” which is 450TBW (stands for TB Written) for the given model. Note this number may be defined a little bit differently as the amount of writes a drive needs to do internally will depend on the types of writes – sequential writes having less wear than random ones.

Endurance, not the performance, which might be already more than enough for your workload, is a key factor in which consumer-grade and the enterprise-grade SSDs are different. For example if we take a look at a bit larger 1.2TB FusionIO Card we see Write Endurance of 16.26PB which is some 35 times more – so handling the same amount of writes the card would live 35 times longer.

So how much of these 450TB is really used? Well if you look at the specified write rate of 450MB/sec it will just last for 1M seconds which is less than 2 weeks. This is a shockingly low number and I think few people realize how low it really is. If we look at the “Random IO” number instead which is specified at some 11K IOPs or we can check Vadim’s numbers which give us about 120MB/sec for 16KB writes typical for Innodb – we’re going to be looking at a couple of month of life times of the full write load.

There is more interesting math you can do. Those 100GB a day which sounds like a lot is just 1.2MB/sec on average – hundreds of times less than a peak write performance device can sustain. Now lets see what happens if I would like to have my hard drive with 450TBW last for 3 years. This means I can afford to write about 400GB/day or roughly 5MB/sec on average. Frankly as little as it sounds it is not that bad. If you’re looking at random writes for conventional hard drives they can do roughly 150 truly random writes per second which is just 2.5MB/sec using 16KB innodb page size. This means you can drive the SSD drive on the 2x random write performance conventional hard drive will peak out for 3 year before it is expected to fail.

What we can learn from these numbers? If you’re looking for inexpensive SSDs as replacement for conventional drives for database workloads you really need to be looking at endurance and the write rate for your workload. For these kinds of SSDs you should be looking at them as able to provide you with very fast reads, being able to handle burst writes of pretty high volumes and being able to sustain reasonable life time with medium write load (though still better than conventional hard drives). This is quite good already.

So how can you evaluate whenever your workload fits well for such SSDs?  Measure the amount of writes for a day or week (many monitoring applications can show you that) to see how much you are really writing and use it to see how much life time you can get with such a write ratio. You can also use this number to see how much more of a load (compared to conventional hard drives) you can put on the system – remember chances are you will be endurance and life time limited more than performance limited. If your writes are really high, measured in TBs a day or higher, you will be better off with more expensive enterprise storage drives that have a much higher endurance rating.

Specifically if you have Percona Monitoring Plugins already installed you can use the rrdtool to get the data for long period of time by running something like rrdtool fetch rrddata.rrd AVERAGE –start -1w | awk ‘{ SUM += $2 } END {print SUM }’

The second thing you should do is actually measure reality. Because of the Amplification of writes inside SSDs depend on workload, the actual amount of writes the system handles might be more or less. Many vendors provide some SMART attributes that provide you with insight into how the drive is doing. For example Samsung exposes ID 177 as Wear Level Count where Crucial/Micron expose S.M.A.R.T ID 173 as “Average Block Erase Count”. These are very helpful both for checking how drives are doing as well as planning purposes.

Lets finally look at the database – anything we can do to reduce amount of writes our database is doing supporting our application?
Yes! There are both things you can do at application level side as well as on database side.

Optimizing Application to reduce writes to SSD
There are many things you can do from application side to reduce writes to SSD, here are some examples:

Data Clustering – use Innodb’s clustering by primary key to your advantage so when you do modification as small number of pages as possible is modified.
Normalization – Normalizing data often means you just change one row in one place instead of many rows in many places as it happens with de-normalized environments. SSDs have much faster random reads so more normalized data structures may perform well.
Buffering - Buffer updates in the application or some in memory store where possible to write back to the database periodically. Redis, memcached are common solution for this.
Data Diet - Put your data on diet to see how you can use efficient data types, only needed indexes etc – less data to reflect same amount of information means less updates.

Optimizing Database to reduce writes to SSD
There are a lot of things you can do on the database size as well:

Mix HDD and SSD – HDD when paired with RAID with BBU (Write Cache) can perform writes very well if they go to the small area on disk or sequential. You can use it to put some of the data to hard drives instead of SSD to reduce amount of writes to SSD.
Store DoubleWrite Buffer on HDD - Double Write Buffer is a killer for SSDs essentially doubling amount of writes for data changed in your tablespaces. At the same time this is a very small area (few MB in size) which is constantly updated as such it can very well be on the HDD as long as you have the BBU Cache. In Percona Server you can keep Double Write Buffer in the Separate File while in any MySQL version you can simply put your system tablespace on HDD. For most workloads it will be small enough to be cached.
Store Transactional Logs on HDD – Innodb Transactional Logs is another candidate for HDD – they are written sequentially so HDDs+ RAID BBU cache may well keep up very well with your load.
Store Binary Logs on HDD – Binary Logs… another object which is read and written sequentially as such it does not need to be on SSD.
Temporary Space in Memory – Temporary space (tmpdir) can get a lot of writes and you do not want them to be hitting your SSD. The best is to use tmpfs file system which allocates it in memory (or swap file if there is not enough memory) unless you have extremely large temporary space needs.
Large Buffer Pool Size -  The larger your bufferpool hopefully more dirty pages you can keep and as such more writes will come to already dirty pages effectively eliminating IO to SSD. So even if SSDs means you often can do with smaller memory amounts to achieve performance you need you might still want memory to increase your SSD life time.
Large Innodb Log Files – The larger are your Innodb Log files the more you can delay writes and as such the more logical writes can be merged into the single physical write. SSDs are fast so you often can get crash recovery speed you need with a lot larger log files than you had on HDDs. The combined log files size of 10GB may well work for your application.
Innodb Compression – Innodb Compression reduces amount of data written to tablespace hence can increase life time. Note in some MySQL versions compression increases amount of space written to Innodb Log Files, in this case it is good combined with Innodb Log Files being on Hard drive.
Alternative Storage Engines – TokuDB seems to be offering higher compression ratio and more sequential writes so it might be better choice for your application from SSD life standpoint. It would be interesting to see some real life comparisons in how much writes both engine generate for same workload.

Summary
Depending on your application, inexpensive SSDs might be good fit for your workload, though enterprise-grade flash storage might be a good choice for very write-intensive applications even if it costs a lot more due to a much higher endurance capacity. Make sure you understand how many writes your application does and keep thinking about how to optimize your system to do less writes with SSDs – as reducing writes is a key to a longer SSD lifetime and as such lowers costs.

About Peter Zaitsev

Peter managed the High Performance Group within MySQL until 2006, when he founded Percona. Peter has a Master's Degree in Computer Science and is an expert in database kernels, computer hardware, and application scaling.

Comments

  1. Nils says:

    Would be great to have some sort of persistent RAM Disk to store the double write buffer. All you’d really need is a small device with a bit of flash, similarly sized RAM and a capacitor with some logic to flush the data from RAM to Flash as soon as the power is cut. Hard drive(s) + RAID Controller + BBU can easily run you another $800, not to mention you’d need a larger case and it increases energy usage.

  2. James says:

    There is one, STEC’s ZeusRAM. However it is only 8GB.

  3. Nils says:

    8 GB should be more than sufficient for the double write buffer. The real show stopper is the price.

  4. Since most SSDs are using data deduplication I’m not sure if DoubleWrite Buffer on SSD is so bad or it is cost-free.

  5. Leif Walsh says:

    “It would be interesting to see some real life comparisons in how much writes both engine generate for same workload.”

    This is a purely synthetic workload and it’s a comparison with MongoDB, not InnoDB, but it is a fairly straightforward comparison of B-trees and Fractal Trees in terms of cumulative IOPS (the second graph):
    http://www.tokutek.com/2013/06/iibench-benchmark-tokumx-vs-mongodb/

    I can’t think of any other graphs we have that measure this particular quantity, but Tim might have some.

  6. Jordan,

    Interesting question. Do you have any idea how we can test it ?
    From what I see from Flash vendors is – they would love to get rid of those doublewrite writes, for example FusionIO invested in directFS and special support for it in Percona Server and MariaDB at large extent to get doublewrite buffer out of the picture with guaranteed atomic writes.

  7. Leif: IOs is only part of the picture we need amount of data written too when it comes to SSDs. As I understand TokuDB does a lot larger IO size compared to Innodb. I would expect even more so for MongoDB. But yes there is the difference it would be good to see how much…. Also I’d love to see data for more than inserts. I’d expect updates to be really interesting.

  8. Justin Swanhart says:

    Putting SSD on slaves can be a great way to improve performance without risking your data or your investment. Write it to traditional HDD RAID10 with a battery backed write-back cache on the master.

    Putting the SSD on the slaves has the added benefit that they can sustain higher IO rates than the master, and catch up faster, especially with a cold buffer pool. On percona server, on the slaves you can move the doublewrite buffer, binary logs and redo logs over to the HDD and set innodb_flush_log_at_trx_commit=0 as long as you use crash-safe replication.

  9. Justin Swanhart says:

    We could get rid of the doublewrite buffer if old and new changes are in the log, right? The only reason it is needed is for torn pages and with old pages in the log that is a non-issue. This could be optional, you could use doublewrite, double-log or none.

  10. Gerry says:

    Peter, I believe this other benchmark on Tokutek’s site answers your question: http://www.tokutek.com/resources/benchmark-results/benchmarks-vs-innodb-flash/#WearLife.

    In this test we limited the number of database operations and measured the resulting IO on disk.

  11. Gerry,

    Thank you. Indeed very good results. I would love to see more benchmarks from TokuDB which include some updates and deletes (when it comes to writes) for example one very typical workload would be to have an INSERT ON DUPLICATE KEY UPDATE with say 80% updates 20% inserts which is very common case for “logging” applications to maintain summary data.

  12. Bradley C Kuszmaul says:

    I calculated not only the number of I/Os but the total number of bytes written. The reduction in number of writes is a factor of 17, so you are looking at *at least* a factor of 17 more write endurance.

    It’s probably actually better than that for inexpensive SSD. TokuDB writes relatively large blocks: if we assume that the writes are 1MB and that the erase block size for the SSD is close to that size, then the FTL can achieve far less write amplification. In contrast, when writing 16KB blocks to a an SSD with 1MB erase blocks, with say a 75% space utilization, then there is another factor of 4 of write amplification. As far as I can tell, the very high end devices, such as fusion I/O have huge effective erase blocks (maybe 1/4 GB), so you don’t get this particular advantage for a high-end device.

    The calculations were presented at
    http://www.tokutek.com/2012/09/three-ways-that-fractal-tree-indexes-improve-ssd-for-mysql/

  13. Bradley,

    Thank you 17x is very good number! I wonder whenever your large block sizes will increase amount of writes for random updates, would not they ?

  14. Another metric to consider is availability. There are a lot of great devices coming onto the market. It hasn’t been easy to find them when trying to purchase a few devices at a time.

  15. Bradley C Kuszmaul says:

    The benchmarks shown are iiBench, which induce random inserts into the indexes. Random updates would have essentially the same write bandwidth as random inserts: in TokuDB, an update is implemented by simply writing the new row, so it’s basically an insert. Is that what you are asking?

  16. Bradley,

    OK. Makes sense. Thank you for explanation.

  17. Bradley,

    Two questions to TokuDB.

    1. Some SSD on 1TB size we test have 8MB erase block size. Is it possible to adjust TokuDB to use 8MB IO ?

    2. What about transaction logs ? are not these also intensive but in much smaller size ?
    Is it possible for TokuDB to place transaction logs on a separate device ?

  18. Leif Walsh says:

    1. Yes, you can control this with tokudb_block_size, but remember compression will factor in after this number, tokudb_block_size is the uncompressed size.

    2. You can use tokudb_log_dir to put the logs on a different disk.

  19. Leif,

    So you do not have defined IO size? Just what it gets after compression of 1MB block ?

  20. Leif Walsh says:

    No, we take a defined uncompressed size, compress that, and write it, rather than compressing and maybe failing to hit the right size and having to compress again. This causes external fragmentation in some cases but the compression usually more than compensates. And the default uncompressed size is 4MB, not 1MB.

  21. Bradley Kuszmaul says:

    Transaction logs is an interesting question. For these device how much write amplification does sequential writing (with fsyncs) induce? If they have even a little bit of nvram they should be able to do ok. Our log files are 100MB and rotate out quickly if transactions are short lived.

  22. I need the Little Book on TokuDB to remember some of the interesting features it has. I assume these large block sizes are the block size for writes and still support smaller request sizes for reads (like 64kb).

  23. Leif Walsh says:

    Yep, tokudb_block_size is the write block size, tokudb_read_block_size is the read block size. It’s not little or a book, but this is in the user’s guide, http://www.tokutek.com/download.php?download_file=mysql-5.5.30-tokudb-7.0.4-users-guide.pdf page 10. It would be nice to have friendlier documentation for some of this stuff.

  24. Jon Clarke says:

    FlashSoft, by SanDisk, is a 100% software product that uses any solid state device to cache both reads and writes on the server. The Enterprise Strategy Group got impressive results in their lab validation report summary (on VMware) here: http://www.sandisk.com/assets/docs/esg-lab-summary-sandisk-flashsoft.pdf

  25. On today’s McAfee & Intel DAY conference in Kyiv speaker from Intel also suggested 3350 model, but he noticed that it’s very limited edition and it’s hard to get.

Speak Your Mind

*