Where the open source community meets: Secure your spot for Percona Live Amsterdam! - Register

Downloads

Blog

On SSDs – Lifespans, Health Measurement and RAID

October 12, 2012

Author

Ovais Tariq

MySQL

Share this Post:

Solid State Drive (SSD) have made it big and have made their way not only in desktop computing but also in mission-critical servers. SSDs have proved to be a break-through in IO performance and leave HDD far far behind in terms of Random IO performance. Random IO is what most of the database administrators would be concerned about as that is 90% of the IO pattern visible on database servers like MySQL. I have found Intel 520-series and Intel 910-series to be quite popular and they do give very good numbers in terms of Random IOPS. However, its not just performance that you should be concerned about, failure predictions and health gauges are also very important, as loss of data is a big NO-NO. There is a great deal of misconception about the endurance level of SSD, as its mostly compared to rotating disks even when measuring endurance levels, however, there is a big difference in how both SSD and HDD work, and that has a direct impact on the endurance level of SSD.

I will mostly be taling about MLC SSD, now let’s start off with a SSD primer.

SSD Primer

The smallest unit of SSD storage that can be read or written to is a page which is typically 4KB or 8KB in size. These pages are typically organized into blocks which are between 256KB or 1MB in size. SSDs have no mechanical parts and no heads or anything and their is no seeks needed as in conventional rotating disks. Reads involve reading pages from the SSD, however its the writes that are more tricky. Once you write to a page on SSD, you cannot simply overwrite (if you want to write new data) it in the same way you do with a HDD. Instead, you must erase the contents and then write again. However, a SSD can only do erasures at the block level and not the page level. What this means is that the SSD must relocate any valid data in the block to be erased, before the block can be erased and have new data written to it. To summarize, writes mean erase+write. Nowadays, SSD controllers are intelligent and do erasures in the background, so that the latency of the write operation is not affected. These background erasures are typically done within a process known garbage collection. You can imagine if these erasures were not done in the background, then writes would be too slow.

Of course every SSD has a lifespan after which it can be seen as unusable, let’s see what factors matter here.

SSD Lifespans

The lifespan of blocks that make up a SSD is really the number of times erasures and writes can be performed on those blocks. The lifespan is measure in terms of erase/write cycles. Typically enterprise grade MLC SSDs have a lifespan of about 30000 erase/write cycles, while consumer grade MLC SSD have a life span of 5000 to 10000 erase/write cycles. This fact makes it clear that the lifespan of a SSD depends on how much time it is written to. If you have a write-intensive workload then you should expect the SSD to fail much more quickly, in comparison to a read-heavy workload. This is by design.
To offset this behaviour of writes reducing the life of a SSD, engineers use two techniques, wear-levelling and over-provisioning. Wear-levelling works by making sure that all the blocks in a SSD are erased and written to in a evenly distributed fashion, this makes sure that some blocks do not die quickly then other blocks. Over-provisioning SSD capacity is one another technique that increases SSD endurance. This is accomplished by having a large population of blocks to distribute erases and writes over time (bigger capacity SSD), and by providing a large spare area. Many SSD models over provision the space, for example a 80GB SSD could have 10GB of over-provisioned space, so that while it is actually 90GB in size it is reported as a 80GB SSD. While this over-provisioning is done by the SSD manufacturers, this can also be done by not utilising the entire SSD, for example partitioning the SSD in such a way that you only partition about 75% to 80% of the SSD and leave the rest as RAW space that is not visible to the OS/filesystem. So while over-provisioning takes away some part of the disk capacity, it gives back in terms of increased endurance and performance.

Now comes the important part of the post that I would like to discuss.

Health Measurement and failure predictability

As you may have noticed after reading the above part of this post, its all the more important to be able to predict when a SSD would fail and to be able to see health related information about the SSD. Yet I haven’t found much written about how to gauge the health of a SSD. RAID controllers employed with SSD tend to be very limited in terms of the amount of information that they provide about an SSD that could allow predicting when a SSD could fail. However, most of the SSD provide a lot of information via S.M.A.R.T. and this can be leveraged to good affect.
Let’s consider the example of Intel SSD, these SSD have to S.M.A.R.T. attributes that can be leveraged to predict when the SSD would fail. These attributes are:

- Available_Reservd_Space: This attribute reports the number of reserve blocks remaining. The value of the attribute starts at 100, which means that the reserved space is 100 percent available. The threshold value for this attribute is 10 which means 10 percent availability, which indicates that the drive is close to its end of life.

- Media_Wearout_Indicator: This attribute reports the number of erase/write cycles the NAND media has performed. The value of the attribute decreases from 100 to 1, as the average erase cycle count increases from 0 to the maximum rated cycles. Once the value of this attribute reaches 1, the number will not decrease, although it is likely that significant additional wear can be put on the device. A value of 1 should be thought of as the threshold value for this attribute.

Using the smartctl tool (part of the smartmontools package) we can very easily read the values of these attributes and then use it to predict failures. For example for SATA SSD drives attached to a LSI Megaraid controller, we could very easily read the values of those attributes using the following bash snippet:

Available_Reservd_Space_current=$(smartctl -d sat+megaraid,${device_id} -a /dev/sda | grep "Available_Reservd_Space" | awk '{print $4}')
Media_Wearout_Indicator_current=$(smartctl -d sat+megaraid,${device_id} -a /dev/sda | grep "Media_Wearout_Indicator" | awk '{print $4}')

1 2	Available_Reservd_Space_current=$(smartctl -d sat+megaraid,${device_id} -a /dev/sda \| grep "Available_Reservd_Space" \| awk '{print $4}') Media_Wearout_Indicator_current=$(smartctl -d sat+megaraid,${device_id} -a /dev/sda \| grep "Media_Wearout_Indicator" \| awk '{print $4}')

Then the above information can be used in different fashions, we could raise an alert if its nearing the threshold value, or measure how quickly the values decrease and then use the rate of decrease to estimate when the drive could fail.

SSDs and RAID levels

RAID have been typically with HDD used for data protection via redundancy and for increased performance, and they have found their use with SSD as well. Its common to see RAID level 5 or 6 being used with SSD on mixed read/write workloads, because the write penalty visible by using these level with rotating disks, is not of that extent when talking about SSD because there is no disk seek involved, so the read-modify-write cycle typically involved with parity based RAID levels does not cause a lot of performance hit. On the other hand striping and mirroring does improve the read performance of the SSD a lot and redundant arrays using SSD deliver far better performance as compared to HDD arrays.
But what about data protection? Do the parity-based RAID levels and mirroring provide the same level of data protection for SSDs as they are thought of? I am skeptical about that, because as I have mentioned above the endurance of a SSD depends a lot on how much it has been written to. In parity-based RAID configurations, a lot of extra writes are generated because of parity changes and they of course decrease the lifespan of the SSD, similarly in the case of mirroring, I am not sure it can provide any benefit in case of wearing out of SSD, if both the SSD in the mirror configuration have the same age, why? Because in mirroring both the SSDs in the array would be receiving the same amount of writes and hence the lifespan would decrease at the same amount of time.
I would think that there is some drastic changes that are needed to the thought process when thinking of data protection and RAID levels, because for me parity-based configuration or mirroring configuration are not going to provide any extra data protection in cases where the SSD used are of similar ages. It might actually be a good idea to periodically replace drives with younger ones so as to make sure that all the drives do not age together.

I would like to know what my readers think!

0 0 votes

Article Rating

13 Comments

Oldest

Newest Most Voted

Baruch Even

13 years ago

You assume that all the disks will fail at the same time however even at the end of life the drives are more likely to fail in a staggered form since the cycle counts are only the minimum time guarantees but not the upper bound real-life case. A page/block is considered failed when that page or a certain number of pages in the block fail to be read, written or erased. That doesn’t happen all once due to various considerations, some are environmental (writing when the ssd is too hot or too cold f.ex.) and some are manufacturing related due to defects or inconsistencies across the wafer. That’s also why you see some blocks failing before others in the first place, it’s not just the raw write/erase cycles it the inherent attributes of that block.

It should also be noted that the page is considered failed not when it can’t program some bit but when the ECC can’t correct enough bits so you may go around with some failed bits for a long time before dropping that page.

On top of that there are earlier failures that happen in the guaranteed period, these do happen and without RAID you’d lose the data very early on.

All that said, there is room to consider what happens to a RAID group at its end of life, the monitoring of SSDs should definitely be constructed so that if too many SSDs show signs of nearing their end of life then proactive maintenance is in order and some of them should be replaced with newer ones.

One thing to add to the monitoring logic that you present, SMART itself is not sufficient, the real metric is the disk latency measures. If you are not monitoring that you are far more likely to have an unexpected disk failure. I’m not aware of an open study on this with regard to SSDs but the Google study on HDDs showed that SMART was not very useful. And my gut feeling and experience so far showed few failures from SSDs getting to their end of life and more failing due to other reasons and most of these were not predicted by SMART.

Author

Ovais Tariq

13 years ago

Reply to Baruch Even

Baruch,

You have raised some very interesting points. However, I still think that SSD lifespan can still be gauged by measuring the number of write and erase cycles. Of course this will not be 100% accurate, but still it will provide a good approximation. Every SSD has a lifespan that can defined as the number of writes that it can sustain. So that should be a good number to approximate when the drive could fail. Of course there may be other reasons of drive failures such as unexpected hardware failure and in that case RAID would certainly help. But my point is that considering that the SSDs attached to a controller are all being accessed and written to at the same time, then its more than likely that they will fail nearly at the same time. And that to me is the concerning factor when using raid levels such as RAID 1. You are right there are other environmental factors that should be taken into account. But when considering disks attached to the same controller, I do not think the environmental factor will come into account separately for each disk as they would all be operating under the same environment.

Regarding SMART, its by no means a very perfect way to monitor the health of a disk, but when you do not have other reliable information, SMART gives enough information that can be used to predict when the life span of a SSD is nearing its end.

David Burg

13 years ago

It is not a good idea to need to periodically replace drives. Replacing a drive infer that you pay labor to go touch your system’s hardware. Not to mention that you may further have down time unless you have an hot-swap RAID (add $$$). On the contrary you want to reduce labor cost to a minimum, especially when you are looking for scale (-out).
Yes, RAID is a spining-disk-focused technology. It needs some amount of rethinking to efficienctly use and improve SSDs.

Author

Ovais Tariq

13 years ago

Reply to David Burg

David,

Yes reducing cost is important but so is data protection important. And I agree RAID needs some sort of rethinking to be effectively used with SSDs

Carl Woodland

13 years ago

Great information, and comments. Its nice to have this forum to exchange ideas. So, to that point…a what if …. what if your SSDs lasted longer than your server or storage subsystem that the SSD was installed in?? What if, you could write to a 24nm MLC SSD 10-25..or more, full capacity writes per day for 5 years…and all of that with sustained, consistent performance…….garunteeeeeed! To accomplish this, the SSD will need to employ Advanced Digital Signal Processing, have DataPath Protection, industry leading Wear-Leveling and Garbage Collection algorithms that are smart about how to do all this work, and still maintain the performance you bought this drive to deliver in the first place. Comments and replies welcomed! If you choose to comment off this forum, I am easy to find on LinkedIn…

Author

Ovais Tariq

13 years ago

Carl,

I assume that you are suggesting that the company you work for, manufacture SSDs that meet the specifications you mentioned. If such is true, it would be good to lay a hand on it and test drive 🙂

Sean

13 years ago

Ovais, this is one of the clearest posts I’ve seen about SSD and their lifespans. Thank you. I wasn’t aware that there is a good chance of both drives failing around the same time.

I’m building a new server for a start-up client and going with RAID-1 SSD for the DB and OS. Wouldn’t a simple solution be to rotate your hot spare a couple months in? This seems a logical step to increase data safety given that one SSD’s end of life is couple months ahead of the other.

Author

Ovais Tariq

13 years ago

Hi Sean,

Hot spare would only be used when one of the drives in the RAID 1 array fails. In which case you would already have the full life of the SSD available, so rotating hot spares over a couple of months, in my opinion, is not going to give any additional benefit. The lifespan of the SSDs is governed by how many writes it has received and not strictly by its age.

theunafraid

13 years ago

How about using a single hdd(for parity or whatever) with the rest ssd, for this one would probably need special drivers right?

Author

Ovais Tariq

13 years ago

Reply to theunafraid

Why would you want to do that? It would still act as a bottleneck undermining the performance of the SSDs. If you are using a parity based RAID level (level 5 or 6) then every write means reading not only the data block but also the parity block, therefore storing the parity on HDD would cause an overall impact on write performance.

Keith Josephson

12 years ago

My company has offered an all-SSD server for more than three years, and I have had the opportunity to spend a lot of time benchmarking SSD RAID and evaluating both hte results and the impact on SSDs. Long term testing has all been done with Intel SSDs, first from the X-25M (MLC) and X-25E (SLC), and then from various MLC families including i320, i520 and now the S3500 series.

Over-provisioning definitely makes a difference in both longevity and sustained performance. ION over-provisions by 20%. At this level of OP, our systems have encountered no wear-out failures and, indeed, no drives that report less than 90% on the S.M.A.R.T. Media WearOut Indicator.

We have given quite a bit of thought to the appropriate RAID level for SSD arrays. RAID 5 does mean that any time a random write (of less than or equal to stripe size) occurs, that two SSDs will be written to. Given suffciently random writes, that amounts to an effective doubling of the write burden on each of the SSDs in the array. For larger writes, the effect is less. RAID 6 is worse, with the I/O burden tripled. On RAID 10, the burden is double for all I/O sizes.

While RAID 5 on SSD suffers from a read/modify/write impact like it does on rotating disks, the time to read a stripe is much less than write time, so the impact is not nearly as bad. Compared to write performance on SSD RAID 10 or SSD RAID 0, there is definitely a performance hit, but compared to ANY RAID on rotating disks, the performance is astounding.

It is important to monitor the endurance remaining on SSDs in RAID configurations, but with proper provisioning and selection of enterprise SSDs,, both performance and endurance with SSD RAID 5 are quite good.

Ultim

10 years ago

If it is interesting, you can learn about removable hard drives in this article https://hetmanrecovery.com/recovery_news/predicting-ssd-failures-ssd-specific-smart-values.htm , but not to read tea leaves that to what – why unnecessary disputes))