Death match! EBS versus SSD price, performance, and QoS

Is it a good idea to deploy your database into the cloud? It depends. I have seen it work well many times, and cause trouble at other times. In this blog post I want to examine cloud-based I/O. I/O matters a lot when a) the database’s working set is bigger than the server’s memory, or b) the workload is write-heavy. If this is the case, how expensive is it to get good performance, relative to what you get with physical hardware? Specifically, how does it compare to commodity solid-state drives? Let’s put them in the ring and let them duke it out.

I could do benchmarks, but that would not be interesting — we already know that benchmarks are unrealistic, and we know that SSDs would win. I’d rather look at real systems and see how they behave. Do the theoretical advantages of SSDs really translate into a big advantage in practice? I will show the performance of two real customer systems running web applications.

Let’s begin with a system running in a popular hosting provider’s datacenter. This application is a popular blogging service, running on a generic midrange Dell-class server. The disks are six OCZ-VERTEX2 200-GB drives in a RAID10 array, behind an LSI MegaRAID controller with a BBU. These disks currently cost about $400 each, and are roughly half full. (That actually matters — the fuller they get, the slower they are.) So let’s call this a $2500 disk array, and you can plug that into your favorite server and hosting provider costs to see what the CapEx and OpEx are for you. Assuming you depreciate it over 3 years, let’s call it a $1000-per-year storage array, just to make it a round number. These aren’t the most reliable disks in my experience, and you are likely to need to replace one along the way, for example. If you rent this array instead of buying it, the cost is likely to be quite a bit higher.
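For what it’s worth, here is that back-of-the-envelope arithmetic as a small Python sketch; the drive count, prices, and depreciation period are the figures above, and the single replacement drive is just the allowance I mentioned.

```python
# Rough annualized cost of the SSD array described above.
DRIVES = 6
PRICE_PER_DRIVE = 400         # USD, approximate current street price
REPLACEMENTS = 1              # allow for replacing one drive over its life
DEPRECIATION_YEARS = 3

array_cost = DRIVES * PRICE_PER_DRIVE                     # $2400, call it ~$2500
total_cost = array_cost + REPLACEMENTS * PRICE_PER_DRIVE  # $2800 with a spare
per_year = total_cost / DEPRECIATION_YEARS                # ~$933, call it ~$1000

print(f"array: ${array_cost}, per year over {DEPRECIATION_YEARS} years: ${per_year:.0f}")
```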

Now, let’s look at the performance this server is getting from its disks. I’m using the Aspersa diskstats tool to pull this data straight from /proc/diskstats. I’ll aggregate over long periods of time in my sample file so we can see performance throughout the day. If you’re not familiar with the diskstats tool, the columns are the number of seconds in the sample, the device name, and the following statistics for reads: MB/s, average concurrency, and average response time in milliseconds. The same statistics are repeated for writes, and then we have the percent of time the device was busy, and the average number of requests in progress at the time the samples were taken.
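To show where those columns come from, here is a minimal Python sketch that approximates the same arithmetic from two snapshots of /proc/diskstats. It is not the Aspersa tool itself, just an illustration of the underlying field math; it assumes the standard kernel field layout and 512-byte sectors, and the device name at the bottom is a placeholder.

```python
import time

# Field offsets in /proc/diskstats (after splitting on whitespace):
# 3 reads completed, 5 sectors read, 6 ms spent reading,
# 7 writes completed, 9 sectors written, 10 ms spent writing,
# 11 I/Os currently in progress, 12 ms the device was busy.
def read_counters(device):
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if len(fields) > 12 and fields[2] == device:
                return [int(fields[i]) for i in (3, 5, 6, 7, 9, 10, 11, 12)]
    raise ValueError(f"device {device!r} not found")

def sample(device, interval=1.0):
    """Print read MB/s, concurrency, response time; the same for writes; busy %."""
    a = read_counters(device)
    time.sleep(interval)
    b = read_counters(device)
    rd_ios, rd_sec, rd_ms, wr_ios, wr_sec, wr_ms, _, busy_ms = (
        y - x for x, y in zip(a, b))

    rd_mbs = rd_sec * 512 / interval / 1e6
    rd_cnc = rd_ms / (interval * 1000.0)         # avg reads in flight (Little's law)
    rd_rt = rd_ms / rd_ios if rd_ios else 0.0    # avg read response time, ms

    wr_mbs = wr_sec * 512 / interval / 1e6
    wr_cnc = wr_ms / (interval * 1000.0)
    wr_rt = wr_ms / wr_ios if wr_ios else 0.0

    busy_pct = 100.0 * busy_ms / (interval * 1000.0)
    in_prg = b[6]                                # requests in progress right now

    print(f"{device}  rd {rd_mbs:5.1f} MB/s cnc {rd_cnc:4.1f} rt {rd_rt:5.1f}  "
          f"wr {wr_mbs:5.1f} MB/s cnc {wr_cnc:4.1f} rt {wr_rt:5.1f}  "
          f"busy {busy_pct:5.1f}%  in-prg {in_prg}")

if __name__ == "__main__":
    sample("sda")  # hypothetical device name
```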

So we’re reading 20-30 MB/s from these disks, with average latencies generally under a millisecond, and we’re writing a few MB/s with latencies about the same. The device is active about 40% to 50% of the time, but given that we know there are 3 pairs of drives behind it, the device wouldn’t be 100% utilized until average read concurrency reached 6 and write concurrency reached 3. One of the samples shows the performance during a period of pretty high utilization. (Note: a read concurrency of 2.7 on a 6-device array, plus a few writes that have to be sent to both devices in a mirrored pair, works out to roughly 50% utilization. This is one of the reasons I wrote the diskstats tool: it lets you understand busy-ness and utilization correctly. The %util that iostat displays is confusing; you have to do some tedious math to get something approximating the real device utilization, and even then reads and writes aren’t broken out, so you can’t understand performance and utilization at this level of detail from iostat’s output.)
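As a rough illustration of that arithmetic (a sketch under my own simplifying assumptions, not the tool’s exact model): in a 6-drive RAID10 array a read can be served by any one drive, but a logical write has to hit both drives in its mirrored pair, so it consumes roughly twice the device time.

```python
def raid10_utilization(read_cnc, write_cnc, drives):
    """Rough utilization estimate for a RAID10 array: reads spread across
    all drives, each logical write costs about two drives' worth of time."""
    return (read_cnc + 2 * write_cnc) / drives

# The busy sample described above: read concurrency of 2.7 plus a few writes
# (the 0.2 write concurrency here is illustrative) on a 6-drive array
# comes out to roughly 50% utilized.
print(f"{raid10_utilization(2.7, 0.2, 6):.0%}")   # -> 52%
```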

I would characterize this as very good performance. Consistently sub-millisecond latencies to disk, for both reads and writes, are very good. I’m averaging across long periods of time here, so you can’t really see it, but there are significant spikes of load on the server during these times, and the disks keep responding in less than 2 ms. I could zoom in to the 30-second or 1-second level and show you that it remains the same from second to second. I could slice and dice this data all different ways, but it would be boring, because it looks the same from every angle.

Now let’s switch to a different customer’s system. This one is a popular online retail store, running in the Amazon cloud. It’s a quadruple extra large EC2 server (currently priced at $2/hour for a non-reserved instance) with a software RAID10 array of EBS volumes. As time has passed, they’ve added more EBS volumes to keep up with load. Between the time I sampled statistics last week and now, they went from a 10-volume array to a 20-volume array, for example. But the samples I’ll show are from an array of 10 x 100GB EBS volumes. EBS currently lists at $0.10 per GB-month of provisioned storage, so the storage itself should cost $100 per month. I also grabbed the counters from /proc/diskstats and computed the cost of the I/O operations done so far on this array, at the list price of $0.10 per 1 million I/O requests, to be about $126 (the arithmetic is sketched below). This is over a long period of time, so the counters could have wrapped, but let’s assume not. So this disk array might cost $1500 or so per year. What do we get for that? Let’s look at some second-by-second output of the diskstats tool during a 30-second period, aggregated over all of the 10 EBS volumes:
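Here is a sketch of how that I/O-request figure can be derived from the cumulative /proc/diskstats counters; the device names are placeholders for the ten volumes in the array, and the prices are the list prices quoted above.

```python
# Estimate EBS costs from cumulative /proc/diskstats counters.
GB_MONTH_PRICE = 0.10              # USD per GB-month of provisioned storage
IO_REQUEST_PRICE = 0.10 / 1e6      # USD per I/O request
VOLUMES, VOLUME_GB = 10, 100

def lifetime_ios(devices):
    """Sum reads completed and writes completed (split indices 3 and 7)
    for the given devices."""
    total = 0
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if len(fields) > 7 and fields[2] in devices:
                total += int(fields[3]) + int(fields[7])
    return total

# Hypothetical device names for the ten EBS volumes in the software RAID array.
ebs_devices = {f"sd{letter}1" for letter in "abcdefghij"}

storage_per_month = VOLUMES * VOLUME_GB * GB_MONTH_PRICE   # $100/month
io_cost_to_date = lifetime_ios(ebs_devices) * IO_REQUEST_PRICE

print(f"storage: ${storage_per_month:.0f}/month, I/O to date: ${io_cost_to_date:.2f}")
```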

In terms of throughput, we’re getting a couple of megabytes per second of reads, and generally less than a megabyte per second of writes, but the latency is both high and variable even though the devices are idle most of the time. In some time periods, the write latency gets into the 30s of milliseconds, and the read latency goes above 130 milliseconds. That is an average over one second per sample, across all 10 devices aggregated together in each line.

I can switch the view of the same data to look at it disk-by-disk, aggregated over the entire 30 seconds during which I captured samples of /proc/diskstats. Here are the statistics for the EBS volumes over that time period:

So over the 30-second period (shown as {29} in the output, because the first sample is used only as a baseline to subtract from the others), we read and wrote about half a megabyte per second on each volume, with read latencies varying from the teens to the seventies of milliseconds, and write latencies in the teens. Note how variable the quality of service from these EBS volumes is — some are fast, some are slow, even though we are asking the same thing of all of them (I wrote on my own blog about the reasons for this and the wrench it throws into capacity planning). If we zoom in on a particular sample — say the one taken 23.2 seconds into the period — we can see non-aggregated statistics:

During that time period, two of these devices were taking more than 160 and 170 milliseconds, respectively, to respond. One final zoom-in on a specific device, and I’ll stop belaboring the point. Let’s look at the performance of sdj1 over the entire sample period: