Death match! EBS versus SSD price, performance, and QoS

Is it a good idea to deploy your database into the cloud? It depends. I have seen it work well many times, and cause trouble at other times. In this blog post I want to examine cloud-based I/O. I/O matters a lot when a) the database’s working set is bigger than the server’s memory, or b) the workload is write-heavy. If this is the case, how expensive is it to get good performance, relative to what you get with physical hardware? Specifically, how does it compare to commodity solid-state drives? Let’s put them in the ring and let them duke it out.

I could do benchmarks, but that would not be interesting — we already know that benchmarks are unrealistic, and we know that SSDs would win. I’d rather look at real systems and see how they behave. Do the theoretical advantages of SSDs really translate into a big advantage in practice? I will show the performance of two real customer systems running web applications.

Let’s begin with a system running in a popular hosting provider’s datacenter. This application is a popular blogging service, running on a generic midrange Dell-class server. The disks are six OCZ-VERTEX2 200-GB drives in a RAID10 array, behind an LSI MegaRAID controller with a BBU. These disks currently cost about $400 each, and are roughly half full. (That actually matters — the fuller they get, the slower they are.) So let’s call this a $2500 disk array, and you can plug that into your favorite server and hosting provider costs to see what the CapEx and OpEx are for you. Assuming you depreciate it over 3 years, let’s call it a $1000-per-year storage array, just to make it a round number. These aren’t the most reliable disks in my experience, and you are likely to need to replace one along the way, for example. If you rent this array instead of buying it, the cost is likely to be quite a bit higher.
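For what it’s worth, here is that back-of-the-envelope arithmetic as a small Python sketch; the drive count, prices, and depreciation period are the figures above, and the single replacement drive is just the allowance I mentioned.

```python
# Rough annualized cost of the SSD array described above.
DRIVES = 6
PRICE_PER_DRIVE = 400         # USD, approximate current street price
REPLACEMENTS = 1              # allow for replacing one drive over its life
DEPRECIATION_YEARS = 3

array_cost = DRIVES * PRICE_PER_DRIVE                     # $2400, call it ~$2500
total_cost = array_cost + REPLACEMENTS * PRICE_PER_DRIVE  # $2800 with a spare
per_year = total_cost / DEPRECIATION_YEARS                # ~$933, call it ~$1000

print(f"array: ${array_cost}, per year over {DEPRECIATION_YEARS} years: ${per_year:.0f}")
```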

Now, let’s look at the performance this server is getting from its disks. I’m using the Aspersa diskstats tool to pull this data straight from /proc/diskstats. I’ll aggregate over long periods of time in my sample file so we can see performance throughout the day. If you’re not familiar with the diskstats tool, the columns are the number of seconds in the sample, the device name, and the following statistics for reads: MB/s, average concurrency, and average response time in milliseconds. The same statistics are repeated for writes, and then we have the percent of time the device was busy, and the average number of requests in progress at the time the samples were taken.
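To show where those columns come from, here is a minimal Python sketch that approximates the same arithmetic from two snapshots of /proc/diskstats. It is not the Aspersa tool itself, just an illustration of the underlying field math; it assumes the standard kernel field layout and 512-byte sectors, and the device name at the bottom is a placeholder.

```python
import time

# Field offsets in /proc/diskstats (after splitting on whitespace):
# 3 reads completed, 5 sectors read, 6 ms spent reading,
# 7 writes completed, 9 sectors written, 10 ms spent writing,
# 11 I/Os currently in progress, 12 ms the device was busy.
def read_counters(device):
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if len(fields) > 12 and fields[2] == device:
                return [int(fields[i]) for i in (3, 5, 6, 7, 9, 10, 11, 12)]
    raise ValueError(f"device {device!r} not found")

def sample(device, interval=1.0):
    """Print read MB/s, concurrency, response time; the same for writes; busy %."""
    a = read_counters(device)
    time.sleep(interval)
    b = read_counters(device)
    rd_ios, rd_sec, rd_ms, wr_ios, wr_sec, wr_ms, _, busy_ms = (
        y - x for x, y in zip(a, b))

    rd_mbs = rd_sec * 512 / interval / 1e6
    rd_cnc = rd_ms / (interval * 1000.0)         # avg reads in flight (Little's law)
    rd_rt = rd_ms / rd_ios if rd_ios else 0.0    # avg read response time, ms

    wr_mbs = wr_sec * 512 / interval / 1e6
    wr_cnc = wr_ms / (interval * 1000.0)
    wr_rt = wr_ms / wr_ios if wr_ios else 0.0

    busy_pct = 100.0 * busy_ms / (interval * 1000.0)
    in_prg = b[6]                                # requests in progress right now

    print(f"{device}  rd {rd_mbs:5.1f} MB/s cnc {rd_cnc:4.1f} rt {rd_rt:5.1f}  "
          f"wr {wr_mbs:5.1f} MB/s cnc {wr_cnc:4.1f} rt {wr_rt:5.1f}  "
          f"busy {busy_pct:5.1f}%  in-prg {in_prg}")

if __name__ == "__main__":
    sample("sda")  # hypothetical device name
```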

So we’re reading 20-30 MB/s from these disks, with average latencies generally under a millisecond, and we’re writing a few MB/s with latencies about the same. The device is active about 40% to 50% of the time, but given that we know there are 3 pairs of drives behind it, the device wouldn’t be 100% utilized until average read concurrency reached 6 and write concurrency reached 3. One of the samples shows the performance during a period of pretty high utilization. (Note: a read concurrency of 2.7 on a 6-device array, plus a few writes that have to be sent to both devices in a mirrored pair, works out to roughly 50% utilization. This is one of the reasons I wrote the diskstats tool: it lets you understand busy-ness and utilization correctly. The %util that iostat displays is confusing; you have to do some tedious math to get something approximating the real device utilization, and even then reads and writes aren’t broken out, so you can’t understand performance and utilization at this level of detail from iostat’s output.)
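As a rough illustration of that arithmetic (a sketch under my own simplifying assumptions, not the tool’s exact model): in a 6-drive RAID10 array a read can be served by any one drive, but a logical write has to hit both drives in its mirrored pair, so it consumes roughly twice the device time.

```python
def raid10_utilization(read_cnc, write_cnc, drives):
    """Rough utilization estimate for a RAID10 array: reads spread across
    all drives, each logical write costs about two drives' worth of time."""
    return (read_cnc + 2 * write_cnc) / drives

# The busy sample described above: read concurrency of 2.7 plus a few writes
# (the 0.2 write concurrency here is illustrative) on a 6-drive array
# comes out to roughly 50% utilized.
print(f"{raid10_utilization(2.7, 0.2, 6):.0%}")   # -> 52%
```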

I would characterize this as very good performance. Consistently sub-millisecond latencies to disk, for both reads and writes, are very good. I’m averaging across long periods of time here, so you can’t really see it, but there are significant spikes of load on the server during these times, and the disks keep responding in less than 2 ms. I could zoom in to the 30-second or 1-second level and show you that it remains the same from second to second. I could slice and dice this data all different ways, but it would be boring, because it looks the same from every angle.

Now let’s switch to a different customer’s system. This one is a popular online retail store, running in the Amazon cloud. It’s a quadruple extra large EC2 server (currently priced at $2/hour for a non-reserved instance) with a software RAID10 array of EBS volumes. As time has passed, they’ve added more EBS volumes to keep up with load. Between the time I sampled statistics last week and now, they went from a 10-volume array to a 20-volume array, for example. But the samples I’ll show are from an array of 10 x 100GB EBS volumes. EBS currently lists at $0.10 per GB-month of provisioned storage, so the storage itself should cost $100 per month. I also grabbed the counters from /proc/diskstats and computed the cost of the I/O operations done so far on this array, at the list price of $0.10 per 1 million I/O requests, to be about $126 (the arithmetic is sketched below). This is over a long period of time, so the counters could have wrapped, but let’s assume not. So this disk array might cost $1500 or so per year. What do we get for that? Let’s look at some second-by-second output of the diskstats tool during a 30-second period, aggregated over all of the 10 EBS volumes:
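Here is a sketch of how that I/O-request figure can be derived from the cumulative /proc/diskstats counters; the device names are placeholders for the ten volumes in the array, and the prices are the list prices quoted above.

```python
# Estimate EBS costs from cumulative /proc/diskstats counters.
GB_MONTH_PRICE = 0.10              # USD per GB-month of provisioned storage
IO_REQUEST_PRICE = 0.10 / 1e6      # USD per I/O request
VOLUMES, VOLUME_GB = 10, 100

def lifetime_ios(devices):
    """Sum reads completed and writes completed (split indices 3 and 7)
    for the given devices."""
    total = 0
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if len(fields) > 7 and fields[2] in devices:
                total += int(fields[3]) + int(fields[7])
    return total

# Hypothetical device names for the ten EBS volumes in the software RAID array.
ebs_devices = {f"sd{letter}1" for letter in "abcdefghij"}

storage_per_month = VOLUMES * VOLUME_GB * GB_MONTH_PRICE   # $100/month
io_cost_to_date = lifetime_ios(ebs_devices) * IO_REQUEST_PRICE

print(f"storage: ${storage_per_month:.0f}/month, I/O to date: ${io_cost_to_date:.2f}")
```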

In terms of throughput, we’re getting a couple of megabytes per second of reads, and generally less than a megabyte per second of writes, but the latency is both high and variable even though the devices are idle most of the time. In some time periods, the write latency gets into the 30s of milliseconds, and the read latency goes above 130 milliseconds. That is an average over one second per sample, across all 10 devices aggregated together in each line.

I can switch the view of the same data to look at it disk-by-disk, aggregated over the entire 30 seconds during which I captured samples of /proc/diskstats. Here are the statistics for the EBS volumes over that time period:

So over the 30-second period (shown as {29} in the output, because the first sample is used only as a baseline to subtract from the others), we read and wrote about half a megabyte per second on each volume, with read latencies varying from the teens to the seventies of milliseconds, and write latencies in the teens. Note how variable the quality of service from these EBS volumes is — some are fast, some are slow, even though we are asking the same thing of all of them (I wrote on my own blog about the reasons for this and the wrench it throws into capacity planning). If we zoom in on a particular sample — say the one taken 23.2 seconds into the period — we can see non-aggregated statistics:

During that time period, two of these devices were taking more than 160 and 170 milliseconds, respectively, to respond. One final zoom-in on a specific device, and I’ll stop belaboring the point. Let’s look at the performance of sdj1 over the entire sample period: