November 23, 2014

Why %util number from iostat is meaningless for MySQL capacity planning

Earlier this month I wrote about vmstat iowait cpu numbers, and some of the comments I got advocated using %util as reported by the iostat tool instead. I find this number even more useless for MySQL performance tuning and capacity planning.

Now let me start by saying this is a really tricky and deceptive number. Many DBAs who report their systems as having a very busy IO subsystem say that %util in iostat was above 99%, and therefore they believe this number is a good indicator of an overloaded IO subsystem.

Indeed – when your IO subsystem is busy, up to its full capacity, the utilization should be very close to 100%. However, it is perfectly possible for the IO subsystem, and MySQL along with it, to have plenty of spare capacity even while utilization shows 100% – as I will demonstrate with an example.

Before that, though, let's see what the iostat manual page has to say on this topic. From the man page we can read:

%util

Percentage of CPU time during which I/O requests were issued to the device (bandwidth utilization for the device). Device saturation occurs when this value is close to 100% for devices serving requests serially. But for devices serving requests in parallel, such as RAID arrays and modern SSDs, this number does not reflect their performance limits.

This says right there that the number is useless for pretty much any production database server, which is likely to be running RAID, flash/SSD, SAN, or cloud storage (such as EBS) – all capable of handling multiple requests in parallel.
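For reference, the extended per-device statistics quoted throughout this post are what iostat prints in extended mode. The exact invocation used for the runs below isn't shown, but a typical one (the 10-second interval here is just an example) looks like this:

iostat -x 10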

Let’s look at the following illustration. I will run sysbench on a system with rather slow storage, a data size larger than the buffer pool, and a uniform access distribution to put pressure on the IO subsystem. I will use a read-only benchmark here, as it keeps things simpler…

sysbench --num-threads=1 --max-requests=0 --max-time=6000000 --report-interval=10 --test=oltp --oltp-read-only=on --db-driver=mysql --oltp-table-size=100000000 --oltp-dist-type=uniform --init-rng=on --mysql-user=root --mysql-password= run

I’m seeing some 9 transactions per second, while disk utilization from iostat is at nearly 96%:

[ 80s] threads: 1, tps: 9.30, reads/s: 130.20, writes/s: 0.00 response time: 171.82ms (95%)
[ 90s] threads: 1, tps: 9.20, reads/s: 128.80, writes/s: 0.00 response time: 157.72ms (95%)
[ 100s] threads: 1, tps: 9.00, reads/s: 126.00, writes/s: 0.00 response time: 215.38ms (95%)
[ 110s] threads: 1, tps: 9.30, reads/s: 130.20, writes/s: 0.00 response time: 141.39ms (95%)

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
dm-0 0.00 0.00 127.90 0.70 4070.40 28.00 31.87 1.01 7.83 7.52 96.68
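(As a quick sanity check on where that number comes from: %util is effectively the request rate multiplied by the average service time, so (127.9 + 0.7) requests/s * 7.52ms ≈ 0.97, matching the 96.68% reported. iostat in fact derives svctm from %util and the request rate, so this identity holds by construction.)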

This makes a lot of sense – with a single-threaded read workload, the drives are only busy fetching the data the queries need, which will not add up to 100% because some time is also spent processing the query on the MySQL side and passing the result set back to sysbench.

So: 96% utilization at 9 transactions per second. This is close to full system capacity, with less than 5% of device time to spare, right?

Let’s run a benchmark with more concurrency – 4 threads at a time; we’ll see…

[ 110s] threads: 4, tps: 21.10, reads/s: 295.40, writes/s: 0.00 response time: 312.09ms (95%)
[ 120s] threads: 4, tps: 22.00, reads/s: 308.00, writes/s: 0.00 response time: 297.05ms (95%)
[ 130s] threads: 4, tps: 22.40, reads/s: 313.60, writes/s: 0.00 response time: 335.34ms (95%)

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
dm-0 0.00 0.00 295.40 0.90 9372.80 35.20 31.75 4.06 13.69 3.38 100.01

So we’re seeing 100% utilization now, but here is what’s interesting: we were able to reclaim much more than the less-than-5% that utilization suggested was left – throughput of the system increased about 2.5x.

Finally, let’s run the test with 64 threads – more concurrency than exists at the storage level, which on this system is conventional hard drives in a RAID array…

[ 70s] threads: 64, tps: 42.90, reads/s: 600.60, writes/s: 0.00 response time: 2516.43ms (95%)
[ 80s] threads: 64, tps: 42.40, reads/s: 593.60, writes/s: 0.00 response time: 2613.15ms (95%)
[ 90s] threads: 64, tps: 44.80, reads/s: 627.20, writes/s: 0.00 response time: 2404.49ms (95%)

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
dm-0 0.00 0.00 601.20 0.80 19065.60 33.60 31.73 65.98 108.72 1.66 100.00

In this case we’re getting 4.5x the throughput of the single-thread run, again at 100% utilization. We’re also getting almost double the throughput of the 4-thread run, where 100% utilization was already being reported. This makes sense – there are 4 drives that can work in parallel, and with many outstanding requests they can optimize their seeks better, hence yielding a bit more than 4x.

So what have we seen? A system that appeared to be at 96% of capacity could still be driven to provide 4.5x the throughput – so it had plenty of extra IO capacity. More powerful storage may have significantly more ability to run requests in parallel, so it is quite possible to see 10x or more headroom left after %util starts being reported close to 100%.

So if %util is not very helpful, what can we use to understand our database IO capacity better? First, let’s look at the performance reported by those sysbench runs. If we look at the 95th-percentile response time, you can see the 1-thread and 4-thread runs remained relatively close, growing just from ~150ms to 250-300ms. This is the number I really like to look at: if the system is able to respond to queries with a response time not significantly higher than it shows at a concurrency of 1, it is not overloaded. I like using 3x as the multiplier, i.e. when the 95th percentile spikes to more than 3x the single-concurrency value (here, roughly 450-500ms), the system may be getting close to overload.

With 64 threads, the 95th-percentile response time is 15-20x what we see with a single thread, so the system is surely overloaded.
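If you log the sysbench output, pulling that 95th-percentile number out for trending is trivial. A minimal sketch against the report format shown above (sysbench.log is a hypothetical file name for the captured output):

grep -o 'response time: [0-9.]*ms' sysbench.log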

Do we have anything reported by iostat that we can use in a similar way? It turns out we do! Check out the “await” column, which tells us how long the requester had to wait for an IO request to be serviced. At a concurrency of 1 it is 7.8ms, which is what these drives can do for random IO; that is as good as it gets. With 4 threads it is 13.7ms – less than double the best possible, so still good enough… with a concurrency of 64, however, it is 108ms, over 10x what this device can deliver with no waiting, which is surely a sign of overload.
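If you want to keep an eye on await over time, it is easy to extract just that field from iostat’s output. A minimal sketch against the column layout shown above, where await is the 10th field (column positions differ between sysstat versions, so verify against your own header line first):

iostat -x 10 | awk '/^dm-0/ {print $10}'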

A couple of words of caution. First, do not look at svctm, which is not designed with parallel processing in mind. You can see that in our case it actually gets better at high concurrency, while in reality the database had to wait much longer for the requests it submitted. Second, iostat mixes reads and writes together in a single set of statistics, which for databases – and especially on RAID with a BBU – can be like mixing apples and oranges: writes should go to the write-back cache and be acknowledged essentially instantly, while reads only complete when the actual data can be delivered. The pt-diskstats tool from Percona Toolkit breaks them apart and so can be much more useful for storage tuning for database workloads. Some recent operating systems also ship with a sysstat/iostat that breaks await out into r_await and w_await, which is much more useful.
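Both are easy to try: pt-diskstats reports read and write statistics in separate columns, and on a recent enough sysstat the same extended iostat invocation used above already includes the split r_await/w_await columns:

pt-diskstats
iostat -x 10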

A final note: I used a read-only workload on purpose. When it comes to writes, things can be even more complicated: the MySQL buffer pool can be more efficient with more intensive writes, plus group commit may be able to commit a lot of transactions with a single disk write. Still, the same basic logic will apply.

Summary: The takeaway is simple. %util only shows what fraction of the time a device had at least one operation to deal with; it cannot distinguish that from being completely busy, and so does not reflect actual utilization for the majority of modern IO subsystems. You may have a lot of storage IO capacity left even when utilization is close to 100%.

About Peter Zaitsev

Peter managed the High Performance Group within MySQL until 2006, when he founded Percona. Peter has a Master's Degree in Computer Science and is an expert in database kernels, computer hardware, and application scaling.

Comments

  1. nate says:

    I’ve seen lots of times where 100% util = device saturation (from a host perspective), while simultaneously the device itself has plenty of capacity (in these cases a 3PAR SAN, which collects thousands of data points internally). I don’t mind too much, though, since that same storage system is responsible for 10s to 100s of other systems at the same time (none of which show I/O contention).

    I would not want a single mysql instance to consume the entire back end shared storage system! I suppose I could invest in 3PAR priority optimization (probably the only true storage QoS on the market) but to-date haven’t had a need.

    There are other queries that sometimes run that can actually saturate the back end, and the back-end monitoring (and, if it’s not caught in time, those 100s of systems) shows/feels it. The Percona auto-kill tool gets them within 60 seconds though; we haven’t had an incident from them in about a year.

    Wait times are a good measurement too, of course, but my experience tells me that if things are running slow, look at that column; if it is maxed, then there is the bottleneck (whether it is the file system, queue depth, or something else at the host/OS level).

    Maybe the mysql instance is capable of doing more, but whatever operation(s) are driving that I/O are not likely to execute any faster (and that’s all I really care about at that point in time).

  2. Baron says:

    Adrian Cockcroft did a great paper and talk at CMG a while back, calling utilization virtually useless as a metric in general. Not just for disks. He gives a lot of very good reasons why this is so. It is well worth reading that paper to understand just how broken utilization becomes in modern architectures with virtualization and so on.

  3. ives says:

    @Peter: interesting post. We ran into severe performance issues on AWS and began looking at storage performance via iostat. While you can increase overall read/write throughput at the expense of read/write queueing and longer I/O wait times, it’s not clear which parts of mysql are the most sensitive to increases in latency.

    Transaction logs can’t tolerate added latency without a direct impact on commits and overall transaction duration as far as the db client is concerned. Higher write queues while writing to the tablespace might be acceptable, but I/O contention while writing out dirty pages might result in performance spikes (similar to performance drops in JVMs when garbage collection kicks in).

    Being new to mysql, I’ve spent a lot of time digging around for articles on AWS EBS performance (non-PIOPS, PIOPS, ephemeral), best mount options for XFS / ext4, block sizes for optimal EBS & mysql use, comparisons of filesystems for the networked EBS volumes, whether to stripe across EBS volumes or not, whether to use LVM or mdadm, etc.

    It seems like there are some definite gaps in mysql stats for identifying database or filesystem hotspots, like monitoring I/O contention between different subsystems (transaction logs vs. binlog vs. relay log vs. tablespace files). While I suspect that independent volumes should help performance, it’s not clear how much of a performance impact this can have, given the added complexity of managing consistent snapshots across multiple volumes.

    Percona server has some great additions to improve visibility into mysql performance, as a few of your articles have shown. I’m curious if you have any further suggestions for monitoring / tuning filesystem performance, particularly for filesystems on NAS / iSCSI / EBS. For example:

    – does block size matter when creating XFS filesystems on AWS EBS volumes (given 16KB for the mysql innodb tablespace vs. different block sizes for transaction logs)? Or tuning OS read-ahead via mount options to match mysql’s block sizes?

    – outside of tools like iostat, how can one identify I/O contention within mysql? specifically, which files are involved when contention occurs?

    – what’s the best way to tune buffer size to avoid excessive dirty page flushing (where large writes might cause long locks / wait times for db clients)?

  4. MatteoP says:

    This post stimulated me to play a little bit with numbers I’ve always overlooked, as I’m not a sysadmin nor a mysql DBA.
    I found it very instructive.
    This is my analysis:

    Single thread;
    [ 80s] threads: 1, tps: 9.30, reads/s: 130.20, writes/s: 0.00 response time: 171.82ms (95%)

    Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
    dm-0 0.00 0.00 127.90 0.70 4070.40 28.00 31.87 1.01 7.83 7.52 96.68

    transactions analysis:
    throughput X=9.3
    number of requests in system N= 1
    N=XR -> R=N/X = 1/9.3 = 108ms (172ms 95th percentile) response time per transaction

    reads:
    reads/transactions = 130.2/9.3 = 14 reads/t

    response time per read:
    1/130.2 = 7.7 ms/read = svctm = await, since there is no queue: sysbench seems to start a new request as soon as the previous one completes (synchronous) and it is running single-threaded.
    avgqu-sz is the average number of requests in the system, including the one in service; with this synchronous single-threaded load it cannot be smaller than 1

    Also avgrq-sz is 32 sectors -> 16KB as expected from mysql

    utilization = arrivals * service time = 127.9*.00752 = 96.1%
    that means: given this available capacity( 1/0.00752 = 133 reads/s) the working resource is 96% busy.
    It is not computing the other idle resources (other 3 disks)
    If you know that the io subsystem is composed of 4 disks all capable to provide the same data at the same rate then you can divide that busy pct by 4 -> 24%
    If the io subsystem is composed by four disks storing different data and sysbench requests data always from the same disk then the throughput won’t increase over 10 tps no matter
    how many threads you run. In this case that busy percentage makes a lot of sense to me.

    With 4 threads:
    [ 110s] threads: 4, tps: 21.10, reads/s: 295.40, writes/s: 0.00 response time: 312.09ms (95%)

    Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
    dm-0 0.00 0.00 295.40 0.90 9372.80 35.20 31.75 4.06 13.69 3.38 100.01

    transactions
    X=21.1
    N=4
    R=4/21.1 = 190ms (312ms 95th percentile) response time per transaction

    reads:
    reads/transactions = 295.4/21.1 = 14 reads/t this seems constant

    Here something magical happens: service time decreases; of course, this is due to parallel processing of read requests (it seems 2 disks work in parallel, as svctm is half that of the single-threaded test)
    service time per read:
    1/295.4 = 3.38 ms/read = svctm; await is 4 times svctm as the avgqu-sz = 4.

    utilization = arrivals * service time = 295.4*.00338 = 100%

    Why are only 2 disks serving the requests when there are 4 threads making requests?
    Maybe it depends on the kind of queries executed by sysbench; maybe it is requesting data stored on only 2 disks. It would be interesting to see which queries sysbench is running and to understand the disk layout.

    With 64 threads
    [ 70s] threads: 64, tps: 42.90, reads/s: 600.60, writes/s: 0.00 response time: 2516.43ms (95%)

    Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
    dm-0 0.00 0.00 601.20 0.80 19065.60 33.60 31.73 65.98 108.72 1.66 100.00

    R = 64/42.9 = 1492ms (2516ms 95th percentile)

    Utilization 600*1.66 = 100%

    svctm halved again, meaning all 4 disks are serving requests

    Going back to the original question:
    when is an io subsystem overloaded in an io-bound workload?
    My answer is:
    If I care about response time, then when load makes the response time break my SLA
    If I care about throughput, then when adding more load does not make throughput increase

    In a parallel io subsystem, having %util = x just means that it is completely idle (100 - x)% of the time.
    With this type of benchmark I would expect to always see a very high %util unless the number of requests is limited (--max-requests != 0)
