# The two even more fundamental performance metrics

In a recent blog post, I wrote about four fundamental metrics for system performance analysis. These are throughput, residence time, “weighted time” (the sum of all residence times in the observation period — the terminology is mine for lack of a better name), and concurrency.

I derived all of these metrics from two “even more fundamental” performance metrics, which I’ll call the “basic” metrics. That’s what this post is about. But it doesn’t stop there, because my list of four fundamental metrics isn’t the only thing you can derive from the two basic metrics. In fact, these two basic metrics encode a tremendous amount of information, from which you can compute lots of other things. And by the way, it’s a good thing the two basic metrics really are basic, because beyond “fundamental” and “basic,” I am about out of descriptive words.

The metrics I explained in the last post were all about observing events and noting the point in time at which they occurred. That’s the essence of performance: understanding events at points in time. And in a system that accepts, processes, and replies to requests (queries), the most basic events we want to timestamp are the arrival and completion of the requests. So that’s it. Ladies and gentlemen (drum roll please), the two basic performance metrics:

1. Arrival time
2. Completion time

If we know the instant in time at which every request arrives and completes, and we observe the system over an interval (the observation period), we can compute response times of individual queries. This is the completion time minus the arrival time (a duration). We can also compute the average concurrency, throughput, response time, and utilization during the observation period: see the previous post for how to compute these practically for free.
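To make that concrete, here's a minimal sketch in Python of deriving those metrics from nothing but arrival/completion pairs. The data is invented, and I'm treating the first arrival and last completion as the observation period:

```python
# A minimal sketch, with made-up data: times are in seconds, and each
# request is an (arrival_time, completion_time) pair.
requests = [
    (0.10, 0.35),
    (0.20, 0.40),
    (0.50, 0.55),
    (0.90, 1.20),
]

t0 = min(a for a, c in requests)            # start of observation period
t1 = max(c for a, c in requests)            # end of observation period
period = t1 - t0

response_times = [c - a for a, c in requests]
weighted_time = sum(response_times)         # the "weighted time" metric

throughput = len(requests) / period         # completions per unit time
avg_response = weighted_time / len(requests)
avg_concurrency = weighted_time / period    # Little's law: N = X * R
```

Notice that concurrency falls out for free: the sum of response times divided by the length of the observation period, exactly as in the previous post.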

However, we can also compute a number of other very interesting things, because we have the atomic data, not just the aggregated data that I showed how to maintain cheaply in the previous post. Here are some things we can do:

1. Plot the data as time-series or in scatter plots. It might not seem useful, but it certainly is, because it can make it obvious when something is really an aberration and you need to throw away part of the data set to clean it up for modeling.
2. Plot arrivals versus completions. I’ll show an example shortly.
3. Analyze the distribution of arrival times. This is necessary to apply more advanced types of queueing analysis to the data, such as using the Erlang C function to predict how the queue time will vary with utilization. You have to validate that the inter-arrival times are correctly distributed (briefly: that the arrivals are random, and not dependent on each other) before you know whether it’s valid to apply such a model.
4. Analyze the distribution of response times. You can compute far more than just the averages we’ve seen how to compute up to this point. You can calculate the standard deviation, the median, any desired percentile (depending on how large your sample is), the variance, the variance-to-mean ratio (a.k.a. the index of dispersion, an amazingly useful metric for discovering periods of poor performance), and more.
5. Analyze the distribution of service times. Service time is not the same as response time, which includes time spent waiting in the queue. If you measure a period when concurrency and utilization are sufficiently low, you can approximate service time. For that matter, if you discard all requests that overlap and hence eliminate queueing, you can actually measure service time directly. Why would you want to compute the distribution of service times? You need this for all kinds of queueing analysis. You need it for basic queueing theory, as well as more advanced models such as the Erlang functions, which require that you know the distribution of service times. Just as with arrival times, you can’t feed any old data into these models; they work only when specific conditions are true.
6. Analyze the distribution of completion times. This is extremely useful for contention analysis. Suppose that requests arrive randomly at the server; that means they’re uncorrelated, which is typically the case. Suppose that service times are exponentially distributed, which is also typical in most systems such as database servers. As a result, completion times shouldn’t be correlated either. But if there is lock contention inside the server, some requests will block others, and you will see completion times cluster together and become correlated. In general, if there is a drop in completions for some period of time, followed by a cluster of completions, there is a decent chance that the first completion in the cluster was a request that blocked all of the others by holding a resource they needed.
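As a sketch of item #4, here's how you might compute those distribution statistics with nothing but the Python standard library. The sample is synthetic (drawn from a hypothetical exponential distribution with a 50-millisecond mean), standing in for real measured response times, and the percentile is computed naively by sorting:

```python
import random
import statistics

# Synthetic stand-in for measured response times: 10,000 samples from a
# hypothetical exponential distribution with a 50 ms mean.
random.seed(42)
response_times = [random.expovariate(1 / 0.05) for _ in range(10_000)]

mean = statistics.mean(response_times)
median = statistics.median(response_times)
stdev = statistics.stdev(response_times)
variance = statistics.variance(response_times)
p99 = sorted(response_times)[int(0.99 * len(response_times))]

# Variance-to-mean ratio, a.k.a. the index of dispersion. For an
# exponential distribution with mean m the variance is m^2, so this
# comes out close to m; a sustained departure from the expected value
# is a hint that something about the workload has changed.
index_of_dispersion = variance / mean
```

Computing these over a sliding window, rather than the whole observation period, is what makes the index of dispersion useful for spotting periods of poor performance.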

I could actually go on. There’s just so much you can do with arrival and completion times. Some things require you to know which arrivals and completions belong in pairs, but you can do a lot even without knowing that. Maybe some people can post their favorite techniques in the comments.

I’ll show you one technique that is interesting, and doesn’t even require that you know which arrivals and completions are paired up. This is item #2 in my numbered list above. Let’s suppose that arrivals are uncorrelated and randomly distributed, and service times are exponentially distributed. This is typical. In other words, there will be a steady but variable “white noise” of incoming requests, and most requests will take about the same time to process and send back the response, with a “long tail” of requests that take longer than normal to complete. Suppose that the typical response time is under a millisecond — maybe it’s 50 microseconds, maybe a bit longer. This is common in MySQL servers I work with when they’re running a simple read-mostly workload with most data in memory.

If we slice the arrivals and completions into buckets, a second at a time, and count how many there are in each bucket, we should see roughly the same number of arrivals and completions, would you agree? Suppose the server handles 20k requests per second. A few of the requests will arrive in one second and complete in the next second, so at the end of the interval we’ll see a couple of extra arrivals, but at the start of the interval we’ll also see about the same number of extra completions from the previous interval. And anyway, the number of requests that span the transition between seconds will be small.

So if we count the arrivals and completions in each second, and plot them as curves in a time series, the two curves should match pretty closely. What if we narrow the observation windows and aggregate every tenth of a second? It’ll probably look pretty similar. But if we continue to aggregate at a finer and finer level, at some point two things will happen: 1) the curves will become jaggedly different from each other due to the random differences in completion times, caused by differences in response times; and 2) the curves should still look roughly like each other, but the completions will be sort of an echo of the arrivals, lagged by about the amount of the typical response time (which will usually be clustered tightly, given that this is an exponential distribution).
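The bucketing itself can be sketched in a few lines of Python. The timestamps here are hypothetical, in integer milliseconds (integers keep the bucket arithmetic exact), with each completion about 50 ms after its arrival:

```python
from collections import Counter

# Hypothetical timestamps in integer milliseconds; each completion
# trails its arrival by roughly 50 ms.
arrivals_ms =    [10, 20, 150, 160, 170, 300]
completions_ms = [60, 70, 200, 210, 220, 350]

def bucket_counts(timestamps, width_ms):
    """Count events per bucket of the given width."""
    return Counter(t // width_ms for t in timestamps)

# With one-second buckets the two curves are identical:
print(bucket_counts(arrivals_ms, 1000) == bucket_counts(completions_ms, 1000))  # True
# With 50 ms buckets, completions lag arrivals by one bucket: the "echo".
print(bucket_counts(arrivals_ms, 50))
print(bucket_counts(completions_ms, 50))
```

At the coarse width the curves match exactly; at the fine width the completion counts are the same shape as the arrival counts, shifted one bucket to the right.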

And now for the interesting part: what if there are parts of the data where this isn’t the pattern we see?

Let’s look at some real data and see what this looks like. I’ve grabbed some TCP packets from a real server, and merely classified the packets as inbound or outbound based on the IP address. I haven’t cleansed them, or ensured that I have a single outbound packet for every inbound one, or anything like that. All I’m working with here are packet counts and directions, and the timestamps. I’ll pretend that an inbound packet is an arrival, and that an outbound packet is a completion. This isn’t really true, but after you see the plots, you’ll agree that it’s close enough for government work. First, I’ll aggregate arrivals and completions in one-second buckets:

It’s a little hard to see because of the pixelation, but the curves follow each other pretty closely, with the red peeking out from behind the green here and there. If I aggregate this data in tenths of a second, I get the following plot:

That’s getting pretty crowded, but just trust me when I say that you can zoom into it and the curves still match pretty closely. If I aggregate it by hundredths of a second, the plot is completely unreadable unless I zoom in on it. I’ll spare you the zoomed-out look and simply zoom into one particular region that I found to have something interesting:

You can clearly see that there is a spike of inbound packets, which by itself might not be noteworthy. What’s interesting is that the outbound packets don’t follow the same spike. Instead, there’s a spike of outbound packets a few hundredths of a second later. This is the “echo” effect that I mentioned before.

What are we looking at? It might seem that this is a small server stall due to a lock or some other queue, followed by the logjam clearing out, but that’s not really what happened. The server didn’t stall; it merely had a number of queries with longer service times arrive in a clump. If it had stalled due to some lock, the outbound packets would have dropped off sharply, leaving a big notch in the green curve, followed by a spike in the green curve. If the stall were long enough, the red curve (inbound) would also have dropped off sharply, because all the application servers that would normally be sending queries to the server would be stopped, waiting for existing queries to complete. That didn’t happen.

What probably happened here was something like a small cache stampede. The application sent a burst of extra queries (not just the normal number of queries), and these ran longer than usual. This is not evidence of a misbehaving server, but rather looks more typical of a misbehaving application.

We see this echo effect because the aggregation interval is a lot shorter than the response time of those queries. And we don’t see it in the rest of the curve because the typical response time is a lot shorter than a hundredth of a second. We’re straddling the two worlds with the aggregation interval. So as we narrow the aggregation interval, we’ll start to find the place where we get some “static”, sort of like de-tuning the radio just a little bit (if you remember analog radios… ah, when you could build your own from a Radio Shack kit).

What does this mean? It means that I need to go back to the customer and ask whether they care. But in general, most applications are going to show this kind of pattern if you look closely enough. Most workloads aren’t completely uniform, and as you slice them more finely, you see jagged edges. Most servers have micro-stalls (which I didn’t show here), and all you have to do is look closely enough to see them. One question is whether it’s bad enough to matter. Real systems have queueing — it’s just the way the world is. Arrivals clump up, response times vary, and queueing happens. You can certainly obsess over this too much if you want to. On the other hand, maybe there are some interesting questions to ask. For example, can we discover these aberrations with a program, instead of by looking at a graph? This is often harder than it seems. But if we can do that, is there some benefit in tracking the roughness and echo-effect of the graph over time, as a performance early warning? Perhaps there is.

I won’t dig deeply into it in this blog post, because it’s too long already. My intention was to show just one of the things you can do with arrival and completion data (actually, just an approximation of that data, not even the real thing), without even doing simple things such as figuring out which arrivals and completions are paired and belong together. And I wanted to show that a huge amount of information can be extracted from these simple input metrics.

I would be remiss if I didn’t also mention that there’s a lot of information we’re still lacking, which we’d often like to have. What was the query, for example? With the text of the SQL in hand, we might be able to answer questions about what’s happening in that little spike. We could also classify queries together and analyze the classes of queries, rather than lumping all requests into one bucket. These are the types of tasks that mk-query-digest is good at doing, and there is merit in gathering more data and analyzing a richer dataset. But it’s nice to know that even with a very simple set of data, we can learn a lot of complex and useful things about a system. Arrival and completion events are the only two things you need to know to derive pretty much everything else about a system’s performance (via measuring response times), scalability (via the four fundamental metrics I mentioned in the previous blog post), and much more.

If you come to Percona Live in New York in a few weeks, I’ll be speaking about this — and I’ll go far beyond where I left off in this blog post.

1. Baron Schwartz says

Actually, there is a set of several queues in front of the database: the networking buffers and devices between client and server, terminating with TCP/IP kernel buffers, which is probably where most of the queueing takes place. Now I need to learn more to determine at what point in time tcpdump sees the packets.

2. Baron Schwartz says

Yes, I agree. Sometimes when there is something like HAProxy or a connection pool in front of MySQL, there is a sharper line between queueing and servicing, but that’s at a higher level than just the server alone.

3. Peter Zaitsev says

Baron,

If you define it just as the time when a request enters the server, then why would being the only request matter? How are multiple executing requests, from the user’s standpoint, different from background activities executed by the database engine or operating system?

I believe the service time vs. response time distinction is most helpful when the system has some form of explicit queue. For example, if we look at an IO scheduler, there is the time a request enters the queue, the time it is submitted for execution, and the time it is completed, which allows us to see the queueing in some meaningful way. At the same time, we know multiple requests can be submitted to the device at once and internal queueing can happen, which means the service time as seen by the RAID controller will be different from the low-level service time of the drive (head movement and read/write for a given request).

4. Baron Schwartz says

The formal definition of the service time is the time consumed once the request enters the server, regardless of queueing on subsystems. But I agree with you, ideally the service time we’re looking for is that which we would get if all systems and resources the request needs to exercise were instantly available.

However, even then, we have variations. Assume the system is completely idle and we run a query — suppose it hits the disks to fetch some blocks. Run it again and the caches are warm, and the code path is different. No queueing in either case, just different code paths, but there will be different service times.

5. Peter Zaitsev says

Baron, you mention response time vs. service time and speak about measuring it when there is only one request running. I think it is more complicated than that in practice, because even if you have just one request from the user’s standpoint, InnoDB might happen to be doing IO to the same hard drive, for example an insert buffer merge. The hardware itself may queue your request because it is doing some internal work at that point. If you take a large enough magnifying glass, you’ll see that you can’t detect all cases of queueing to know the true service time.

This is why in most cases we deal not with real service times but with response times that have some known queueing removed, such as waiting on table locks, row-level locks, etc.

6. bob says

I’m already logging send and completion time from within the app when my benchmarking routine kicks off (1/10000 pages). I suppose it would be fairly trivial to awk and gnuplot the data. Hmmmm….