The four fundamental performance metricsBaron Schwartz
There are many ways to slice and aggregate metrics of activity on a system such as MySQL. In the best case, we want to know everything about the system’s activity: we want to know how many things happened, how big they were, and how long they took. We want to know precisely when they happened. We want to know resource consumption, instantaneous status snapshots, and wait analysis and graphs. These things can be expensive to measure, and even more importantly, they can require a lot of work to analyze. And sometimes the analysis is only valid when very stringent conditions are satisfied. What metrics can we gather that provide the most value for the effort?
There’s a relationship known as Little’s Law, sometimes also referred to as the fundamentals of queueing theory, that lays out some simple properties between various metrics. One of the properties of Little’s Law is that it holds regardless of inter-arrival-time distributions and other complex things. And you can build upon it to create surprisingly powerful models of the system.
Let’s begin with a query. It arrives, executes for a bit, and then completes.
Then some time elapses, and another query arrives, executes, and completes.
To add the minimal amount of extra complexity, let’s say that another query arrived while query 2 was executing, and completed after query 2 completed. Now we have the following picture, fully annotated. The arrivals and completions are denoted by “a” and “c”.
If we start at the left end of the Time arrow, and call it the beginning of the observation period, and observe the system until the right end of the arrow, we have our observation interval. If we travel from left to right along the arrow, we can count the system’s “busy time” as the time during which at least one query is running. This is the period from 1a to 1c, plus the period from 2a to 3c.
It’s easy to compute this by just creating two counters and a last-event timestamp. Every time there is an arrival or completion, look at the counter of currently executing queries. If it is greater than 0, add the time elapsed since the last arrival or completion. Then increment the counter of currently executing queries if the current event is an arrival, and decrement it if it’s a completion. Finally, set the last-event timestamp to the current time.
Now, if we add another counter, we can get the “weighted busy time.” This counter is similar to the busy time, but instead of just adding the time elapsed since the last event, we multiply the time elapsed by the number of currently executing queries and add it. This is effectively the “area under the curve” — the blue-colored regions in the picture above.
Finally, we can simply count the number of arrivals or completions. It doesn’t matter which; Little’s Law is based on a long-term average, where the two are equal.
All together, we have the following four fundamental metrics:
- The observation interval.
- The number of queries in the interval.
- The total time during which queries resided in the system — the “busy time.”
- The total execution time of all queries — the “weighted time.”
We can maintain these counters in the system, and sample them at will by simply looking at the clock and writing down the three counters at the start and end of whatever observation period we choose.
What is this good for? We can use these together with Little’s Law to derive these additional four long-term average metrics:
- Throughput: divide the number of queries in the observation interval by the length of the interval.
- Execution time: divide the weighted time by the number of queries in the interval.
- Concurrency: divide the weighted time by the length of the interval.
- Utilization: divide the busy time by the length of the interval.
This is extremely powerful. These are the key metrics we need to model scalability with the Universal Scalability Law, perform queueing analysis, and do capacity planning. And we obtained all of this without needing to analyze anything complicated at all. This is just addition, subtraction, multiplication, and division of some constantly increasing counters!
There’s more. We derived our four long-term average metrics from the four fundamental metrics, and we derived those in turn from the arrivals and completions. In a future article, I’ll show you what else we can learn by observing arrivals and completions. And I’ll present on this at Percona Live, where I’ll show you how to tie it all together quickly to answer real-world questions about practically any system, whether it’s instrumented with the counters I mentioned or not, with shockingly little effort.