Traditionally the most benchmarks are focusing on throughput. We all get used to that, and in fact in our benchmarks, sysbench and tpcc-mysql, the final result is also represents the throughput (transactions per second in sysbench; NewOrder transactions Per Minute in tpcc-mysql). However, like Mark Callaghan mentioned in comments, response time is way more important metric to compare.
I want to pretend that we pioneered (not invented, but started to use widely) a benchmark methodology when we measure not the final throughput, but rather periodic probes (i.e. every 10 sec).
It allows us to draw “stability” graphs, like this one
where we can see not only a final result, but how the system behaves in dynamic.
What’s wrong with existing benchmarks?
Well, all benchmarks are lie, and focusing on throughput does not get any closer to reality.
Benchmarks, like sysbench or tpcc-mysql, start N threads and try to push the database as much as possible, bombarding the system with queries with no pause.
That rarely happens in real life. There are no systems that are pushed to 100% load all time.
So, how we can model it? There are different theories, and the one which describes user’s behavior, is Queueing theory. In short we can assume that users send requests with some arrival rate (which can be different in the different part of day/week/month/year though). And what is important for an end user is response time, that is how long the user has to wait on results. E.g. when you go to your Facebook profile or Wikipedia page, you expect to get response within second or two.
How we should change the benchmark to base on this model ?
There are my working ideas:
- Benchmark starts N working threads, but they all are idle until asked to handle a request
- Benchmarks generates transactions with a given rate, i.e. M transactions per second and puts into a queue. The interval between arrivals is not uniform, but rather distributed by Exponential distribution, with λ = M. That how it goes if to believe to the Poisson process.
For example, if our target is arrival rate 2 queries per second, then exponential distribution will give us following intervals (in sec) between events: 0.61, 0.43, 1.55, 0.18, 0.01, 0.76, 0.09, 1.26, …
Or if we represent graphically (point means even arrival):
As you see interval is far from being strict 0.5 sec, but 0.5 is the mean of this random generation function. On the graph you see 20 events arrived within 9 seconds.
- Transactions from the queue are handled by one of free threads, or are waiting in the queue until one of threads are ready. The time waiting in the queue is added to a total response time.
- As a result we measure 95% or 99% response times.
What does it give to us? It allows to see:
- What is the response time we may expect having a given arrival rate
- What is the optimal number of working threads (the one that provides best response times)
When it is useful?
At this moment I am looking to answer on questions like:
– When we add additional node to a cluster (e.g. XtraDB Cluster), how does it affect the response time ?
– When we put a load to two nodes instead of three nodes, will it help to improve the response time ?
– Do we need to increase number of working threads when we add nodes ?
Beside cluster testing, it will also help to see an affect of having a side on the server. For example, the famous problem with DROP TABLE performance. Does DROP TABLE, running in separate session, affect a response time of queries that handle user load ? The same for mysqldump, how does it affect short user queries ?
In fact I have a prototype based on sysbench. It is there lp:~vadim-tk/sysbench/inj-rate/. It works like a regular sysbench, but you need to specify the additional parameter
tx-rate, which defines an expected arrival rate (in transactions per second).
There are some results from my early experiments. Assume we have an usual sysbench setup, and we target an arrival rate as 3000 transactions per second (regular sysbench transactions). We vary working threads from 1 to 128.
We can see that 16 threads give best 99% response time (15.74ms final), 32 threads: 16.75 ms, 64 threads: 25.14ms.
And with 128 threads we have pretty terrible unstable response times, with 1579.91ms final.
That means that 16-32 threads is probably best number of concurrently working threads (for this kind of workload and this arrival rate).
Ok, but what happens if we have not enough working threads? You can see it from following graph (1-8 threads):
The queue piles up, waiting time grows, and the final response time grows linearly up to ~30 sec, where benchmark stops, because the queue is full.
I am looking for your comments, do you find it useful?