Often enough I find MySQL benchmark results where the difference between results is 1% or even less, and yet conclusions are drawn from them. It is not that 1% is unimportant: especially when you’re developing the product, you should care about those 1% improvements or regressions because they tend to add up. However, with such a small difference it is very important to understand whether it is real or just the natural variance of your baseline test.
Take a look at this graph:
This is the result of a simple in-memory, read-only “select by primary key” SysBench benchmark on dedicated physical hardware that is otherwise idle, a simple single-socket system. I tried to stabilize it as much as possible, for example by disabling CPU frequency scaling. Still, I saw a difference of about 3% between “good” runs and “bad” runs.
What is the difference between those runs? Simply mysqld restarts.
Does this mean you can’t measure a smaller difference? You can, by setting up an appropriate test plan. Often it makes sense to do several runs; in other cases you need to make sure the system warms up before taking measurements, or make the benchmark runs long enough. Whatever method you use, it is a good idea to validate your test methodology by conducting several runs of your baseline to ensure the results are stable enough for your purpose. For example, if I decide to do five 30-minute runs and average the results, and they all fall within 0.1% of each other, I will consider a 0.3% difference meaningful.
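The baseline check described above can be sketched in a few lines of Python. This is only an illustration, not a statistical test: the function names are made up for this example, "within 0.1%" is interpreted as (max - min) / mean of the baseline runs, and the factor of 3 simply mirrors the 0.1% versus 0.3% rule of thumb.

```python
from statistics import mean

def relative_spread(results):
    """Spread of repeated runs as a fraction of their mean: (max - min) / mean."""
    return (max(results) - min(results)) / mean(results)

def is_meaningful(baseline_runs, candidate_runs, safety_factor=3.0):
    """True if the averaged candidate differs from the averaged baseline by at
    least safety_factor times the baseline's own run-to-run spread.
    safety_factor=3.0 is an assumption taken from the 0.1% / 0.3% example."""
    base_avg = mean(baseline_runs)
    noise = relative_spread(baseline_runs)
    diff = abs(mean(candidate_runs) - base_avg) / base_avg
    return diff >= safety_factor * noise

# Hypothetical throughput numbers (queries per second) from five runs each.
baseline = [10000, 10005, 10010, 9998, 10002]
candidate = [10050, 10048, 10052, 10049, 10051]
print(f"baseline spread: {relative_spread(baseline):.4%}")
print("difference is meaningful:", is_meaningful(baseline, candidate))
```

If the baseline spread alone is already close to the difference you hope to detect, the test plan needs longer runs, more runs, or a better warmup before any comparison is worth making.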
Another practical trick that often helps me separate real differences from side effects is mixing the tests. Say I am testing configurations A and B: instead of running AAA BBB I would run ABABAB, which helps when there is some regression that accumulates over time, as with Flash.
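Generating the mixed order is trivial to script. A minimal sketch, where the function name and the configuration labels are placeholders for this example:

```python
def interleaved_schedule(configs, rounds):
    """Return the run order with configurations alternating each round,
    e.g. A B A B A B instead of A A A B B B."""
    return [config for _ in range(rounds) for config in configs]

print(interleaved_schedule(["A", "B"], 3))
```

With this order any slow drift over the course of the session affects both configurations roughly equally, rather than penalizing whichever one happened to run last.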
You should also note that on modern systems there is almost always something happening in the background that can change performance: the SSD is doing garbage collection, MySQL (or the kernel) is flushing dirty pages, or the CPU simply cools off and as a result can sustain Turbo Boost a little longer. So when you start running your benchmarks, make sure you keep the idle time between runs the same; scripting the benchmarks and iterating over scenarios helps here.
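A scripted driver makes the constant idle time easy to enforce. The sketch below is an assumption about how such a script might look, not a ready-made tool: run_benchmark is a hypothetical callable that you would supply yourself (for example, one that launches SysBench and returns the measured throughput), and the 60-second pause is an arbitrary placeholder.

```python
import time

def run_schedule(schedule, run_benchmark, idle_seconds, sleep=time.sleep):
    """Run each configuration in order, pausing for the same idle time
    before every run (including the first) so all runs start alike."""
    results = []
    for config in schedule:
        sleep(idle_seconds)  # identical pause before each measurement
        results.append((config, run_benchmark(config)))
    return results

if __name__ == "__main__":
    # Stand-in benchmark for illustration only; replace it with a real
    # function that runs your test and returns a number.
    def fake_benchmark(config):
        return {"A": 10000.0, "B": 10030.0}[config]

    # idle_seconds=0 keeps this demo instant; a real run might use 60.
    for config, result in run_schedule(["A", "B", "A", "B"], fake_benchmark, 0):
        print(config, result)
```

The sleep function is passed in as a parameter only so that the loop can be exercised without actually waiting; in real use the default time.sleep is what you want.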
Happy MySQL benchmarking!