We’ve published our first case study. The customer, ideeli, had a database that was struggling on standard MySQL and InnoDB. The big win was the upgrade to XtraDB. The business continued to grow quickly, and months later under much more traffic, the database is still outperforming their previous version.
I thought I’d write a few notes that didn’t seem appropriate to include in the case study, because this was a fun project that might be interesting to readers.
As usual, it was all about diagnosing the problem correctly. I used a variety of tools to help with this, foremost among them “stalk” and “collect” from Aspersa. There were several problems, not just one, and they required different techniques to diagnose. This can be hard when the problems are happening sporadically and/or mixed together. You really need to be disciplined and collect data, data, data. If you are not sure about the cause of something, you don’t have the right data. Maybe you have too little, or too much, or you have the signal mixed in with the noise. Knowing when and how to get and interpret good diagnostic data is easily 95% or 98% of the work in a case like this. All I had to do was wait until the problem happened, look at the diagnostics, and a couple minutes later I had my answer.
What were the problems? The query cache was causing both mutex contention and excessive CPU usage, for different reasons, and I found different problems in different samples. InnoDB was also dying under mutex contention. Each spike of slow queries I found was caused by different things. Sometimes GDB stack traces showed InnoDB mutex contention, sometimes oprofile showed the query cache hogging the CPU, and so on. So we had to solve all the problems, not just some of them.
The graphs of query traffic and response times were from data I gathered with tcprstat. I also used the data from tcprstat to analyze the variation in query response time. One-second intervals is a relatively fine granularity, but at that level you can see better when micro-freezes are occurring. I used ad-hoc slow-query-log analysis with awk and other tools to discover and investigate unusual patterns, and figure out whether queries were causes or victims of performance problems. The problems here were not caused by queries, but query behavior was the symptom that we could observe, so all of the above analysis was useful for detecting the problem as it happened, and verifying that it was not still happening after we implemented fixes.
New Relic was a very helpful tool in this case, too. If you don’t use New Relic, you might try it. (We don’t get paid to say that.) Their tools are really nice.
I also want to mention that this database’s problems were entirely inside the database software itself. The ideeli team had already done a great job with indexing, query optimization, and so forth. Nothing more could be done without fixing these problems inside MySQL and InnoDB at the source code level, or changing the application architecture.
All things considered, the database server’s performance is not as high as many I’ve worked on, so the absolute numbers of queries per second may not look impressive. However, remember that this database is running on an EC2 server. EC2 has relatively slow virtual CPUs, and given that and the workload this server is under, it does very well. Of course you could far exceed that performance on a “real” server.
This case illustrates why average-case performance isn’t a good metric. As Peter says, “the average temperature of patients in the hospital isn’t important.” Good performance means avoiding outliers and variations. The query performance needs to be a) fast, and b) fast all the time, and c) the same kind of fast all the time. Variations in performance at one layer introduce cascading variations at each higher layer. Rarely is a stuttering component’s misbehavior absorbed by other layers. Instead, every layer above it gets into trouble.
That, among other things, is why the database has a much harder job than people sometimes realize, and why it’s so hard to write a good database server.