I’ve been talking and writing a bit lately about the scaling problems I’m seeing on fast servers running lots of queries. As a rough guide, I’m seeing this on servers running 20k queries per second and higher, with lots of memory and lots of CPU cores, where most queries run faster than one millisecond; some in the 50 to 100 microsecond range.
At such speeds, how would you know if your server froze for half a second? This is not a hypothetical question. If your server routinely freezes for half a second, then occasionally it’s going to freeze for much longer than that, so it’s important to know whether those short freezes are happening. Add just a bit more data, or a bit more load, and suddenly things can drop off a cliff.
I had a customer a while back whose server suddenly took a 50x performance hit, overnight. He didn’t know of anything that had changed. Halving the load through the load balancer did not restore proper response time, so load wasn’t the culprit. The server had probably crossed a tipping point in data size, or something like that, which suddenly made the query cache an enormous bottleneck.
The query cache is an easy villain. There are many others. There are mutexes that protect global things like the thread counter and other shared data.
These are going to get more attention as people run into them more. I feel really hopeful that Oracle will solve these issues, and it won’t be a moment too soon.
In the meantime, how do you know if this is happening to you? Here’s how I diagnosed it on one customer’s server. The server is running a Percona build of 5.0.84, so I have microsecond logging in the slow query log, and extended statistics in the same log. I logged a couple of gigabytes of queries at a 0-second threshold (all queries) and crunched it with mk-query-digest. Side note: bravo to Daniel for figuring out how to make mk-query-digest parse 2GB of slow query log with only 37MB of memory, and to our generous sponsor who paid for that work!
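That memory figure makes sense once you consider how such a tool can work: it never needs to hold the log in memory, only a small running summary per query class. Here’s a minimal sketch of that streaming idea; this is not mk-query-digest’s actual code, and the fingerprinting and statistics are greatly simplified:

```python
import re

def fingerprint(query):
    """Collapse literals so similar queries group together (very simplified)."""
    q = re.sub(r"\d+", "?", query.lower())
    return re.sub(r"'[^']*'", "?", q)

def digest(lines):
    """Stream slow-log lines, keeping only per-class aggregates in memory."""
    stats = {}  # fingerprint -> [count, total_time, max_time]
    current_time = None
    for line in lines:
        m = re.match(r"# Query_time: (\d+\.\d+)", line)
        if m:
            current_time = float(m.group(1))
        elif not line.startswith("#") and line.strip() and current_time is not None:
            fp = fingerprint(line.strip())
            c = stats.setdefault(fp, [0, 0.0, 0.0])
            c[0] += 1
            c[1] += current_time
            c[2] = max(c[2], current_time)
            current_time = None
    return stats

# Tiny demo on a fabricated log fragment:
log = """\
# Query_time: 0.000075 Lock_time: 0.000000
SELECT * FROM t WHERE id = 1;
# Query_time: 0.306486 Lock_time: 0.000000
SELECT * FROM t WHERE id = 2;
"""
print(digest(log.splitlines()))
```

Because the per-class summary is tiny and the log is read line by line, memory use stays flat no matter how big the file gets.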
From this report, I was able to see queries that have mostly very fast performance, and occasionally have much worse, and — here’s the magic — the byte offset of the worst performer in the log. Then I just peeked back into the raw log and looked near that location.
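That “peek near the offset” step is just a seek into the raw file. Here’s a sketch of it, with a throwaway file standing in for the slow log (the file name, window sizes, and offset are all made up for illustration):

```python
import os
import tempfile

def peek(path, offset, before=200, after=400):
    """Return the raw log text surrounding a byte offset reported by the digest tool."""
    with open(path, "rb") as f:
        start = max(0, offset - before)
        f.seek(start)
        chunk = f.read(before + after)
    return chunk.decode("utf-8", errors="replace")

# Demo: write a fake slow log, then peek near a byte offset in it.
with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as f:
    f.write("# Query_time: 0.000075\nSELECT 1;\n" * 100)
    path = f.name
print(peek(path, 1800))
os.unlink(path)
```

In practice you can get the same effect with `tail -c` on the command line; the point is that the byte offset turns “find the needle in 2GB” into a constant-time lookup.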
Here’s what I found:
# User@Host: webuser[webuser] @ [10.14.82.196]
# Thread_id: 3329398998 Schema: app_db
# Query_time: 0.306486 Lock_time: 0.000000 Rows_sent: 0 Rows_examined: 0 Rows_affected: 0 Rows_read: 0
# administrator command: Quit;
# User@Host: webuser[webuser] @ [10.14.82.196]
# Thread_id: 3329398985 Schema: app_db
# Query_time: 0.304674 Lock_time: 0.000000 Rows_sent: 0 Rows_examined: 0 Rows_affected: 0 Rows_read: 0
# administrator command: Quit;
Suddenly these queries, which don’t touch any tables, are all taking almost exactly the same (very long) time to execute. To give some context, normally these take 75 microseconds. I can see it clearly when I scan the slow query log: there are just places where the log is full of these queries all taking 0.24 seconds, or all taking 0.30 seconds, or whatever. It’s obvious to me that they are all waiting for something, and then they suddenly get what they’re waiting for and complete. I’d be able to do more interesting analysis if the timestamp of the query itself were written to the log with microsecond granularity, too.
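One way to spot these stalls without eyeballing gigabytes of log is to scan for runs of consecutive queries whose Query_time lands in the same bucket. A rough sketch of that idea follows; the bucket size (hundredths of a second), run length, and floor are arbitrary thresholds I picked, not anything standard:

```python
import re

def find_stalls(lines, min_run=3, floor=0.1):
    """Flag runs of consecutive queries whose Query_time rounds to the same
    hundredth of a second, a hint they were all released from one wait."""
    stalls, run, prev = [], 0, None

    def flush():
        # Record a run only if it is long enough and slow enough to matter.
        if run >= min_run and prev is not None and prev >= floor:
            stalls.append((prev, run))

    for line in lines:
        m = re.match(r"# Query_time: (\d+\.\d+)", line)
        if not m:
            continue
        b = round(float(m.group(1)), 2)
        if b == prev:
            run += 1
        else:
            flush()
            prev, run = b, 1
    flush()
    return stalls

# Demo on fabricated Query_time lines: a fast baseline, then a stall.
log = (["# Query_time: 0.000075"] * 5
       + ["# Query_time: 0.306486",
          "# Query_time: 0.306512",
          "# Query_time: 0.305980"]
       + ["# Query_time: 0.000080"] * 4)
print(find_stalls(log))
```

The output here is a single stall cluster around 0.31 seconds, three queries deep, which is exactly the shape of the pattern in the log excerpt above.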
It’s interesting to note that while these Quit; and Statistics; and similar commands, which touch no tables and no storage engines, are varying from 50 microseconds to half a second, *real* queries that actually touch real tables and fetch data from them (with functions like NOW() that disable the query cache) are executing in no more than 12 microseconds, worst-case. So the contention that’s blocking these queries for shockingly long times is not to be blamed on InnoDB; it’s purely in the server layer. There’s a chance I’m wrong on that; it’s asking a lot to know all this code, and I don’t know it all, but I’m 80% confident that Quit; is handled purely at the server level.
This is rather hard to catch in action as it happens, but we’re thinking about ways to do just that. Catching a server having micro-freezes, and capturing diagnostic data just-in-time, is a poor man’s way to identify what’s going on. It’s possible that nothing can be done about it — and that’s where improvements to the MySQL source code are necessary. But in many cases something can be done, such as disabling or reconfiguring some part of the server.
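The “poor man’s” approach can be as simple as a loop that times a trivial probe and dumps diagnostics the moment a stall shows up. Here’s a sketch of the idea, with a `time.sleep` standing in for the probe query and a `print` standing in for diagnostics collection; in practice the probe would be something like `SELECT 1` against the server, and the capture step would grab `SHOW PROCESSLIST` output and similar:

```python
import time

def watch(probe, capture, threshold=0.05, interval=0.01, rounds=50):
    """Run probe() repeatedly; when one call takes longer than threshold
    seconds, call capture() to grab diagnostics just in time."""
    events = []
    for _ in range(rounds):
        start = time.perf_counter()
        probe()
        elapsed = time.perf_counter() - start
        if elapsed > threshold:
            events.append(elapsed)
            capture(elapsed)
        time.sleep(interval)
    return events

# Demo: a fake probe that stalls once, and a capture that just prints.
calls = iter([0.0] * 3 + [0.2] + [0.0] * 6)
events = watch(lambda: time.sleep(next(calls)),
               lambda t: print("stall of %.3fs detected" % t),
               rounds=10)
```

The just-in-time part is the capture callback: by the time a human notices a freeze, the evidence is gone, but a watcher like this can snapshot the server state while the stall is still in progress.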
Interestingly, the above is happening on a server that’s really NOT running that many QPS — less than 100 QPS on average, and the server is mostly idle. So just because the server isn’t pushing the limits of performance doesn’t mean it’s immune to these kinds of problems.