October 25, 2014

Is your server’s performance about to degrade?

I’ve been talking and writing a bit lately about the scaling problems I’m seeing on fast servers running lots of queries. As a rough guide, I’m seeing this on servers running 20k queries per second and higher, with lots of memory and lots of CPU cores, where most queries run faster than one millisecond; some in the 50 to 100 microsecond range.

At such speeds, how would you know if your server froze for half a second? This is not a hypothetical question. If your server routinely freezes for half a second, then it’s occasionally going to freeze for much longer than that, so it’s important to know whether it’s freezing in short bursts. Add just a bit more data, or a bit more load, and suddenly things can drop off a cliff.

I had a customer a while back whose server suddenly took a 50x performance hit, overnight. He didn’t know of anything that had changed. Reducing the load by half through the load balancer did not restore proper response time, so it wasn’t load. It was probably a tipping point in the data size, or something like that, which suddenly caused the query cache to become an enormous bottleneck.

The query cache is an easy villain. There are many others. There are mutexes that protect global things like the thread counter and other shared data.
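
If you suspect the query cache specifically, one way to rule it out is to disable the cache at runtime and see whether the stalls go away. A minimal sketch, assuming a version where these variables are dynamic:

    # Disable the query cache on a live server. Note that resizing the
    # cache takes the same global mutex, so do this at a quiet moment,
    # and make the change permanent in my.cnf if it helps.
    mysql -e "SET GLOBAL query_cache_size = 0; SET GLOBAL query_cache_type = 0"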

These bottlenecks are going to get more attention as people run into them more often. I feel really hopeful that Oracle will solve these issues, and it won’t be a moment too soon.

In the meantime, how do you know if this is happening to you? Here’s how I diagnosed it on one customer’s server. The server is running a Percona build of 5.0.84, so I have microsecond timing and extended statistics in the slow query log. I logged a couple of gigabytes of queries at a 0-second threshold (all queries) and crunched the log with mk-query-digest. Side note: bravo to Daniel for figuring out how to make mk-query-digest parse 2GB of slow query log with only 37MB of memory, and to our generous sponsor who paid for that work!
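
If you want to try the same procedure, it looks roughly like this. This is a sketch, assuming a patched build that accepts a zero (or sub-second) long_query_time; the paths and durations are placeholders:

    # Log every query for ten minutes, then restore the old threshold.
    # Note: in 5.0/5.1, existing connections keep the threshold they had
    # when they connected; only new connections pick up the change.
    mysql -e "SET GLOBAL long_query_time = 0"
    sleep 600
    mysql -e "SET GLOBAL long_query_time = 1"

    # Crunch the log into a report, worst query classes first.
    mk-query-digest /var/log/mysql/slow.log > digest-report.txt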

From this report, I was able to see queries that are mostly very fast but occasionally much slower, and (here’s the magic) the byte offset of the worst performer in the log. Then I just peeked back into the raw log and looked near that location.
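
For example, if the report said the worst sample of a query class was at byte 1029381720 (a made-up offset for illustration), you can jump straight to that region of the raw log:

    # Start reading the slow log at that byte offset and inspect the
    # surrounding entries.
    tail -c +1029381720 /var/log/mysql/slow.log | head -100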

Here’s what I found:

Suddenly these queries (commands like Quit and Statistics, which don’t touch any tables) are all taking almost exactly the same (very long) time to execute. To give some context, normally they take 75 microseconds. I can see it clearly when I scan the slow query log: there are just places where the log is full of these queries all taking 0.24 seconds, or all taking 0.30 seconds, or whatever. It’s obvious to me that they are all waiting for something, and then they suddenly get what they’re waiting for and complete. I’d be able to do more interesting analysis if the timestamp of the query itself were written in the log with microsecond granularity, too.
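
One quick way to spot those clusters, assuming the patched slow-log format with “# Query_time:” lines, is to round the times and count how often each value occurs:

    # Histogram of query times rounded to hundredths of a second; spikes
    # at odd values like 0.24 or 0.30 stand out immediately.
    grep '^# Query_time:' slow.log \
      | awk '{printf "%.2f\n", $3}' \
      | sort -n | uniq -c | sort -rn | head -20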

It’s interesting to note that while these Quit; and Statistics; and similar no-tables, no-storage-engine queries are varying from 50 microseconds to half a second, *real* queries that actually touch real tables and fetch data from them (with functions like NOW() that disable the query cache) are executing in no more than 12 microseconds, worst-case. So the contention that’s blocking these queries for shockingly long times is not to be blamed on InnoDB; it’s purely in the server layer. There’s a chance I’m wrong on that; it’s asking a lot to know all this code, and I don’t know it all, but I’m 80% confident that Quit; is handled purely in the server layer.

This is rather hard to catch in the act, but we’re thinking about ways to do just that. Catching a server during a micro-freeze and capturing diagnostic data just in time is a poor man’s way to identify what’s going on. It’s possible that nothing can be done about it, and that’s where improvements to the MySQL source code are necessary. But in many cases something can be done, such as disabling or reconfiguring some part of the server.
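
Here’s a rough sketch of that poor-man’s approach; the probe interval, threshold, and file names are arbitrary placeholders. Time a trivial query in a loop, and when it takes far longer than it should, grab whatever diagnostics you care about while the stall is still in progress:

    # Probe the server every 100ms with a trivial query; if it takes more
    # than 0.2s, assume we're inside a stall and capture diagnostics.
    while true; do
      start=$(date +%s.%N)
      mysql -e 'SELECT 1' >/dev/null
      elapsed=$(echo "$(date +%s.%N) - $start" | bc)
      if [ "$(echo "$elapsed > 0.2" | bc)" -eq 1 ]; then
        ts=$(date +%Y%m%d-%H%M%S)
        mysql -e 'SHOW FULL PROCESSLIST\G'     > "freeze-$ts-processlist.txt"
        mysql -e 'SHOW ENGINE INNODB STATUS\G' > "freeze-$ts-innodb.txt"
      fi
      sleep 0.1
    done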

Interestingly, the above is happening on a server that’s really NOT running that many QPS — less than 100 QPS on average, and the server is mostly idle. So just because the server isn’t pushing the limits of performance doesn’t mean it’s immune to these kinds of problems.

About Baron Schwartz

Baron is the lead author of High Performance MySQL.
He is a former Percona employee.

Comments

  1. Also keep an eye out for RAID cards testing their battery — they do that by disabling the battery-backed write cache in order to test-drain the battery. Can wreak havoc on a server “randomly”.

  2. I’ve also witnessed, in network-dependent topologies such as NDB or replication setups, that flaky Ethernet connectors or bad wiring can cause non-deterministic lag. Definitely a good idea to keep a close eye on latencies as well. We were able to significantly bring up QPS in a client setup by minimizing redundant intermediate network devices and replacing suspect wiring segments.

  3. Flavian says:

    @Baron Schwartz

    I’m having a similar problem. I’m running MySQL 5.1.32 with InnoDB plus Glassfish on the same server.

    The problem is that at the start of the transaction, the SELECT/INSERT/UPDATE processing runs fast, but after a few seconds each INSERT takes nearly up to 1 second per query. It’s only the inserts; the selects and updates run pretty smoothly. I first assumed it was an index problem while inserting, so I removed all the indexes and ran the transaction again, but still no luck; I only gained about a minute with no indexes.

    Can you please suggest what’s wrong? I’m using a number of tables, but the insert is only on one table.

  4. Arjen says:

    We had an entirely different reason to see freezes. We use an ext4 filesystem and have the binlog enabled to facilitate replication. Ext4 (at least with default settings) doesn’t flush data to disk as often as ext3 does, but when you explicitly sync data to disk, it might have to flush up to the entire file’s worth of data.
    AFAIK this behaviour has changed a bit over time, but when we were hit by it, it flushed the entire 1GB binlog from memory to disk at the moment MySQL issued the sync. And that sync is by default only issued inside the binlog-rotation code, within the global logging lock. I.e., it could stall for several seconds because the lock wasn’t released until the very long sync call completed.

    Obviously with faster syncs (i.e. because you sync your binlog more often or with another FS) the freezes will be much shorter, but they still may occur.
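
    One way to get those more frequent syncs is to fsync the binlog on every commit, so each sync has far less dirty data to flush. A sketch (sync_binlog=1 costs some throughput, so weigh the trade-off):

      # Sync the binary log after every commit instead of letting dirty
      # data accumulate until rotation.
      mysql -e "SET GLOBAL sync_binlog = 1"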

  5. Flavian, we are happy to help you with consulting, or on forums.percona.com.

    Arjen, there’s been a merry fight around the ext* filesystems over what has gained the notorious name “The great fsync() bug”; its behavior is terrible. For the last couple of years I have recommended xfs without hesitation. You will see that’s what Vadim runs all his high-performance benchmarks on, too. It’s hard to contemplate a high-performance database server on anything else these days (on Linux, I mean; there are other choices on different OSes).

  6. I believe it’s been mentioned in a post by Peter before –

    One graphed stat I like to read surrounding these sorts of problems is logical rows read/second (one of the last stats in SHOW INNODB STATUS).

    If performance is suddenly much worse (but these numbers remain about the same), then it may be time to investigate how the access pattern changed, or whether just a little bit of data growth caused far more cache misses.
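
    A crude way to watch that number over time (a sketch; the exact wording of the ROW OPERATIONS section can differ between versions, so treat the grep pattern as an assumption):

      # Print the "reads/s" line from SHOW ENGINE INNODB STATUS every 10s.
      while true; do
        mysql -e 'SHOW ENGINE INNODB STATUS\G' | grep 'reads/s'
        sleep 10
      done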

  7. Mrten says:

    Baron,

    I decided against XFS (picking JFS) for our servers when I bumped into these two bugs:

    https://bugzilla.redhat.com/show_bug.cgi?id=227331
    https://bugzilla.redhat.com/show_bug.cgi?id=240077

    I know those are from 2007, but since you’re recommending XFS ‘without hesitation’ I thought I’d ask: Have you ever (more recently) run into problems like those two bugs with XFS/LVM/MD? Or is it that I should just give up on LVM for my db-servers? :)

    thanks for any insight.

  8. I used to be a big fan of LVM, but that changed when a) Vadim’s benchmarks convinced me that it is more costly than I thought, and b) XtraBackup came along. I still think it’s nice technology, but I don’t think it’s vital anymore.

    I think that bugs from 2007 should not deter you. I cannot recall ever seeing problems with XFS.

  9. Rob Wultsch says:

    @Baron Schwartz
    Which benchmarks in particular?

  10. Vojtech Kurka says:

    I’ve faced such freezes too. It took me quite a long time to find that DROP/ALTER TABLE on InnoDB tables freezes the whole InnoDB engine for a few seconds. It’s shorter with a small buffer pool, but with a 40GB one it takes about 2 seconds, and the load doesn’t matter.

    Just look at bugs http://bugs.mysql.com/bug.php?id=51325 and http://bugs.mysql.com/bug.php?id=51325
    The fix is in 5.1.46. However, I haven’t tested it yet.

  11. Jon says:

    Actually, in my experience, you shouldn’t touch xfs on top of md or lvm. Only use it on top of hardware RAID. I’ve had multiple cases of worse bugs than the ones mentioned earlier with xfs on md, causing massive data loss. If you are in a position to run a more recent kernel than what comes with CentOS 5.x, things might be better.

  12. slavik says:

    Maybe this small tool will be useful to detect such micro-freezes; I wrote it to detect when the host is overloaded on a VM/VDS:
    http://sacc.cybersec.ru/files/lagtest.c
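
    The same idea can be sketched in a few lines of shell (this is not slavik’s tool, just an illustration, and the thresholds are arbitrary):

      # Sleep for a fixed interval and report whenever the wall clock
      # advanced noticeably more than that, which suggests the process
      # (or the whole VM) was stalled by the host.
      interval=0.1
      while true; do
        before=$(date +%s.%N)
        sleep $interval
        after=$(date +%s.%N)
        overshoot=$(echo "$after - $before - $interval" | bc)
        if [ "$(echo "$overshoot > 0.05" | bc)" -eq 1 ]; then
          echo "$(date '+%F %T') stalled ${overshoot}s beyond the ${interval}s sleep"
        fi
      done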
