October 22, 2014

Checking for a live database connection considered harmful

It is very common for me to look at a customer’s database and notice a lot of overhead from checking whether a database connection is active before sending a query to it. This comes from the following design pattern, written in pseudo-code:

Many of the popular development platforms do something similar to this. Two things are wrong with this code: 1) it doesn’t actually work and 2) it has a large performance overhead.

It Does Not Work

This code doesn’t work because of a race condition. If the connection is alive when checked, there’s no guarantee that it is still alive by the time connection.execute(sql) is called. And if it’s not alive and gets reconnected, there’s still no guarantee.

It’s useless to check and then execute. Instead, the code should be rewritten something like this:

Notice that I completely removed the call to is_active(). If the connection is active, the query will succeed; if not, it’ll fail, reconnect, and retry.

This code also has the nice property that it also allows you to retry things like lock wait timeouts and deadlocks, if you so desire. These are areas where many applications can benefit a lot, in my experience. Most applications should simply retry these conditions, but instead they don’t handle them at all.

Performance Overhead

Checking whether the connection is active typically involves a call to the ‘ping’ or ‘statistics’ command at the MySQL protocol level, which will increment Com_admin_commands in SHOW GLOBAL STATUS, or a trivial query such as ‘SELECT 1′ which is usually harder to diagnose. This has a very high cost in many applications. There are two costs: the cost to the application of the network round-trip plus the query execution time, and the added load on the database server. The added load on the database server can be very high. A few days ago I saw a Ruby on Rails app where the ‘Administrator command: statistics’ command was taking over 40% of the server’s total query time. Eliminating the useless connection checks could have nearly halved the load on this database server. That’s not unusual!

When the application’s queries are long, the extra query is lost in the noise. But high-traffic apps put tremendous effort into getting query times down, and some highly tuned apps are worried when their queries take longer than a millisecond or so. When you’re running 20k queries per second against your database server, an extra 20k queries per second to check whether the connection is alive matters a lot. Those ‘statistics’ or ‘ping’ queries are nearly as expensive as the actual queries the application wants to run!

And that’s just the load on the database server. On the application side, you essentially see doubled query latency. Every time you want to run a query, your app framework is doing a network round-trip to the database to check the connection, then another network round-trip to run the query. Again, this matters a lot.

The problem is that the bad pseudo-code I showed above penalizes the common case in favor of the uncommon case. Connections are usually alive and don’t need to be pinged and reconnected. A better approach is to use the same code that fixes the race condition. Again, if the connection is down, you’ll find out when you try to run the query. Until then, assume everything is OK and just run the queries.

I hope that the upstream maintainers of the offending libraries can find and solve these issues, because it’d be a big help to apps when they grow. It’s one of those things that works fine in the lab, and even in the field, until performance really starts to matter — and then it sticks out like a sore thumb.

(Edit)

Here’s another example of the impact of these silly queries:

That’s right, 73% of the server’s load is consumed by checking to see if the connection is still alive :-)

About Baron Schwartz

Baron is the lead author of High Performance MySQL.
He is a former Percona employee.

Comments

  1. Hi Baron!

    “If the connection is active, the query will succeed; if not, it’ll fail, reconnect, and retry.”

    uhm, looks like you forgot to actually retry – I’d expect the catch block to contain a recursive call to query_database read:

    catch InactiveConnectionException e
    if retries> 0 then
    retries = retries – 1
    connection.reconnect()
    query_database(connection, sql, retries=1)
    else
    throw e
    end
    end

  2. Ingo says:

    Hi,

    I second that! We are using apache commons-dbcp for pooling connections to a mysql database. After setting the config “testOnBorrow” from “true” (default) to “false” the load on the database server dropped dramatically. This config specifies if a connection test should be done before using a connection for the actual query.

    So at least for commons-dbcp, although switched on by default, it’s quite easy to get rid of this behaviour, by just setting “testOnBorrow” to “false”.

    Ingo

  3. jmartin says:

    I can understand that applications really shouldn’t do that – what about monitoring apps like Nagios?

  4. Perrin Harkins says:

    I agree that checking every time you run a query is wasteful. However, checking once at the beginning of each request seems pretty reasonable. On websites that don’t have continuous high traffic, there will often be lulls, allowing some of the connections to time out. Checking once at the beginning of each request fixes this problem.

  5. Roland, the ‘while true’ block avoids recursion inside an exception handler ;-)

  6. Rob Wultsch says:

    @Baron
    Is the title ripped off from the current PG Hackers thread or just a coincidence?

  7. Coincidence. I am way behind on pg mailing lists :-(

  8. @Baron – darn – missed that :(

    Thanks.

  9. @Ingo, @Baron,
    Another method is to do a periodic eviction of non working connections.
    It’s not entirely safe; but I find it does the job very well. It is optimistic, of course.
    Yet another higher level test would be that if some server fails (e.g. it’s a slave, and it’s too far behind; or perhaps it does not respond at all), you can collectively remove from pool all connections to that server.

  10. waterguo says:

    retry doesn’t work for 3 reasons: (1) it breaks transaction (2) large number of retires kill app server (3) platform dependent, in general developers hate to write apps only works on one type of database

    and the example used in this blog actually hints a case of bad programming. supposedly after a tested connection is retrieved from connection pool, a good programmer will use it for 10 or more things before returning it to pool so the overhead of testing becomes marginal. getting the connection from the pool for every single sql is bad practice.

    test connection is not fail-proof but it helps a lot on reducing the number of failures.

    yes you can ask the app server not to test the connection. then the trade off is very likely you will see an increase of failed operations. imaging the admin restart your database or reset the router/firewall, all of sudden you see hundreds of failures. this is not acceptable in most organizations.

  11. Andy says:

    Baron.

    Which libraries have you seen this problem in?

  12. waterguo, I think it is easy to over-generalize. In many cases I see apps that follow exactly the patterns you are talking about — get a connection, use it for lots of things, release it. But my point here is that the underlying framework is making assumptions on behalf of the app, which turn out not to be even remotely true for lots of apps, and I mean LOTS. The framework should not be forcing such decisions on the user. It was nice to see in one of the comments an example of a framework that permits configuration, which is good.

    And in MySQL, you are wrong; retry does not break the transaction. (It is true in versions of MySQL that reached end-of-life many years ago, but not true anymore.) Retrying lock wait timeouts and deadlocks is the right thing to do in any sanely recent version of MySQL. Your point #3, that a lot of retries kills the app server, is missing the point: one retry is all that’s needed, and app servers are cheap and easy to scale, but database servers are very expensive and hard to scale.

  13. Andy, whatever Ruby On Rails is doing by default is a glaringly easy example, but I have not dug into the code in a while, so I have forgotten where it is. A number of our clients have patched it to fix that, but I haven’t kept track of those patches.

    I’ve also seen this in Perl apps based on Class::DBI or similar, and I think the culprit was Ima::DBI.

  14. Perrin Harkins says:

    Good point about Ima::DBI. I can fix that pretty simply for mod_perl-based apps and make a new release. If anyone knows how to fix it generically for other runtime environments like FastCGI, let me know.

  15. I am not sure that Ima::DBI is to blame. Does the generated closure from _mk_db_closure get called for *every* query, or is that called only when a new connection is created? Maybe I should discuss this on a mailing list, but I hate to subscribe just for this :-)

    I’m glad that you are interested in changing the code. Now all I have to do is figure out who maintains the Ruby code that causes the phenomenon I see in Ruby On Rails, beginning with “figure out what code causes it” :-) and then contact that person and suggest the change. Or maybe someone else can do this.

  16. axl says:

    yes please, which code in rails is responsible and, assuming it’s not going to get patched any time soon (if ever), how have people worked around it?

  17. Peter Zaitsev just asked for clarification on an internal Percona mailing list, and I thought I’d post my reply here.

    MySQL used to roll back the whole transaction on a lock wait timeout, but since 5.0 it only rolls back the latest statement:

    http://dev.mysql.com/doc/refman/5.0/en/innodb-parameters.html#sysvar_innodb_rollback_on_timeout

    There are cases where it rolls back the whole transaction (deadlock is one). I don’t think we need to care about explicit ROLLBACK, only what happens when there is an error.

    I would not lump together the “deadlock/timeout” retry behavior with the “connection died” behavior. They should be separate.

    At my last job I remember that we created a derived class of Ima::DBI (or whatever the responsible code is, I can’t remember now) and overrode its behavior to do what we wanted. Having retries handled consistently in the DB access layer can save lots of application code. The key is to make it configurable! As I remember, we handled timeout, deadlock, and other cases like warnings that we wanted to escalate to errors without changing SQL_MODE (so we could enable strictness in dev and run our test suite; changing those things in production would have broken everything). The code ended up being quite complex — at least 80 or 100 lines I think, with quite delicate logic. Definitely better to be centralized.

  18. Kim Carlsen says:

    Deadlocks and dead connection are difficult to handle in the DB layer as they will rollback the entire transaction and not just the last statement,
    so the DB layer would have to re-execute all statements in the whole transaction and assume that no result would change the execution path of the code.

    If you are only handling ‘lock wait timeout’ by retrying the transaction, wouldnt that be the same as increasing the timeout.
    E.g. trying the statement 3 times vs trying the statement 1 time but wait 3 times longer for locks to be released?

  19. Brian Cavanagh says:

    If you fail on a transaction for a connection related issue, you DO NOT retry. You must investigate. A transaction that goes bad on your end may have succeeded, and if you just try again as an auto procedure it is very possible you may be damaging your relational model silently. I can remember many cases where this very same philosophy has destroyed customer relationships when their CC gets hit 10k times. I can say that it has happened to me at popular stores like Target at the checkout line. Every connection failure must be investigated (or work through case) by the app (or admin) to ensure relational integrity. You ignore errors at your own peril.

  20. Brian Cavanagh says:

    Hey Kim,

    Figure out exactly what your data is supposed to look like and then execute all the sql you need at once in one block. Otherwise, re-executing your transactions become problematic.

  21. Brian, Kim,

    Good points, made very clearly. Thanks for adding clarity.

    I was too vague when I said “Most applications should simply retry these conditions, but instead they don’t handle them at all.” I should have said that many applications that don’t really use transactions at all (again, there are LOTS of these, and most of them are the super-high-load ones I’m talking about) but use InnoDB for its other nice properties, are doing “single-statement transactions” in auto-commit mode by default. For the vast majority of such apps, they should retry.

    However, again I am not advocating that the DB layer should be forcing a “retry by default” on the application. It needs to be configurable, so the application can get the DB layer to do the right thing. And most of the apps I’m talking about should set the DB layer to retry by default.

    Note: even if I DID advocate a retry-by-default strategy, it’d be no worse than the current reconnect-if-dead strategy, which is going to break transactional requirements just as spectacularly in the same kinds of ways you’ve explained. If the connection goes away after I insert the item into the invoice line-item table, then I charge the credit card and insert the this-invoice-is-paid row, I’ve just inserted a payment for items that got rolled back and thrown away. Customers are even less happy about that kind of nonsense — but that’s exactly what Ruby On Rails might be doing today in a lot of cases.

  22. Ashley Martens says:

    This problem shows itself in Ruby on Rails if you are using a connection pool. When a connection is checked out of a pool it is checked if it is active. Under heavy load this starts to cause problems on the DB and the “admin statistics” command starts appearing in your slow query log.

  23. Great post. Thanks

    How did you generate the final query ranking table?

  24. corsair says:

    What was the query or program you used to get the ranking at the bottom of the post?

  25. That is part of the default report from mk-query-digest.

  26. waterguo says:

    Baron, I was trying to verify that mysql is able to resume transaction in reconnect. In JDBC standard, there is no mentioning of reconnect. In mysql jconnector document (http://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-configuration-properties.html), I found a property called autoReconnect. Its purpose seems like identical to what you try to promote though I am not sure. But it also notes that after reconnect, old transaction is gone.

    Here is what I think. If reconnect is able to resume the session, would it make more sense to build this capability into driver? Because I cant think of any case that I don’t want to reconnect at app level.

    Secondly, I am trying to image how this is can be implemented on the server side. In order to achieve this, server code will need to remember the whole session which includes locks, dirty pages, cursors etc. And for how long? This feature won’t come without a price tag, will it?

  27. Short answer: if you disconnect, the transaction is rolled back.

  28. J says:

    Well written article.

    I think it is important to point out that the retry logic, as written, is incomplete and dangerous. The standard “best practices” for retry logic should include the following:

    * Back-off: Reducing the rate of accessing the downstream resource as reoccurring failures.
    * Bounded: At some point the retries should stop. :)
    * Jitter: Introduce randomness to ensure multiple callers aren’t getting into lock-step and pounding the downstream resource.

    Ignoring any one of these will inevitably turn some bad situations into even worse ones.

  29. I would consider this type of continual query attempt to be harmful in many instances. I prefer that applications simply error if the connection becomes stale or breaks — I would like not to have partial and corrupt data in most instances.

  30. Jay says:

    Roland,

    He does retry to connect as many times as specified in the parameter.
    Because of the while loop it will try to reconnect and requery

  31. Brigham says:

    Thanks for the interesting article, especially the info about what goes on in the server when you make a check. I think the check at any point is really useless for the most part. I would not perform a retry either because if the connection is down something is really wrong. I would report the error back to the consumer of the query and let it decide how to handle.

  32. Haile says:

    Yes, this is a general principle that applies outside databases. Checking whether a file exists before trying to open it is the classic example of LBYL and has the same problems as the database example.

  33. Syd says:

    Depends on the actual implementation & overhead. It may or may not be faster to manually check rather than relying on an exception, which can be heavy, depending on your platform.

  34. Syd, that’s not true. See above about “common case.” If the uncommon case (a dead connection) is rare, which it should be, then a “heavy exception” that happens once in a few million queries is going to be a lot cheaper than a “cheap check” that happens once every query.

  35. sys says:

    The Oracle database adapter shipped with the Sun (now Oracle) Java CAPS (Composite Application Platform System) actually used to get a sorted list of users in the database, for checking if a connection is alive.

    After much frantic yelling at Support we got it down to _merely_ doing something similar to select 1 from dual.

  36. Bottomless says:

    Shouldn’t there be some sort of sleeping introduced in the code? If so for how long? (or better a back-off algorithm as suggested by “J”)

    Otherwise it will just result in a storm of queries in a very short period of time, not allowing the system to recover from probably a temporary overload (and probably adding to the overload too) that caused the first query to fail in the first place.

    Thanks

Speak Your Mind

*