It is very common for me to look at a customer’s database and notice a lot of overhead from checking whether a database connection is active before sending a query to it. This comes from the following design pattern, written in pseudo-code:
function query_database(connection, sql)
if !connection.is_alive() and !connection.reconnect() then
Many of the popular development platforms do something similar to this. Two things are wrong with this code: 1) it doesn’t actually work and 2) it has a large performance overhead.
It Does Not Work
This code doesn’t work because of a race condition. If the connection is alive when checked, there’s no guarantee that it is still alive by the time connection.execute(sql) is called. And if it’s not alive and gets reconnected, there’s still no guarantee.
It’s useless to check and then execute. Instead, the code should be rewritten something like this:
function query_database(connection, sql, retries=1)
catch InactiveConnectionException e
if retries > 0 then
retries = retries - 1
Notice that I completely removed the call to is_active(). If the connection is active, the query will succeed; if not, it’ll fail, reconnect, and retry.
This code also has the nice property that it also allows you to retry things like lock wait timeouts and deadlocks, if you so desire. These are areas where many applications can benefit a lot, in my experience. Most applications should simply retry these conditions, but instead they don’t handle them at all.
Checking whether the connection is active typically involves a call to the ‘ping’ or ‘statistics’ command at the MySQL protocol level, which will increment Com_admin_commands in SHOW GLOBAL STATUS, or a trivial query such as ‘SELECT 1′ which is usually harder to diagnose. This has a very high cost in many applications. There are two costs: the cost to the application of the network round-trip plus the query execution time, and the added load on the database server. The added load on the database server can be very high. A few days ago I saw a Ruby on Rails app where the ‘Administrator command: statistics’ command was taking over 40% of the server’s total query time. Eliminating the useless connection checks could have nearly halved the load on this database server. That’s not unusual!
When the application’s queries are long, the extra query is lost in the noise. But high-traffic apps put tremendous effort into getting query times down, and some highly tuned apps are worried when their queries take longer than a millisecond or so. When you’re running 20k queries per second against your database server, an extra 20k queries per second to check whether the connection is alive matters a lot. Those ‘statistics’ or ‘ping’ queries are nearly as expensive as the actual queries the application wants to run!
And that’s just the load on the database server. On the application side, you essentially see doubled query latency. Every time you want to run a query, your app framework is doing a network round-trip to the database to check the connection, then another network round-trip to run the query. Again, this matters a lot.
The problem is that the bad pseudo-code I showed above penalizes the common case in favor of the uncommon case. Connections are usually alive and don’t need to be pinged and reconnected. A better approach is to use the same code that fixes the race condition. Again, if the connection is down, you’ll find out when you try to run the query. Until then, assume everything is OK and just run the queries.
I hope that the upstream maintainers of the offending libraries can find and solve these issues, because it’d be a big help to apps when they grow. It’s one of those things that works fine in the lab, and even in the field, until performance really starts to matter — and then it sticks out like a sore thumb.
Here’s another example of the impact of these silly queries:
# Rank Query ID Response time Calls R/Call Item
# ==== ================== ================ ====== ======== ===============
# 1 0x5E796D5A4A7D1CA9 10651.0708 73.1% 120487 0.0884 ADMIN STATISTICS
# 2 0x85FFF5AA78E5FF6A 1090.0772 7.5% 23621 0.0461 BEGIN
# 3 0x6E85B9A9C9FF813E 868.0335 6.0% 6923 0.1254 UPDATE scores
# 4 0xA3A0423749EC0E37 851.0152 5.8% 6020 0.1414 UPDATE user_datas
# 5 0x813031B8BBC3B329 822.0041 5.6% 23299 0.0353 COMMIT
# 6 0xA873BBC4583C4C85 278.4533 1.9% 6985 0.0399 SELECT users user_devices
That’s right, 73% of the server’s load is consumed by checking to see if the connection is still alive