Often I run into problems when trying to use mk-query-digest with tcpdump on “very” busy hosts. You might be thinking, “very busy is a relative and unquantifiable term,” and you’d be right, so I’ll phrase this differently. Let me give a little background to the problem first. Mk-query-digest tries to handle dropped or missing packets gracefully, but even so they can skew results dramatically. Imagine a situation where a single connection sends two queries and gets two responses, with a response time of R for each request, and a wait time of W between the requests. If the first response and second request are dropped by the kernel, the result – from mk-query-digest’s perspective – is that the database took 2R + W time to respond to the first request.
Back to the question of, “what is a very busy host?” In my experience, if you are getting even 5% of tcpdump packets dropped by the kernel, the results can be skewed enough to cause confusion about which queries are really slow. Recently, I got more than 60% dropped packets on a server with roughly 50MB/s of traffic on port 3306, system load of about 10, and 8 CPU cores. The resulting mk-query-digest output was obviously bogus when compared to the host’s slow-query-log (for example, none of the top 5 slow queries reported by mkqd appeared in the actual slow log file). After a little brain-storming, we came up with a few solutions:
#1 has an obvious flaw — if your long-query-time is 1 second, and mkqd believes that a query which actually took 10ms instead took 0.9s, the results are still useless. That is to say, this doesn’t actually solve the real problem of dropped packets, it just applies a mask to the output. #2 seems like the simplest good solution, but when I tested this, I still got very high percentage of dropped packets (around 30% when filtering only 4 out of hundreds of active clients). While this is lower than without the filter, it is still unusable. #3 actually worked very well and resulted in about 0.2% packet loss on this host, which is acceptable — the variances are statistically smoothed out and don’t noticeably affect the results. Here is the tcpdump command used.
|
1 |
tcpdump -i eth0 -s 65535 -x -n -q -tttt 'port 3306 and tcp[1] & 7 == 2 and tcp[3] & 7 == 2' |
I also had to forward the tcpdump output to another host for processing because the database host couldn’t handle the additional IO or CPU pressure of either writing it to a file or piping it to mk-query-digest. Here is a draft of a script to automate this; use at your own risk and update to suit your needs.
|
1 |
#!/bin/bash<br># Set some defaults<br>d=$(date +%F-%T | tr :- _)<br>self=`basename $0`<br>REMOTEHOST='some.other.hostname' # CHANGEME<br>LIMIT=50                                  # limit # of queries in report<br>SLEEPTIME=1200                            # duration to capture tcpdump data in seconds (1200 = 20 min)<br>TMP='/tmp'<br>LOCKFILE="$self.lockfile"<br>TMPfile="$self.temp"<br>RESfile="$self.result"<br><br># check lock file<br>if [ -e "$TMP/$LOCKFILE" ]; then<br> echo "$self: lock file $LOCKFILE already exists, aborting"<br> exit 1<br>fi<br><br># set trap to be sure tcpdump doesn't run for ever<br># and clean up the temp file too<br>trap 'rm -f $LOCKFILE; kill $PID; ssh $REMOTEHOST -- "rm -f $TMP/$TMPfile"; exit' INT TERM EXIT<br>touch $TMP/$LOCKFILE<br><br># run the tcpdump & write to remote file and sleep for a bit<br>tcpdump -i eth0 -s 65535 -x -n -q -tttt 'port 3306 and tcp[1] & 7 == 2 and tcp[3] & 7 == 2' 2>/dev/null <br> | ssh $REMOTEHOST -- "cat - > $TMP/$TMPfile" &<br><br>PID=$!<br>sleep $SLEEPTIME<br>kill $PID<br><br># set trap to be sure both remote files are removed<br>trap 'ssh $REMOTEHOST -- "rm -f $TMP/$TMPfile $TMP/$RESfile"; rm -f $LOCKFILE $RESfile; exit' INT TERM EXIT<br><br># digest the result, copy to localhost, then email it<br>ssh $REMOTEHOST -- "mk-query-digest --type tcpdump --limit $LIMIT < $TMP/$TMPfile 2>&1 > $TMP/$RESfile"<br>scp -q $REMOTEHOST:$TMP/$RESfile $RESfile<br><br># email $RESfile using your preferred transport<br><br># clean up remote and local files.<br>ssh $REMOTEHOST -- "rm -f $TMP/$TMPfile $TMP/$RESfile"<br>rm -f $RESfile $LOCKFILE<br><br>trap - INT TERM EXIT<br>exit 0 |
Resources
RELATED POSTS