We receive many requests for help with server stalls. They come under various names: lockup, freeze, sudden slowdown. When something happens only once or twice a day, it can be difficult to catch it in action. Unfortunately, this often leads to trial-and-error approaches, which can drag on for days (or even months), and cause a lot of harm due to the “error” part of “trial-and-error.” At Percona we have become skilled at diagnosing these types of problems, and we can solve many of them quickly and conclusively with no guesswork. The key is to use a logical approach and good tools.
The process is straightforward:
- Determine what conditions are observably abnormal when the problem occurs.
- Gather diagnostic data when the conditions occur.
- Analyze the diagnostic data. The answer will usually be obvious.
Step 1 is usually pretty simple, but it’s the most important to get right. I estimate that about 80% of the cases I work on show up as abnormally high values of Threads_connected or Threads_running. That’s a good place to start for finding a symptom that can trigger the problem. If you aren’t sure what “normal” looks like, and therefore what’s abnormal, then gather samples of things you can observe (such as SHOW STATUS snapshots) and establish both baseline behavior and deviations from it.
The hard part used to be step 2. However, over time the Percona team members have built a good set of tools for doing this on Linux. It is no harder in principle on other OSes, but we have so many customers on Linux that we end up scripting mostly Linux-centric things, so a little tweaking to the tools might be needed on other platforms. The rest of this blog post is about how to set up and use the tools.
The tools are all part of the Aspersa toolkit. There are three primary tools for troubleshooting intermittent problems: stalk, collect, and sift. As the names suggest, they are used to watch the server, gather diagnostic data when the condition occurs, and help you look through it quickly (there’s a lot of it). You can get the tools easily. The best thing to do is put them into root’s home directory; they will gather the most data possible if they’re run as root. Running under sudo is an option too. Assuming root, here’s how to do it:
mkdir -p bin
chmod +x stalk collect sift
Now you need to edit the stalk file and tell it your definition of “abnormal” — by default it has some vanilla settings that aren’t useful. (There is no such thing as a good default, sorry.) Let’s assume that your server normally has only a few Threads_running, and when the problem happens, it spikes up into the range of 20 or higher. A safe trigger might be 15, if your server never reaches 15 under normal conditions. The more carefully you determine normal versus abnormal, the higher quality your results will be. Find and change the following lines:
# This is the max number of <whatever> we want to tolerate.
# This is the thing to check for.
You need to change the threshold to 15, and the variable to Threads_running. Check the other settings, but this is often all you need. Finally, make sure that the script can log into MySQL (if necessary, put a password into .my.cnf, or in the MYSQLOPTIONS variable). Now you can simply start running bin/stalk in a screen session. It shouldn’t do anything but print a timestamp line every 30 seconds. Come back later and see what it found, or set the EMAIL setting to notify you.
What’s so special about the collect tool? Isn’t [insert your favorite OS’s awesome tool that Linux doesn’t have] better? What’s nice about collect is that it gathers many different types of data, including different angles on the same data. No single tool, no matter how powerful, is an answer for every problem. For example, the collect tool makes it easy to gather snapshots of waiting (stack traces) and CPU consumption (oprofile). That’s important because sometimes the problem is because of blocking and waiting to do work, and sometimes the problem is doing too much work without blocking. Likewise, it gathers lots of I/O data, because sometimes the problem is that the disks are not responding within reasonable and expected tolerances, and sometimes the disks are responding very well but the server is abusing them and overloading them. And sometimes, of course, the CPUs and disks have nothing to do with anything, which is why the collect tool also gathers information about network connections, and so on. If you assume a particular type of trouble in advance, you’ll be wrong sometimes, and it’s back to trial-and-error. You need to gather all the data, and then look at it and see which bits are diagnostic and which bits show nothing relevant.
After you’ve gathered some data, you need to look through it. It is stored in /root/collected/ by default. Use the “sift” tool to examine it. I’ll demo on my laptop:
[baron@ginger]$ sift collected
2011_02_28_14_51_38 2011_03_01_02_52_52 2011_03_01_14_29_35
2011_03_01_18_38_09 2011_03_02_08_36_25 2011_03_02_11_33_59
Select a timestamp from the list: 2011_03_02_15_10_03
It’s prompting you to choose which sample of data to analyze. There are 7 samples, and it chooses the most recent by default, so you can just hit the Return key and start browsing the files. The tool is keystroke-driven, Vim-style, so you can navigate the samples with the ‘j’ and ‘k’ keys. I won’t repeat the official documentation here. Give it a try and see how it works; press the ‘?’ key for a list of the keystroke commands you can use. If you find that one of the samples looks suspicious, you can drill into it with a few of the other Aspersa tools, or ask sift to start ‘less’ on a sample’s files so you can look at them. You can analyze the files without sift, but it saves a lot of typing, especially when you have dozens or hundreds of samples to examine.
The sift tool isn’t the final step in analysis. By default it just shows you a few of the highest-level summary tidbits of information, and you often really need to drill into a particular sample’s files and analyze it to see what’s going on. Sometimes it’s glaringly obvious even at a high level, but not always. Most of the time the problem is clear upon a closer inspection, though.
If the answer is not obvious, then maybe you have a false positive and didn’t really catch the problem in action; or collect isn’t gathering the information needed to diagnose (pretty unlikely — it gathers a ton of stuff); or you don’t know how to interpret what you have found. Within Percona, the best course of action is often to ask a more senior person to help. I ask people such as Peter, Vadim, and Yasufumi for help all the time. If you’re not a Percona support/consulting staff member, then of course you can hire us.
If I get a few spare hours, I will put together some screencasts illustrating how to use these tools. I think that’ll be an easier and more fun way to learn them. I have some interesting cases that I can use as demos, after I scrub them for sensitive details. When I finish these, I’ll post them on Percona.TV, our MySQL video/screencast blog site.
Another interesting topic is how to examine a system that appears to be fine and determine whether it is experiencing any very short stalls that you are not aware of. There are many things that can cause a stall in MySQL, and they usually begin microscopically and get worse over time, sometimes abruptly worse. I’ll try to write more about this as time permits, too.