Last night a customer had an emergency in selected machines on a large cluster of quite uniform database servers. Some of the servers were slowing down in a very puzzling way over a short time span (a couple of hours). Queries were taking multiple seconds to execute instead of being practically instantaneous. But nothing seemed to have changed. No increase in traffic, no code changes, nothing. Some of the servers got sick and then well again; others that seemed to be identical were still sick; others hadn’t shown any symptoms yet. The servers were in master-master pairs, and the reads were being served entirely from the passive machine in the pair. The servers that were having trouble were the active ones, those accepting writes.
The customer had graphs of system metrics (++customer), and the symptoms we observed were that Open_tables went sharply up, the buffer pool got more dirty pages, InnoDB started creating more pages, and the percentage of dirty pages in the buffer pool went up. In addition, samples of SHOW INNODB STATUS showed a long pending flush list (hundreds), and usually one pending pwrite.
After talking through things for a bit, I mentally eliminated a number of possibilities. For example, there was no shared storage; servers were on different network segments and in different racks; and so on. I had two remaining possibilities.
- The data had changed subtly, pushing it over some threshold. I’ve seen this before — for example, a query cache bottleneck, or an insert buffer bottleneck, or any of a variety of other things.
- Rather than looking for differences between the servers, it might be fruitful to look at what was uniform about them.
The next step would be to measure precisely what was happening, but before even doing that, I wanted to chase down a small hunch that would only take a minute. I asked the customer whether the servers were all the same. Were they in the same shipment from the manufacturer? Did they have hard drives from a single manufacturing batch? It turned out that the servers were all part of a single shipment, and not only that, but they had been rebooted simultaneously pretty recently. I asked the customer to check the RAID controller quickly on one of the failing machines.
The result: the battery-backed cache was in write-through mode instead of write-back mode, and the battery was at 33% and charging. We quickly checked a few other sick machines. Same thing — write-through, battery charging. And the battery backup unit was set to run through auto-learn cycles periodically, which is generally the factory default (these are Dells, with PERC / LSI MegaRAID controllers.)
I opened my quick-reference text file and found the command we needed to disable auto-learn, (++text files and saving links to reference material), sent that over the Skype chat, and the customer tried that. We weren’t sure the auto-learn could be stopped in the middle, but it worked. They hung up and hurried to push this out with Puppet before it hit hundreds of other machines.
I have heard stories before of uniform hardware causing a cluster of problems, such as a bad batch of hard drives all failing close together, but this is the first time that I remember being involved in such a case myself. It was kind of fun that I didn’t actually have to get all measurement-y and scientific about the problem after all.