September 2, 2014

The perils of uniform hardware and RAID auto-learn cycles

Last night a customer had an emergency on some of the machines in a large cluster of quite uniform database servers. Some of the servers were slowing down in a very puzzling way over a short time span (a couple of hours). Queries were taking multiple seconds to execute instead of being practically instantaneous. But nothing seemed to have changed. No increase in traffic, no code changes, nothing. Some of the servers got sick and then well again; others that seemed to be identical were still sick; others hadn’t shown any symptoms yet. The servers were in master-master pairs, and the reads were being served entirely from the passive machine in the pair. The servers that were having trouble were the active ones, those accepting writes.

The customer had graphs of system metrics (++customer), and the symptoms we observed were that Open_tables went sharply up, the buffer pool got more dirty pages, InnoDB started creating more pages, and the percentage of dirty pages in the buffer pool went up. In addition, samples of SHOW INNODB STATUS showed a long pending flush list (hundreds), and usually one pending pwrite.
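For readers who want to graph the dirty-page percentage themselves, here is a minimal sketch in Python. It assumes you have already collected the counters Innodb_buffer_pool_pages_dirty and Innodb_buffer_pool_pages_total from SHOW GLOBAL STATUS into a dictionary; the function name and the sample numbers are made up for illustration and are not the customer's data.

```python
def dirty_page_pct(status):
    """Return the percentage of InnoDB buffer pool pages that are dirty.

    `status` maps status variable names to values, as strings or ints,
    for example the rows returned by SHOW GLOBAL STATUS.
    """
    total = int(status["Innodb_buffer_pool_pages_total"])
    dirty = int(status["Innodb_buffer_pool_pages_dirty"])
    if total == 0:
        return 0.0  # avoid dividing by zero on an empty sample
    return 100.0 * dirty / total


# Hypothetical samples taken before and during a slowdown.
healthy = {"Innodb_buffer_pool_pages_total": "1000",
           "Innodb_buffer_pool_pages_dirty": "50"}
sick = {"Innodb_buffer_pool_pages_total": "1000",
        "Innodb_buffer_pool_pages_dirty": "400"}

print(dirty_page_pct(healthy), dirty_page_pct(sick))
```

A rising value on a server whose write traffic has not changed suggests that flushing, not the workload, is falling behind.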

After talking through things for a bit, I mentally eliminated a number of possibilities. For example, there was no shared storage; servers were on different network segments and in different racks; and so on. I had two remaining possibilities.

  1. The data had changed subtly, pushing it over some threshold. I’ve seen this before — for example, a query cache bottleneck, or an insert buffer bottleneck, or any of a variety of other things.
  2. Rather than looking for differences between the servers, it might be fruitful to look at what was uniform about them.

The next step would be to measure precisely what was happening, but before even doing that, I wanted to chase down a small hunch that would only take a minute. I asked the customer whether the servers were all the same. Were they in the same shipment from the manufacturer? Did they have hard drives from a single manufacturing batch? It turned out that the servers were all part of a single shipment, and not only that, but they had been rebooted simultaneously pretty recently. I asked the customer to check the RAID controller quickly on one of the failing machines.

The result: the battery-backed cache was in write-through mode instead of write-back mode, and the battery was at 33% and charging. We quickly checked a few other sick machines. Same thing: write-through, battery charging. And the battery backup unit was set to run through auto-learn cycles periodically, which is generally the factory default (these are Dells, with PERC / LSI MegaRAID controllers).

I opened my quick-reference text file and found the command we needed to disable auto-learn (++text files and saving links to reference material), sent it over the Skype chat, and the customer tried it. We weren’t sure the auto-learn could be stopped in the middle, but it worked. They hung up and hurried to push this out with Puppet before it hit hundreds of other machines.
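The post does not show the command itself, so the following is only a sketch of what such a step might look like, written in Python so the pieces can be inspected before anything runs. It assumes the MegaCli utility and its -AdpBbuCmd options (-SetBbuProperties with a properties file, and -BbuLearn to start a cycle by hand), and it assumes that autoLearnMode=1 means "disabled"; check all of these against your controller's documentation. The binary name, the file path, and the helper names are hypothetical. The code only builds and prints the commands and does not execute them.

```python
import shlex

# Assumption: the binary name and location vary by install (MegaCli, MegaCli64, ...).
MEGACLI = "MegaCli"


def disable_autolearn(props_path="/tmp/bbu-props.txt", adapter="ALL"):
    """Return the properties-file text and the argv that would apply it."""
    props_text = "autoLearnMode=1\n"  # assumption: 1 means auto-learn is disabled
    argv = [MEGACLI, "-AdpBbuCmd", "-SetBbuProperties",
            "-f", props_path, "-a" + adapter]
    return props_text, argv


def start_learn_cycle(adapter="ALL"):
    """Return the argv that would start a learn cycle manually, e.g. from cron."""
    return [MEGACLI, "-AdpBbuCmd", "-BbuLearn", "-a" + adapter]


props_text, disable_argv = disable_autolearn()
print(props_text, end="")                # contents to write to the properties file
print(shlex.join(disable_argv))          # command that applies the properties file
print(shlex.join(start_learn_cycle()))   # command to schedule off-peak
```

Keeping the manual learn command next to the disable command is deliberate: switching auto-learn off without scheduling a replacement means the battery is never tested.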

I have heard stories before of uniform hardware causing a cluster of problems, such as a bad batch of hard drives all failing close together, but this is the first time that I remember being involved in such a case myself. It was kind of fun that I didn’t actually have to get all measurement-y and scientific about the problem after all.

About Baron Schwartz

Baron is the lead author of High Performance MySQL.
He is a former Percona employee.

Comments

  1. Dale Lancaster says:

    Very nice catch!

  2. Mark Callaghan says:

    Do they monitor RAID battery health?

  3. I did not think to ask that, but I would guess not.

  4. Gerry says:

    We saw similar issues while the controllers went through the battery discharge tests they run every 3 months to check the battery’s health. Thanks for the article.

    Cheers,
    G

  5. We have a number of servers with the same PERC controllers. I’ve noticed they fail into WriteThrough mode when the absolute capacity of the battery reaches 25%, so we began alerting on this metric. I’m curious if you recommended the customer disable the learning cycle entirely, or schedule it to run off-peak. I’ve often wondered if the frequency of these cycles has an impact on the life of the BBU – good to know the default is 90 days.

  6. I recommend disabling, and scheduling the cycles through cron. See the link in the body of the blog post for the command needed to manually start a cycle.

  7. George says:

    I have LSI controllers with BBU on workstation-based PCs and the battery learn cycle is very annoying. But I honestly never thought to tie it in with MySQL/server performance. Thanks for this article.

  8. J-F Mammet says:

    It’s pretty surprising that these customers are not monitoring their RAID controllers with Nagios or whatever. Aren’t they concerned about the health of the hard drives themselves? I use Dell servers, and simply have Nagios query the OpenManage daemon running on all of these servers to get back an overall health status of the server’s hardware. And it sure sends me some nice and annoying warnings every few months when the battery is doing its test cycle, like these:
    Notification Type: PROBLEM

    Service: Openmanage
    Host: DbMaster
    Address: 10.100.24.1
    State: WARNING

    Date/Time: 12-09-2010
    Additional Info: Cache battery 0 in controller 0 is Degraded (Non-Critical) [probably harmless]

    And normally:
    Date/Time: 12-09-2010
    Additional Info: OK – System: PowerEdge R710, SN: xxx, hardware working fine, 1 logical drives, 8 physical drives

    Anyway, I enjoy the blog, thanks for the articles :)

  9. Matt says:

    Do these people not check basics? On my FreeBSD machines I get the following in /var/log/messages (hostname, controller log entry# and timestamps removed for brevity):

    Sep 9 01:21:51 kernel: mfi0: (info) – Battery relearn will start in 4 days
    Sep 11 01:22:29 kernel: mfi0: (info) – Battery relearn will start in 2 day
    Sep 11 03:02:27 kernel: mfi0: (info) – Patrol Read started
    Sep 11 04:46:06 kernel: mfi0: (info) – Patrol Read complete
    Sep 12 01:22:15 kernel: mfi0: (info) – Battery relearn will start in 1 day
    Sep 12 20:21:56 kernel: mfi0: (info) – Battery relearn will start in 5 hours
    Sep 13 01:23:06 kernel: mfi0: (info) – Battery relearn pending: Battery is under charge
    Sep 13 01:51:16 kernel: mfi0: (info) – Battery relearn started
    Sep 13 01:52:21 kernel: mfi0: (info) – Battery is discharging
    Sep 13 01:52:21 kernel: mfi0: (info) – Battery relearn in progress
    Sep 13 03:09:36 kernel: mfi0: (WARN) – BBU disabled; changing WB virtual disks to WT
    Sep 13 03:09:36 kernel: mfi0: (info) – Policy change on VD 00/0 to [ID=00,dcp=01,ccp=00,ap=0,dc=1,dbgi=0] from [ID=00,dcp=01,ccp=01,ap=0,dc=1,dbgi=0]
    Sep 13 04:13:06 kernel: mfi0: (info) – Battery relearn completed
    Sep 13 04:13:21 kernel: mfi0: (info) – Battery started charging
    Sep 13 04:13:21 kernel: mfi0: (WARN) – Current capacity of the battery is below threshold
    Sep 13 05:54:06 kernel: mfi0: (info) – Current capacity of the battery is above threshold
    Sep 13 05:54:06 kernel: mfi0: (info) – BBU enabled; changing WT virtual disks to WB
    Sep 13 05:54:06 kernel: mfi0: (info) – Policy change on VD 00/0 to [ID=00,dcp=01,ccp=01,ap=0,dc=1,dbgi=0] from [ID=00,dcp=01,ccp=00,ap=0,dc=1,dbgi=0]
    Sep 13 08:54:36 kernel: mfi0: (info) – Battery charge complete

    Surely a brief look at the system logs should have yielded the answer without need for guesswork or hunches (a.k.a. “guesswork”)?

    Before deployment I have a little checklist for setting up PERC6s, which includes setting up the time of patrol reads and battery relearns to be in quiet periods.

    Interestingly enough I recently discovered a couple of controllers changing to write-through for periods of about 3 hours at seemingly random times. I decided to graph a few controller parameters for a few weeks and saw the controllers being affected were the ones with much greater reported capacity (~1700mAh compared to ~1000mAh on unaffected cards).

    What appears to be happening is the lower capacity batteries are being recharged every couple of days to keep them within ~100mAh of their reported capacity. These bigger batteries aren’t – they’re only being charged when the firmware notices them being beneath a threshold (~50% capacity) then it disables WB and starts charging.

    These are PERC6/i and PERC6/e cards on the latest firmware. Since they’re in mission critical machines running a couple TB of MyISAM, I daren’t bother with Dell “support” (who completely fail to recognise the concept of downtime being bad for business). Instead I’ve used MegaCLI to force write-back on bad BBU, as if the power goes I’m pretty much guaranteed to need to take a drive and restore from backup (much quicker than myisamchk/REPAIR TABLE).
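Where the driver does log these transitions, as in the FreeBSD excerpt above, the time a controller spent in write-through mode can be measured directly from the log. A minimal Python sketch, assuming syslog-style timestamps, an English locale for month names, and the two message texts exactly as they appear in that excerpt; the function name is hypothetical, and the year is a parameter because syslog lines omit it.

```python
import re
from datetime import datetime

WT_START = "changing WB virtual disks to WT"   # write-back disabled
WT_END = "changing WT virtual disks to WB"     # write-back restored
STAMP = re.compile(r"^\s*([A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})")


def write_through_windows(lines, year=2010):
    """Return (start, end) datetime pairs for each write-through period.

    `year` is an assumption supplied by the caller, since syslog
    timestamps do not carry one.
    """
    windows, start = [], None
    for line in lines:
        match = STAMP.match(line)
        if not match:
            continue  # skip lines without a leading timestamp
        stamp = datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S")
        if WT_START in line:
            start = stamp
        elif WT_END in line and start is not None:
            windows.append((start, stamp))
            start = None
    return windows


# The two policy-change lines from the excerpt above.
sample = [
    "Sep 13 03:09:36 kernel: mfi0: (WARN) – BBU disabled; changing WB virtual disks to WT",
    "Sep 13 05:54:06 kernel: mfi0: (info) – BBU enabled; changing WT virtual disks to WB",
]
windows = write_through_windows(sample)
for begin, end in windows:
    print(begin, end, end - begin)
```

For the excerpt above this gives a single window of 2 hours 44 minutes 30 seconds, which is the kind of number worth comparing against the duration of a slowdown.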

  10. Monitoring RAID controller health is a standard part of the checks we set up in Nagios. But nobody’s perfect, and this isn’t a client that we set up Nagios for. I bet your Nagios checks are not perfect and comprehensive, either.

  11. I sounded too abrupt in that last comment. My point is that a few comments kind of sounded disbelieving that this customer doesn’t do the basics and is somehow negligent, but the reality is that 99% of people are missing what we might consider “the basics”. Maybe I have gotten numbed, but I’m pleasantly surprised when a customer even has graphs of system stats like this one did.

    I expect that the silent majority of people who read this post, and do NOT comment, are feeling ashamed that they don’t have Nagios installed at all, or that they have never even thought about checking RAID health with Nagios if they do. It’s surely preferable to have everything ship-shape, but I don’t blame people for not being able to keep up with everything. Everyone’s overworked and understaffed.

  12. Patrick Casey says:

    I have to agree with Baron here. In my experience, once the farm reaches a certain size, you never end up looking at /var/log/messages for the simple reason that you can’t be logging into boxes. If your day-to-day job involves logging into lots of boxes to check messages and whatnot, then you’ve already lost the scalability battle. The only way to run a large farm is to let the boxes run themselves, use tools to ensure uniform config, then use remote monitoring to try to pick up things that look wrong. Then, once a box starts acting up, you can have somebody dive in and see what’s afoot from the logs and whatnot.

    And as others have pointed out, monitoring is never a perfect game. If you monitor everything, you A) produce more data than your NOC can consume and B) load your boxes down doing nothing but responding to monitoring data requests, so it’s always a tradeoff: what is the cost to monitor parameter X versus the likely cost if I don’t monitor it?

    There’s always somebody after the fact who can chime in with “if only you’d monitored parameter x, you would have caught this problem in its infancy”, but it’s very rarely the case that the same individual can identify that parameter beforehand.

  13. vegivamp says:

    @Patrick, that’s exactly what logcheck is good for :-) Basically you set up a central syslogging server, let all machines duplicate their logs to it (UDP, so no lag); and have logcheck filter out the gunk and mail you the important bits every hour or so.

    I set this up at most of my employers, and once well-tuned, you rarely get false positives anymore.

  14. peter says:

    I’m not sure if things have changed, but typically you would not get anything in the logs with PERC controllers and the drivers that are in the Linux kernel. If you have OMSA and a Nagios check for it, it will give you a clear indication.

  15. J-F Mammet says:

    Baron: I’m certainly not pretending my Nagios setup is perfect, I’m not *that* conceited (yet) :) But monitoring the health of your hardware is such a no-brainer in my mind, especially when hard drives are concerned, that I was rather surprised people with such a seemingly huge MySQL deployment didn’t have this. I’ve lost a lot of time in RAID crashes before, and I’m sure most other system engineers have too, so monitoring whether the hard drives are working correctly is the first thing I do when the operating system is installed.
    But maybe when you reach a critical size with redundant systems you don’t care much anymore about the health of a single node. I’m not at that point yet, nor do I think I ever will be when it comes to MySQL at my current job, so I wouldn’t know.

  16. J-F Mammet, absolutely — I think that monitoring the RAID status is a fundamental activity. I’m just used to people not doing a lot of fundamentals :)

    I think that this is just an example of how complex systems fail. There is never one single root cause, as people would love to find. There is always a chain of failures. I had an experience in a hospital recently where a patient was not getting good care, so we called the number listed on the wall “call this number if you aren’t getting good care” and there was a failure in the processes of the person who answered that phone, so we still did not get good care for the patient. What was the root cause? Was it the bad doctor, or the administrators who could not fix the bad doctor’s mess? Both were at fault.

    Failing to monitor the RAID health is just the obvious failure. I’ve seen cases where the monitoring was in place, but firewall rules changed and the alert emails couldn’t be sent. http://en.wikipedia.org/wiki/For_Want_of_a_Nail_%28proverb%29

  17. chris says:

    If anyone is just stumbling across this article now, the H700, H800, PERC 6I/E all have firmware updates that address the battery charging issue (A08 and later).

  18. Hi,

    as was indicated by one reader:
    It would be good to be more careful about this.

    the learn cycle is there for a reason. it’s making sure you don’t lose data.
    without the learn cycle the controller will not catch some battery failures.
    for the same reason the controller switches to write through in that period.
    not because some conehead decided so,
    but because during the learn cycle the battery won’t hold your cache protected.

    the same will apply, unnoticed, without learn cycle runs. a battery pack can silently fail.
    happens – unless you’re lucky enough for the bbu to completely fail so you hopefully notice.

    what to do?
    - get newer controllers with FBWC (the best solution)
    - schedule(!) the learn cycle if possible.

  19. Rico says:

    The link to the reference material was dead.

    Fixed:

    http://blog.yo61.com/dell-drac-bbu-auto-learn-tests-kill-disk-performance/
