October 20, 2014

Battery Learning is still a problem many years later

The performance problems caused by battery auto-learning go back many years. We wrote about them, and so did other people from the MySQL community. The situation has not gotten better, at least not with Dell RAID controllers: the H700 and H800 have the same problem. At the same time the situation has gotten worse, as a lot more people are running InnoDB in full durability mode, which is dramatically affected when the write cache is disabled during learning.

First, I wonder how common this problem is outside of the Dell product line (which uses LSI chips)? Are there RAID controllers which do not have this problem? For many installations it would make sense to pay a few hundred dollars more per server just to avoid the nightmare of scheduling learning cycles.

I’m surprised it is taking so many years to fix this. Can’t one use a capacitor instead and bundle the 512MB cache with 512MB of flash, so that when power goes down the cache contents are stored on flash? Or can’t one use batteries which can be moved independently?
It looks like the H800 has “transportable non volatile cache” (TVNC) as an option, but it does not seem to be quite that.

As the problem is still there, what can you do about it? First, test it. You can trigger a learn cycle by disabling auto-learn and then triggering learning either with MegaCLI or with the Open Management tools (see this for example). You will see for how long the battery-backed cache gets disabled on your system (this is only part of the whole learn phase). You can also switch the cache mode to Write Through if you do not have much time for testing. I recommend this testing as part of comprehensive IO subsystem performance testing: if you have RAID, check what performance is going to be with a failed hard drive (and during the rebuild stage), what overhead LVM adds for backups, etc. It may be that the performance drop is not such a bad issue for you, so you can just take it during the night, or you might need to do something such as taking a slave out of rotation while it is going through the process.
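To put a number on how long the cache stays disabled, one option is to record the controller's cache mode at regular intervals during the test and add up the time spent outside write-back mode. The sketch below assumes you already collect such samples somehow (for example by polling your RAID management tool); the function name and the "WriteBack"/"WriteThrough" labels are hypothetical and not taken from any particular tool.

```python
from datetime import datetime, timedelta


def write_through_duration(samples):
    """Total time the cache spent outside write-back mode.

    samples: list of (datetime, mode) tuples sorted by time. The mode
    labels are hypothetical; each sample's mode is assumed to hold
    until the next sample is taken.
    """
    total = timedelta(0)
    for (t0, mode), (t1, _next_mode) in zip(samples, samples[1:]):
        if mode != "WriteBack":
            total += t1 - t0
    return total


# Made-up samples: cache drops to write-through 10 minutes into the test
# and returns to write-back an hour later.
start = datetime(2014, 1, 1, 2, 0)
samples = [
    (start, "WriteBack"),
    (start + timedelta(minutes=10), "WriteThrough"),
    (start + timedelta(minutes=70), "WriteBack"),
    (start + timedelta(minutes=80), "WriteBack"),
]
print(write_through_duration(samples))  # 1:00:00
```

The resolution is only as good as the polling interval, so sample often enough for the precision you need.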

Second, schedule it. Most systems would be much better off with learning scheduled during the night or weekend, where it can be done on different servers at different times with the team informed about slower performance, rather than catching everyone by surprise (at least the first time).
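As a rough illustration of running learning on different servers at different times, here is a minimal sketch that hands each server its own slot inside a maintenance window. The server names, the window and the slot length are all made up; the slot should be sized from what you measured in testing. It only computes start times, it does not talk to any controller.

```python
from datetime import datetime, timedelta


def stagger_learn_cycles(servers, window_start, window_end, slot):
    """Give each server a non-overlapping slot inside the window.

    Raises ValueError if the servers do not all fit.
    """
    schedule = {}
    for i, name in enumerate(servers):
        slot_start = window_start + i * slot
        if slot_start + slot > window_end:
            raise ValueError("no room for %s in this window" % name)
        schedule[name] = slot_start
    return schedule


# Hypothetical example: three servers, a 01:00-05:00 window, one hour each.
window_start = datetime(2014, 10, 25, 1, 0)
window_end = datetime(2014, 10, 25, 5, 0)
plan = stagger_learn_cycles(
    ["db1", "db2", "db3"], window_start, window_end, timedelta(hours=1)
)
for name, when in plan.items():
    print(name, when.strftime("%H:%M"))
```

The resulting times can then be fed into whatever you use to trigger the learn cycle and to notify the team.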

Third, you may choose to compromise on ACID during such periods. The RAID controller gives an option to force write back even with no battery, which will likely trash your database if power goes down during the learning process. That may be fine for your data; if not, you may be able to get a smaller penalty by going from innodb_flush_log_at_trx_commit=1 and sync_binlog=1 to values 2 and 0 respectively. Both can be changed without a server restart with the InnoDB Plugin, Percona Server and MySQL 5.5. Note it might also be good to increase innodb_write_io_threads to get more outstanding requests; without cache it matters a lot for writes. This is a lot better than forcing write cache without a battery, as the database should not get corrupted in case of bad crash timing, though you may lose some recently committed transactions and the binlog may get out of sync with the InnoDB transaction logs.
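The two durability settings above can be switched with plain SET GLOBAL statements. A small sketch that generates them for the relaxed and the fully durable state follows; the helper and dictionary names are hypothetical, and the statements still have to be run by an account with sufficient privileges, before the learn cycle starts and again after it finishes.

```python
# Values from the paragraph above: relaxed for the learn cycle,
# fully durable for normal operation.
RELAXED = {"innodb_flush_log_at_trx_commit": 2, "sync_binlog": 0}
DURABLE = {"innodb_flush_log_at_trx_commit": 1, "sync_binlog": 1}


def set_global_statements(settings):
    """Build one SET GLOBAL statement per variable."""
    return ["SET GLOBAL %s = %d;" % (name, value)
            for name, value in settings.items()]


print("\n".join(set_global_statements(RELAXED)))
# ... learn cycle runs here ...
print("\n".join(set_global_statements(DURABLE)))
```

Keeping both states in one place makes it harder to forget the second half, which is restoring full durability afterwards.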

I’m also wondering if this is something where Facebook Flash Cache can be helpful, if it can act in place of a hardware BBU cache. It would be interesting to test.

About Peter Zaitsev

Peter managed the High Performance Group within MySQL until 2006, when he founded Percona. Peter has a Master's Degree in Computer Science and is an expert in database kernels, computer hardware, and application scaling.

Comments

  1. Peter,

    On the HP P800, at least, you can have two batteries, and the learning phases are done alternately so a battery is always available.

    Regards,

    Jeremy

  2. HP also has a flash-cache solution in their newer controllers I believe. I agree with you, though, Peter, and have been complaining about the lack of multiple battery support, or an alternative option (capacitor + flash sounds like a neat idea) for a while. It’s ridiculous given that it’s such an easy thing to solve. I’m glad HP appears to have solved it with, it sounds like, at least two solutions.

  3. Jeremy, Tim,

    Great tips. Any other RAID controllers which do not have this problem ?

  4. I was skeptical of the HP controllers myself and haven’t put them to the test (yet) directly. So, unfortunately, no :(

  5. Rene H. Larsen says:

    Most series 5 Adaptec RAID controllers are available with flash+capacitor instead of battery backup. These controllers are distinguished by having a “Z” in their model name (Z for Zero Maintenance).

    The upcoming series 6 controllers will all support flash+capacitor based backup, I believe.

  6. Scott Kahler says:

    Something to add to the mix on Dell controllers. We run a large number of PERC6/Es in a cluster. Recently the batteries on them have started to report bad. When they can only recharge to about 1/3 capacity, OpenManage will start to complain. Dell says the batteries on those have an expected lifetime of a year. We are only seeing this on a cluster that’s doing extremely high amounts of IO.

  7. Scott,

    Yes, failing batteries are another concern, though I have usually seen them last much longer than one year. As the battery is there just to keep power going to DRAM when main power fails, I do not understand how it should be related to the amount of IO the system is doing. This does seem to be a very well known issue to account for, though; see for example the Adaptec whitepaper http://www.adaptec.com/nr/rdonlyres/7fd8c372-8231-4727-b12b-5abf79d9325c/0/6514_series5z_1_7.pdf
