This post is not exactly about MySQL performance, or about performance at all, but I guess it should be interesting to many MySQL DBAs and other people involved in running MySQL in production.
Recently I’ve been involved in troubleshooting a Dell PowerEdge 2850 system running RAID5 on six 300GB internal hard drives, which gives about 1.4TB of usable space.
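As a quick sanity check on that capacity figure: RAID5 spends one drive's worth of space on parity, so usable space is (n - 1) times the drive size. Here is a minimal sketch, assuming six drives of 300GB each (the drive size mentioned later in this post) and decimal gigabytes as drive vendors count them. The function name is made up for illustration.

```python
def raid5_usable_bytes(drive_count, drive_bytes):
    """Usable capacity of a RAID5 array: one drive's worth of space goes to parity."""
    if drive_count < 3:
        raise ValueError("RAID5 needs at least 3 drives")
    return (drive_count - 1) * drive_bytes


GB = 10**9    # decimal gigabyte, as printed on drive labels (assumption)
TIB = 2**40   # binary terabyte, as many tools report it

usable = raid5_usable_bytes(6, 300 * GB)
print(usable / GB)             # 1500.0
print(round(usable / TIB, 2))  # 1.36
```

So 1500GB on the label shows up as roughly 1.4TB once it is counted in binary units.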
The problem started when one of the hard drives was set to the “Predicted Failure” state by “Patrol Read”, which is run automatically by the PERC4 (LSI Logic MegaRAID) controller. Dell was prompt to ship a replacement hard drive and the drive was replaced. This should have been the happy end of the story, but in reality the troubles had only begun.
After a hard drive is replaced the RAID has to be rebuilt, but the problem in this case was that the rebuild failed, bringing the whole logical drive down, because yet another hard drive had a bad block. The replacement drive was marked “failed” because it could not be rebuilt, and the other one because of the read failure. So my first advice would be: run a consistency check before replacing a hard drive with a predicted failure, to minimize the chance of a double drive failure in the RAID. It is good to run consistency checks on a regular basis anyway, but this final run would not hurt.
The next interesting thing is that there is not much advice to be found in the Dell documentation about handling a RAID with two failed hard drives. The impression is that it should never happen (while it does, and not as rarely as one would hope), and that if it happens you should just go and get your backup. Restoring over 1TB of data is never fun, but in this case there was no backup, which made recovery all the more important.
Interestingly enough, the logical drive could be brought online and used by forcing the newly failed drive online. It probably just had a couple of bad blocks, but there was no way to resync the logical drive in this situation.
What one would like to do in such a case is to force the SCSI drive to remap those bad blocks. A couple of files could be corrupted, but that is much better than losing everything. Unfortunately neither the RAID BIOS nor the RAID tools provide such a feature.
Happily, the Dell BIOS has a little option which allows you to disable the RAID controller and access your disks as plain SCSI. Changing this option results in various scary messages such as “Data loss will occur”, but in reality you can change it back and forth. You just have to be careful and know what you’re doing.
In the SCSI BIOS there is an option to perform “Verify Media”, which can be used to scan a hard drive and remap bad blocks. After the remapping is done, RAID mode can be enabled again and the array can be rebuilt just fine. There is of course a chance that some data is corrupted, so checking the file system and the MySQL database is a good idea.
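For the MySQL side of that check, here is a minimal sketch. It assumes the stock mysqlcheck client is on the PATH and that credentials come from an option file such as ~/.my.cnf. The helper names are made up for illustration, and this is not the exact procedure used in this recovery.

```python
import subprocess


def mysqlcheck_command(extended=False):
    """Build a mysqlcheck command line that checks every table in every database."""
    cmd = ["mysqlcheck", "--all-databases", "--check"]
    if extended:
        cmd.append("--extended")  # slower, more thorough variant of --check
    return cmd


def check_all_tables(extended=False):
    """Run mysqlcheck and return (exit_code, combined_output)."""
    result = subprocess.run(
        mysqlcheck_command(extended), capture_output=True, text=True
    )
    return result.returncode, result.stdout + result.stderr
```

Calling check_all_tables() once the server is back up gives a per-table status report. Any table not reported as OK is a candidate for repair or for reloading from another copy.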
So my story had a happy ending with only minimal (yet to be discovered) data loss, but it could have been worse.
There are a few things this case is a reminder of:
Do not assume RAID is reliable. RAID is more reliable than a plain disk, and RAID6 is more reliable than RAID5, but they can all fail, even expensive SAN systems. So if you care about your data, make sure you have a backup plan for when it happens. This is of course not to mention software bugs and user errors, which are other reasons why you want backups. Do not trust any single piece of hardware in HA scenarios.
Have backups ready. If you care about your data, backups are a must whatever other HA methods you use.
Large data sets take time. Restoring a 1.5TB volume is likely to take hours. Can you afford it? Even verifying media on a 300GB hard drive took several hours. This could be one more reason to scale out and keep a manageable amount of storage on each node. At the very least, multiple smaller RAID volumes could be used so that rebuilding any one of them takes less time.
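To put rough numbers on "hours", here is a back-of-the-envelope sketch. The throughput figures are assumptions picked for illustration, not measurements from this system, and the function name is made up.

```python
def transfer_hours(size_bytes, bytes_per_second):
    """Hours needed to move size_bytes at a sustained rate of bytes_per_second."""
    return size_bytes / bytes_per_second / 3600


TB = 10**12
MB = 10**6

# Assumed sustained restore rates; measure your own hardware before trusting these.
for rate_mb in (50, 100, 200):
    hours = transfer_hours(1.5 * TB, rate_mb * MB)
    print(f"{rate_mb} MB/s -> {hours:.1f} hours")
```

At an assumed 100 MB/s, a 1.5TB restore takes a bit over four hours, and halving the volume size halves the time. That is the argument for smaller volumes in a nutshell.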
There are also a couple of ideas:
Dell: why don’t they offer Verify Media, with the ability to remap bad blocks, in the RAID BIOS itself or in the RAID tools? It should not be a big deal, especially for an offline drive.
Backups with instant recovery: it could be interesting to try integrating DRBD with LVM, so that a snapshot could be taken and synchronized over the network as a backup. If quick recovery is needed, the snapshot could be attached over the network and operations started while it is gradually restored in the background to the local volume. Local networks are fast these days, so it could perform very well.