October 24, 2014

Recovering a Linux Software RAID5 Array

Dealing with MySQL, you may need to deal with RAID recovery every so often, sometimes because a client lacks proper backups, and sometimes because recovering the RAID array can improve the recovery itself: for example, it may give you point-in-time recovery when the backup setup only takes you to the point where the last binary log was backed up. I had wanted to write up recovery instructions for a long time, and I finally got my chance with the ReadyNAS Pro 6 I was setting up and testing at home for backups. It was doing its initial sync when it spotted a problem with another drive, and the RAID volume failed. The ReadyNAS runs Debian inside, and since you can get a root login via SSH, it can be recovered like any generic Linux server.

When you restart the system, a RAID5 volume with more than one failed hard drive will be completely inaccessible; this happens even if the failure is just a single bad sector on a disk. This "paranoid" behavior helps preserve consistency, but it can scare the hell out of you, giving you no access to the data at all. Not all hardware RAID behaves this way.
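Before touching mdadm at all, it is worth checking the kernel log, which usually records which members were kicked out of the array and why. A quick sketch (assuming the failed array is /dev/md2, as in this post):

```shell
# The kernel log usually says which members dropped out and why
# (I/O errors, timeouts, etc.) before the array refused to start.
dmesg | grep -iE 'md2|i/o error' | tail -n 40

# Current state of all software RAID arrays on the system
cat /proc/mdstat
```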

First, let's see what the status of the array is:

ReadyNAS1:~# mdadm -Q --detail /dev/md2
/dev/md2:
Version : 1.2
Creation Time : Fri Aug 26 13:51:11 2011
Raid Level : raid5
Used Dev Size : -1
Raid Devices : 6
Total Devices : 5
Persistence : Superblock is persistent

Update Time : Fri Aug 26 22:11:26 2011
State : active, FAILED, Not Started
Active Devices : 4
Working Devices : 5
Failed Devices : 0
Spare Devices : 1

Layout : left-symmetric
Chunk Size : 64K

Name : 001F33EABA01:2
UUID : 01a26106:50b297a8:1d542f0a:5c9b74c6
Events : 83

Number Major Minor RaidDevice State
0 8 3 0 active sync /dev/sda3
1 8 19 1 active sync /dev/sdb3
2 8 35 2 active sync /dev/sdc3
3 0 0 3 removed
4 8 67 4 active sync /dev/sde3
5 8 83 5 spare rebuilding /dev/sdf3

In this case I know /dev/sdf3 was being rebuilt when /dev/sdd3 developed problems. If you do not know which disk was being resynced (or which failed first), you can find out by examining all the volumes separately:

ReadyNAS1:~# mdadm --examine /dev/sdf3
/dev/sdf3:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x2
Array UUID : 01a26106:50b297a8:1d542f0a:5c9b74c6
Name : 001F33EABA01:2
Creation Time : Fri Aug 26 13:51:11 2011
Raid Level : raid5
Raid Devices : 6

Avail Dev Size : 5851089777 (2790.02 GiB 2995.76 GB)
Array Size : 29255447040 (13950.08 GiB 14978.79 GB)
Used Dev Size : 5851089408 (2790.02 GiB 2995.76 GB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
Recovery Offset : 5631463944 sectors
State : clean
Device UUID : 77ea1f91:5d4915c3:5cd17402:7f1ecafb

Update Time : Fri Aug 26 22:11:26 2011
Checksum : d9052ded - correct
Events : 83

Layout : left-symmetric
Chunk Size : 64K

Device Role : Active device 5
Array State : AAA.AA ('A' == active, '.' == missing)

Note the Update Time here: the disk that failed first will have the earliest Update Time. There is a nice article explaining this in more detail.
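To compare all the members at once, a loop like this can help (a sketch; the /dev/sd[a-f]3 pattern assumes the six-member layout used in this post):

```shell
# Print the superblock fields that matter for deciding which member
# dropped out first: the one with the oldest Update Time (and lowest
# Events counter) is the one to leave out when re-creating the array.
for dev in /dev/sd[a-f]3; do
    echo "== $dev =="
    mdadm --examine "$dev" | grep -E 'Update Time|Events|Device Role|Array State'
done
```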

I was looking for a way to tell mdadm to change the status of the rebuilding drive to "failed" and the removed drive to "active sync", but I could not find a way to do it. That may be by design, as using such commands the wrong way can ruin your RAID array. What you can do instead is re-create the RAID array in the same configuration it was originally created with, marking the drive you want skipped (such as the first failed drive, or the drive being resynced) as missing:
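Before stopping and re-creating anything, it is a good idea to save the current superblock details of every member, so the original geometry can be consulted if your first guess turns out to be wrong. A sketch (the backup directory is a placeholder):

```shell
# Dump each member's superblock so the original geometry (chunk size,
# layout, data offset, device order) survives the re-create.
# /root/raid-backup is a placeholder path.
mkdir -p /root/raid-backup
for dev in /dev/sd[a-f]3; do
    mdadm --examine "$dev" > "/root/raid-backup/$(basename "$dev").txt"
done
```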

ReadyNAS1:/# mdadm --stop /dev/md2
mdadm: stopped /dev/md2

mdadm --verbose --create /dev/md2 --chunk=64 --level=5 --raid-devices=6 /dev/sda3 /dev/sdb3 /dev/sdc3 /dev/sdd3 /dev/sde3 missing

mdadm: layout defaults to left-symmetric
mdadm: layout defaults to left-symmetric
mdadm: layout defaults to left-symmetric
mdadm: /dev/sda3 appears to be part of a raid array:
level=raid5 devices=6 ctime=Fri Aug 26 13:51:11 2011
mdadm: layout defaults to left-symmetric
mdadm: /dev/sdb3 appears to be part of a raid array:
level=raid5 devices=6 ctime=Fri Aug 26 13:51:11 2011
mdadm: layout defaults to left-symmetric
mdadm: /dev/sdc3 appears to be part of a raid array:
level=raid5 devices=6 ctime=Fri Aug 26 13:51:11 2011
mdadm: layout defaults to left-symmetric
mdadm: /dev/sdd3 appears to be part of a raid array:
level=raid5 devices=6 ctime=Fri Aug 26 13:51:11 2011
mdadm: layout defaults to left-symmetric
mdadm: /dev/sde3 appears to be part of a raid array:
level=raid5 devices=6 ctime=Fri Aug 26 13:51:11 2011
mdadm: size set to 2925544704K
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata

Note it is very important to mark one specific drive as missing: if you do not, one of the drives will be picked to be resynced from the others, and if it is the wrong drive, you will lose your data. Creating the RAID this way also lets you check whether you guessed correctly; if you created the RAID the wrong way, you will probably be unable to mount it, find the LVM volumes on it, etc., and you can go back and correct your error.

ReadyNAS1:/# cat /proc/mdstat
Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md2 : active raid5 sde3[4] sdd3[3] sdc3[2] sdb3[1] sda3[0]
14627723520 blocks super 1.2 level 5, 64k chunk, algorithm 2 [6/5] [UUUUU_]

So you can see the RAID is active now, though it is missing one of the disks, running on 5 instead of 6. Before we start the resync, let's validate that it was assembled correctly:

ReadyNAS1:/# mount /dev/c/c
mount: special device /dev/c/c does not exist
ReadyNAS1:/# lvdisplay
--- Logical volume ---
LV Name /dev/c/c
VG Name c
LV UUID Rd66bT-qF3P-MgES-F9jK-zQ01-t0qo-qbo070
LV Write Access read/write
LV Status NOT available
LV Size 13.61 TB
Current LE 223041
Segments 1
Allocation inherit
Read ahead sectors 0

So we have the RAID volume back, but LVM shows the volume as NOT available, so it cannot be mounted. Happily, this is easily fixed:

ReadyNAS1:/# vgchange -a y
1 logical volume(s) in volume group "c" now active

Let's check whether the file system is in good shape:

ReadyNAS1:/# fsck /dev/c/c
fsck 1.41.14 (22-Dec-2010)
e2fsck 1.41.14 (22-Dec-2010)
/dev/c/c: clean, 25/228395008 files, 14579695/3654303744 blocks

Good, so now we are pretty sure it is in good shape. I also mounted it and checked a couple of files to be sure.
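For that mount-and-check step, mounting read-only is safer, since a wrongly assembled array then cannot be written to. A sketch (the mount point and file path are placeholders):

```shell
# Mount the recovered logical volume read-only, then spot-check a few
# files you know. /mnt/recovery and the file path are placeholders.
mkdir -p /mnt/recovery
mount -o ro /dev/c/c /mnt/recovery
md5sum /mnt/recovery/some/known/file   # compare against a known-good copy
umount /mnt/recovery
```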

Let's try to add the last drive back to the volume so it can attempt to resync. A warning first, though: in many cases this is a valid thing to do even without replacing any hard drives. In my experience, at least 50% of failed hard drives in a RAID array are false positives; either simply adding the drive back or re-seating the hot-swap drive solves the problem. If the drive has bad blocks, however, the resync is likely to read them, and then the array will fail again. If that is what happens, you have a couple of options. First, you can just copy the data off the RAID array, bypassing any files that have bad sectors; often this will be only a file or two. Second, you can get something like Drive Fitness Test (different hard drive vendors have different versions). Such tools often have functionality to scan the hard drive for bad blocks and remap them if spare sectors are available. The tool may be able to read the data from the original sectors, or it may fail to do so, in which case they are zeroed out and your data is potentially corrupted. This is why I prefer to know which files are affected by bad blocks in the first place.
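For the first option, copying everything off while logging which files hit unreadable sectors can be sketched like this (source and destination paths are placeholders; rsync reports per-file read errors and keeps going rather than aborting):

```shell
# Copy data off the degraded array (mounted read-only); rsync logs
# per-file I/O errors and continues, so the error log ends up being
# the list of files touching bad blocks. Paths are placeholders.
rsync -a /mnt/recovery/ /backup/copy/ 2> /backup/rsync-errors.log
grep -i 'error' /backup/rsync-errors.log
```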

ReadyNAS1:/# mdadm -a /dev/md2 /dev/sdf3
mdadm: added /dev/sdf3

ReadyNAS1:/# cat /proc/mdstat
Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md2 : active raid5 sdf3[6] sde3[4] sdd3[3] sdc3[2] sdb3[1] sda3[0]
14627723520 blocks super 1.2 level 5, 64k chunk, algorithm 2 [6/5] [UUUUU_]
[==>.................] recovery = 12.6% (371104968/2925544704) finish=488.9min speed=87075K/sec

As you can see, the drive is being rebuilt now, though we have yet to see whether it runs into problems with the volume again.

About Peter Zaitsev

Peter managed the High Performance Group within MySQL until 2006, when he founded Percona. Peter has a Master's Degree in Computer Science and is an expert in database kernels, computer hardware, and application scaling.

Comments

  1. I've used dd_rescue a few times to make a copy of the disks/CF cards before using your method of recovery. dd_rescue won't exit when an I/O error occurs; it will try to copy as much as possible by using smaller and smaller block sizes. It can also start at the end of the device and read back towards the start, which is helpful for CF cards with a bad block (read the bad block and the whole device can become unavailable).

  2. Daniel,

    Thank you, I forgot to mention this tool even though I have used it for recovery a couple of times. Indeed it can be very helpful to get a disk with no bad blocks, though it suffers from the same problem as drive-reconditioning software: you would not know which files are damaged. With InnoDB this thankfully can be checked by running a checksum check on your tablespaces, but if it is something else it might not be that easy.

  3. Andrew Delpha says:

    Very important to note: when re-creating the array, make sure to specify the same metadata version as what you are currently using, or else you can overwrite data, because the different metadata versions have different sizes/locations.

    I had a similar situation to the one outlined above, and when I re-created the array, the metadata got bumped from 0.90 to 1.2, which messed up my LVM metadata (I was able to recover from this, but it added a lot of pain to the process).

    You can specify the metadata version to use on the mdadm command line, for example:

    mdadm --verbose --create /dev/md2 --chunk=64 --level=5 --metadata=0.90 --raid-devices=6 /dev/sda3 /dev/sdb3 /dev/sdc3 /dev/sdd3 /dev/sde3 missing
