October 23, 2014

Make your file system error resilient

One of the typical problems I see when setting up ext2/3/4 file systems is sticking to the defaults when it comes to behavior on errors. By default these file systems are configured to Continue when an error (such as an I/O error or a metadata inconsistency) is discovered, which can keep spreading corruption. This shows up at its worst when a device has “flapping” problems, returning errors every so often, as this causes random pieces of data and metadata to be lost. That is not good for a system running MySQL Server. As far as I understand, this problem is limited to EXT2/3/4, while other file systems such as XFS will not continue if consistency problems are discovered.

So how can you check which error behavior mode your file system has? Run dumpe2fs /dev/sda1 (adding -h limits the output to the superblock summary) and you will get something like this:

dumpe2fs 1.41.14 (22-Dec-2010)
Filesystem volume name: <none>
Last mounted on: /mnt/data
Filesystem UUID: f9f7a0c3-0350-46d5-9930-29c3ac1f4b32
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags: signed_directory_hash
Default mount options: user_xattr acl
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 226918400
Block count: 3630694400
Reserved block count: 0
Free blocks: 3616208434
Free inodes: 226918374
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 316
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 2048
Inode blocks per group: 128
RAID stride: 8
RAID stripe width: 80
Flex block group size: 16
Filesystem created: Mon Aug 22 23:03:21 2011
Last mount time: Mon Aug 22 23:18:25 2011
Last write time: Wed Aug 24 00:01:56 2011
Mount count: 2
Maximum mount count: -1
Last checked: Wed Aug 24 00:01:56 2011
Check interval: 0 (<none>)
Lifetime writes: 54 GB
Reserved blocks uid: 0 (user unknown)
Reserved blocks gid: 0 (group unknown)
First inode: 11

This output contains a lot of interesting items, and I’ll get to some of them shortly. What concerns us right now is Errors behavior: Continue.
Besides continue, the behavior can be set to remount-ro, which makes the file system read-only, or to panic, which causes a kernel panic. I believe remount-ro is the best option for a database server, though panic might be a good option in a high-availability setup, as it crashes the server instead of leaving it running in a half-working mode and throwing errors (how badly depends on which file system became read-only).
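If you have many servers to check, you might prefer to script this rather than read dumpe2fs output by eye. Below is a minimal Python sketch, assuming dumpe2fs prints one “Field name: value” pair per line as in the output above. The helper names parse_dumpe2fs, read_superblock and errors_continue are hypothetical, not part of any tool; read_superblock needs root and a real device, so the example only runs the parser on a pasted sample.

```python
import subprocess


def parse_dumpe2fs(text):
    """Turn "Field name: value" lines from dumpe2fs output into a dict."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as timestamps
        # ("Mon Aug 22 23:03:21 2011") keep their own colons.
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon, e.g. the version banner
            fields[key.strip()] = value.strip()
    return fields


def read_superblock(device):
    """Run dumpe2fs -h (superblock summary only) on a device; needs root."""
    result = subprocess.run(["dumpe2fs", "-h", device],
                            capture_output=True, text=True, check=True)
    return parse_dumpe2fs(result.stdout)


def errors_continue(fields):
    """True if the file system keeps running after errors (the default)."""
    return fields.get("Errors behavior") == "Continue"


sample = """dumpe2fs 1.41.14 (22-Dec-2010)
Filesystem state: clean
Errors behavior: Continue
Filesystem created: Mon Aug 22 23:03:21 2011"""

fields = parse_dumpe2fs(sample)
print(fields["Errors behavior"], errors_continue(fields))  # Continue True
```

Checking only for the literal value Continue keeps the sketch tied to the output shown in this post rather than guessing at how other modes are worded.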

To set the error behavior to a different value, run tune2fs -e remount-ro /dev/sda1, which should produce output similar to:

tune2fs 1.41.14 (22-Dec-2010)
Setting error behavior to 2
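The “2” in that message is the numeric code the ext2/3/4 superblock uses for the error behavior: 1 is continue, 2 is remount-ro and 3 is panic (EXT2_ERRORS_CONTINUE, EXT2_ERRORS_RO and EXT2_ERRORS_PANIC in the e2fsprogs and kernel headers). A small Python sketch of that mapping follows; error_code and parse_tune2fs_output are hypothetical helpers, the latter simply reading the code back from the tune2fs message shown above.

```python
# Superblock error-behavior codes used by the ext2/3/4 on-disk format.
ERRORS_CODES = {
    "continue": 1,    # EXT2_ERRORS_CONTINUE: keep going (the default)
    "remount-ro": 2,  # EXT2_ERRORS_RO: remount the file system read-only
    "panic": 3,       # EXT2_ERRORS_PANIC: trigger a kernel panic
}


def error_code(mode):
    """Return the numeric code for a tune2fs -e mode, e.g. 2 for remount-ro."""
    try:
        return ERRORS_CODES[mode]
    except KeyError:
        raise ValueError(f"unknown error behavior: {mode!r}") from None


def parse_tune2fs_output(text):
    """Extract N from a 'Setting error behavior to N' line, or return None."""
    prefix = "Setting error behavior to "
    for line in text.splitlines():
        if line.startswith(prefix):
            return int(line[len(prefix):])
    return None


output = "tune2fs 1.41.14 (22-Dec-2010)\nSetting error behavior to 2"
print(parse_tune2fs_output(output) == error_code("remount-ro"))  # True
```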

It is worth noting that when an error is discovered during operation, EXT3 and EXT4 file systems will force a file system check on the next startup, which is handy.

Now, I know some people are concerned about setting the error behavior to remount-ro or panic, because it means even a minor error in file system data structures, which may affect only one file, will take out the whole file system. I do not think these concerns are valid. First, with recent Linux versions and quality hardware, the EXT3 file system is extremely stable (EXT4 is good too, though it is newer and I have a shorter history with it). So if you see errors popping up, you are very likely looking at hardware issues, which can cause all kinds of other nasty problems, especially for a database server. Second, the question comes down to what you care about most: consistency or availability? Are you ready to risk some data becoming inconsistent, and more data being lost, so that the system stays “up” (potentially serving wrong data) a little longer? For most systems it is not a worthwhile tradeoff. Moreover, if you’re running InnoDB, chances are it will not buy you more “uptime” either, as InnoDB is very sensitive to corruption, and if any file system errors are reported back to MySQL/InnoDB it will assert and restart.

Now let’s look at a couple of other options you might want to tune with tune2fs:

Reserved block count: 0. This is the number of blocks reserved for root. It typically defaults to 5% of total blocks, which is probably not needed on the partition where you store MySQL data: chances are the MySQL server is the only process writing to this partition anyway, so the reserved space would just be wasted. Some people like to keep it at a nonzero value as a space reserve, so that if the database runs out of space they can buy a little time before finding a more permanent solution.
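To get a feel for how much space the 5% default holds back, here is a quick calculation in Python using the Block count (3630694400) and Block size (4096) from the dumpe2fs output above. The helper reserved_bytes is hypothetical and assumes an integer percentage rounded down to whole blocks; on this roughly 13.5 TiB file system, 5% comes to about 692.5 GiB.

```python
GiB = 2 ** 30


def reserved_bytes(block_count, block_size, percent):
    """Bytes held back for root when `percent` % of the blocks is reserved."""
    # Assumes an integer percentage; partial blocks are rounded down.
    reserved_blocks = block_count * percent // 100
    return reserved_blocks * block_size


# Values taken from the dumpe2fs output above.
block_count, block_size = 3630694400, 4096

print(f"total: {block_count * block_size / GiB:.1f} GiB")
# total: 13850.0 GiB
print(f"reserved: {reserved_bytes(block_count, block_size, 5) / GiB:.1f} GiB")
# reserved: 692.5 GiB
```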

Maximum mount count: -1 and Check interval: 0. These correspond to the automatic file system check on startup, which is normally performed after a certain number of mounts or a certain number of days. Large partitions with many files can take a long time to check, which can be an unwelcome surprise when you restart a server expecting it back in 5 minutes, yet it takes 30+ because it has to check file systems. I believe it is much better to disable both of these automatic checks for your data partition and just check it manually as needed, the same way you would periodically check MySQL tables for corruption.
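If you script these settings, the two fields can be read back from dumpe2fs output to confirm both triggers are off. A Python sketch, assuming the fields have already been parsed into a dict of strings keyed by the names dumpe2fs prints; periodic_checks is a hypothetical helper. It follows the tune2fs documentation, where a maximum mount count of 0 or -1 and a check interval of 0 disable the respective checks.

```python
def periodic_checks(fields):
    """Report which automatic e2fsck triggers are still enabled.

    `fields` maps dumpe2fs field names to their printed string values.
    """
    # A maximum mount count of 0 or -1 means the mount-count trigger is off.
    max_mounts = int(fields["Maximum mount count"])
    # The interval is printed as seconds, followed by a readable form in
    # parentheses, e.g. "0 (<none>)"; 0 means the time-based trigger is off.
    interval = int(fields["Check interval"].split()[0])
    return {
        "mount_count_check": max_mounts > 0,
        "interval_check": interval > 0,
    }


print(periodic_checks({"Maximum mount count": "-1",
                       "Check interval": "0 (<none>)"}))
# {'mount_count_check': False, 'interval_check': False}
```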

To change these options, run tune2fs -m0 -i0 -c -1 /dev/sda1, which sets the reserved block percentage, the check interval and the maximum mount count, respectively.
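If you apply this change to many data partitions, it can help to assemble the command programmatically and review it before running it. A hedged Python sketch: tune2fs_args and run_tune2fs are hypothetical helpers, the defaults mirror the command above (-m0, -i0, -c -1), the optional errors argument adds the -e setting discussed earlier, and nothing is executed unless dry_run is set to False (which requires root). shlex.join needs Python 3.8 or later.

```python
import shlex
import subprocess


def tune2fs_args(device, reserved_percent=0, check_interval=0,
                 max_mount_count=-1, errors=None):
    """Build a tune2fs argument list (hypothetical helper)."""
    args = [
        "tune2fs",
        "-m", str(reserved_percent),  # percentage of blocks reserved for root
        "-i", str(check_interval),    # interval between forced checks, 0 = off
        "-c", str(max_mount_count),   # mounts between forced checks, -1 = off
    ]
    if errors is not None:
        args += ["-e", errors]        # continue, remount-ro or panic
    args.append(device)
    return args


def run_tune2fs(args, dry_run=True):
    """Print the command line; execute it only when dry_run is False (root)."""
    print(shlex.join(args))
    if not dry_run:
        subprocess.run(args, check=True)


run_tune2fs(tune2fs_args("/dev/sda1"))
# tune2fs -m 0 -i 0 -c -1 /dev/sda1
```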

About Peter Zaitsev

Peter managed the High Performance Group within MySQL until 2006, when he founded Percona. Peter has a Master's Degree in Computer Science and is an expert in database kernels, computer hardware, and application scaling.

Comments

  1. Anthony DeRobertis says:

    The error handling mode can also be set on mount, which will override the one stored with tune2fs. This can be done with the ‘errors=remount-ro’ (or panic, etc.) option in /etc/fstab.

  2. Anthony,

    Thanks, good point. For some reason I got into the habit of changing it with tune2fs :)

  3. Anthony DeRobertis says:

    One more thing: if you’re trying to change the option on the root filesystem, you may have to update your initramfs (because that’s where the root filesystem is mounted).

    Thankfully, there is an easy way to check the mode that’s actually in use:

    $ grep --color=yes errors= /proc/mounts
    /dev/mapper/Zia-root / ext4 rw,relatime,errors=remount-ro,user_xattr,acl,barrier=1,data=ordered 0 0
    /dev/md0 /boot ext2 rw,noatime,errors=continue 0 0

    /proc/mounts is the kernel’s idea of what the mounts are (and their options), and of course it’s the kernel’s idea that matters.

  4. XFS behavior is indeed different. It turns out that on XFS, errors=remount-ro will not even be accepted. If you put it in /etc/fstab and try to mount, you’ll get an error like the following.

    root@logos2:/# mount /xfs
    mount: wrong fs type, bad option, bad superblock on /dev/sdb,
    missing codepage or helper program, or other error
    In some cases useful info is found in syslog - try
    dmesg | tail or so

    This is one of the less helpful messages for XFS newbies. :)

  5. Robert,

    This is actually a generic “mount” error message that can appear when mounting any file system, and it is indeed confusing. dmesg usually gives more detailed info.

  6. Tomas says:

    root@logos2:/# mount /xfs
    mount: wrong fs type, bad option, bad superblock on /dev/sdb,

    is telling you that you are mounting the entire device (sdb) instead of a partition (sdbX, where X is a number).

  7. For UFS on Solaris there is a similar setting.

  8. Dja says:

    > Number of blocks reserved for root. [...] MySQL server is only one doing writes on this partition anyway it just would be wasted if allocated.

    When an ext* filesystem with “reserved block count: 0” is nearly full, this causes file fragmentation and performance issues unless you regularly run e2defrag by hand. Not a good idea for a database server.
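Both suggestions from the comments (setting errors= in /etc/fstab, and checking /proc/mounts for the mode actually in use) can be scripted. A Python sketch, assuming the whitespace-separated layout shared by /etc/fstab and /proc/mounts (device, mount point, type, options, dump, pass). set_errors_option and errors_modes are hypothetical helpers: the first rewrites the options field of an fstab entry to carry errors=<mode>, the second reports the errors= value per mount point. As noted in the comments, XFS does not accept errors=, so only apply set_errors_option to ext2/3/4 entries.

```python
def set_errors_option(fstab_line, mode="remount-ro"):
    """Return an /etc/fstab entry with errors=<mode> in its options field.

    Comment and blank lines are returned unchanged. Fields are re-joined
    with single spaces, so the original column alignment is not kept.
    """
    stripped = fstab_line.strip()
    if not stripped or stripped.startswith("#"):
        return fstab_line
    fields = stripped.split()
    if len(fields) < 4:
        raise ValueError(f"no options field in: {fstab_line!r}")
    # Field 4 holds the comma-separated mount options; drop any old errors=.
    options = [o for o in fields[3].split(",") if not o.startswith("errors=")]
    options.append(f"errors={mode}")
    fields[3] = ",".join(options)
    return " ".join(fields)


def errors_modes(mounts_text):
    """Map each mount point to its errors= mode, or None if not set."""
    modes = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[0].startswith("#"):
            continue  # skip blank, malformed or comment lines
        # Mount points containing spaces stay octal-escaped (e.g. \040).
        mount_point, options = fields[1], fields[3].split(",")
        mode = None
        for opt in options:
            if opt.startswith("errors="):
                mode = opt[len("errors="):]
        modes[mount_point] = mode
    return modes


def current_errors_modes(path="/proc/mounts"):
    """Read the kernel's live mount table (Linux only)."""
    with open(path) as f:
        return errors_modes(f.read())


entry = "/dev/sda1 /mnt/data ext4 defaults,noatime 0 2"
print(set_errors_option(entry))
# /dev/sda1 /mnt/data ext4 defaults,noatime,errors=remount-ro 0 2

sample = """/dev/mapper/Zia-root / ext4 rw,relatime,errors=remount-ro,user_xattr,acl 0 0
/dev/md0 /boot ext2 rw,noatime,errors=continue 0 0"""
print(errors_modes(sample))
# {'/': 'remount-ro', '/boot': 'continue'}
```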
