September 22, 2014

Setting up XFS on Hardware RAID — the simple edition

There are about a gazillion FAQs and HOWTOs out there that talk about XFS configuration, RAID IO alignment, and mount point options.  I wanted to put some of that information together in a condensed and simplified format that will work for the majority of use cases.  This is not meant to cover every single tuning option, but rather to cover the important bases in a simple and easy-to-understand way.

Let’s say you have a server with a standard hardware RAID setup running conventional HDDs.

RAID setup

For the sake of simplicity you create one single RAID logical volume that covers all your available drives.  This is the easiest setup to configure and maintain and is the best choice for operability in the majority of normal configurations.  Are there ways to squeeze more performance out of a server by dividing the logical volumes?  Perhaps, but it requires a lot of fiddling and custom tuning to accomplish.

There are plenty of other posts out there that discuss RAID minutiae.  Make sure you cover the following:

  • RAID type (usually 5 or 1+0)
  • RAID stripe size
  • BBU enabled with Write-back cache only
  • No read cache or read-ahead
  • No drive write cache enabled

Partitioning

You want to run only MySQL on this box, and you want to ensure your MySQL datadir is separated from the OS in case you ever want to upgrade the OS, but otherwise keep it simple.  My suggestion?  Plan on allocating partitions roughly as follows, based on your available drive space and keeping in mind future growth.

  • 8-16G for Swap
  • 10-20G for the OS (/)
  • Possibly 10G+ for /tmp  (note you could also point mysql’s tmpdir elsewhere)
  • Everything else for MySQL (/mnt/data or similar); symlink /var/lib/mysql into here when you set up MySQL

Are there alternatives?  Yes.  Can you have separate partitions for InnoDB log volumes, etc.?  Sure.  Is it worth doing much more than this most of the time?  I’d argue not until you’re sure you are I/O bound and need to squeeze every last ounce of performance from the box.  Fiddling with how to allocate drives and drive space from partition to partition is a lot of operational work which should be spent only when needed.

Aligning the Partitions

Once you have the partitions, list them with fdisk -ul to see the sector size and the start sector of each partition.

Several months ago my colleague Aurimas posted two excellent blog posts, one on the theory of aligning IO on hardware RAID and one with some good benchmarks to emphasize the point.  Go read those if you need the theory here.  Is it common on modern Linux systems for alignment to be off?  Maybe not, but here’s how you check.

We want to use MySQL on /dev/sda3, but how can we ensure that it is aligned with the RAID stripes?  It takes a small amount of math:
  • Start with your RAID stripe size.  Let’s use 64k which is a common default.  In this case 64K = 2^16 = 65536 bytes.
  • Get your sector size from fdisk.  In this case 512 bytes.
  • Calculate how many sectors fit in a RAID stripe.   65536 / 512 = 128 sectors per stripe.
  • Get start boundary of our mysql partition from fdisk: 27344896.
  • See if the Start boundary for our mysql partition falls on a stripe boundary by dividing the start sector of the partition by the sectors per stripe:  27344896 / 128 = 213632.  This is a whole number, so we are good.  If it had a remainder, then our partition would not start on a RAID stripe boundary.
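
The arithmetic in the steps above can be sketched as a small shell function.  This is only an illustration under the numbers used in the text (64k stripe, 512-byte sectors, start sector 27344896); the function name is_stripe_aligned is made up for this example and is not part of any tool.

```shell
# Succeeds (exit status 0) when the start sector lies on a stripe boundary.
# Arguments: start sector, stripe size in bytes, sector size in bytes.
is_stripe_aligned() {
    sectors_per_stripe=$(( $2 / $3 ))
    [ $(( $1 % sectors_per_stripe )) -eq 0 ]
}

# Values from the text: /dev/sda3 starts at sector 27344896,
# the stripe is 64k (65536 bytes), and sectors are 512 bytes.
if is_stripe_aligned 27344896 65536 512; then
    echo "aligned"
else
    echo "not aligned"
fi
```

With the values from the text this reports the partition as aligned.  Substitute the start sector and sector size that fdisk reports on your own system.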

Create the Filesystem

XFS requires a little massaging (or a lot).  For a standard server, it’s fairly simple.  We need to know two things:

  • RAID stripe size
  • Number of unique, utilized disks in the RAID.  This follows the same formulas you would use to work out usable capacity:
    • RAID 1+0:  is a set of mirrored drives, so the number here is num drives / 2.
    • RAID 5: is striped drives plus one full drive of parity, so the number here is num drives – 1.
In our case, it is RAID 1+0 with a 64k stripe and 8 drives.  Since those drives each have a mirror, there are really 8 / 2 = 4 sets of unique drives that are striped over the top.  Using these numbers, we set the ‘su’ and ‘sw’ options in mkfs.xfs with those two values respectively: su=64k and sw=4.
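
As a rough sketch of how those two values come together, the snippet below derives sw from the RAID level and drive count and then prints a mkfs.xfs command line.  The helper name stripe_width is hypothetical, and /dev/sda3 is simply the partition used in this post.  The command is printed rather than executed because mkfs.xfs destroys whatever is on the target device.

```shell
# Number of data-bearing disks ("sw") for the two RAID levels discussed here.
# Arguments: RAID level (10 or 5), total number of drives.
stripe_width() {
    case "$1" in
        10) echo $(( $2 / 2 )) ;;   # RAID 1+0: half the drives are mirrors
        5)  echo $(( $2 - 1 )) ;;   # RAID 5: one drive's worth of parity
        *)  echo "unsupported RAID level: $1" >&2; return 1 ;;
    esac
}

su=64k                       # RAID stripe size
sw=$(stripe_width 10 8)      # 8 drives in RAID 1+0

# Print only; run the command by hand once you have checked the device name.
echo "mkfs.xfs -d su=$su,sw=$sw /dev/sda3"
```

For the 8-drive RAID 1+0 example this prints a command with su=64k and sw=4.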

The XFS FAQ is a good place to check out for more details.

Mount the filesystem

Again, there are many options to use here, but let’s use some simple ones: nobarrier, noatime, and nodiratime.
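
A minimal sketch of what the matching /etc/fstab entry could look like follows.  The mount options are the ones quoted from this post in the comments below (nobarrier, noatime, nodiratime); the device, the mount point, the trailing "0 0" dump and fsck fields, and the helper name fstab_entry are assumptions made for illustration.  Note that nobarrier only makes sense with the battery-backed write-back cache described in the RAID setup section.

```shell
# Print an /etc/fstab line for an XFS data partition.
# Arguments: block device, mount point.
fstab_entry() {
    printf '%s %s xfs nobarrier,noatime,nodiratime 0 0\n' "$1" "$2"
}

# Device and mount point follow the partitioning example in this post.
fstab_entry /dev/sda3 /mnt/data
```

Review the printed line before adding it to /etc/fstab, then mount the filesystem and confirm the options with the mount command.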

Setting the IO scheduler

This is a commonly missed step in getting the IO setup right.  The best choices here are ‘deadline’ and ‘noop’.  Deadline is an active scheduler, and noop simply means IO will be handled without rescheduling.  Which is best is workload dependent, but in the simple case you would be well-served by either.  Two steps here.  First, set the scheduler on the running system by writing your choice to /sys/block/sda/queue/scheduler, which affects sda only.
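
The runtime step might be wrapped up like this.  It is a sketch, not the exact commands from the original listing: the function names set_scheduler and show_scheduler are hypothetical, and the optional last argument (a sysfs root, defaulting to /sys) exists only so the functions can be tried against a scratch directory.

```shell
# Write the chosen scheduler for one block device.
# Arguments: device name (e.g. sda), scheduler, optional sysfs root.
set_scheduler() {
    dev=$1; sched=$2; sysroot=${3:-/sys}
    case "$sched" in
        deadline|noop) ;;
        *) echo "expected deadline or noop, got: $sched" >&2; return 1 ;;
    esac
    echo "$sched" > "$sysroot/block/$dev/queue/scheduler"
}

# Show the scheduler file for one block device.
# Arguments: device name, optional sysfs root.
show_scheduler() {
    dev=$1; sysroot=${2:-/sys}
    cat "$sysroot/block/$dev/queue/scheduler"
}
```

As root, set_scheduler sda deadline followed by show_scheduler sda applies and confirms the change.  On a real system the kernel lists all available schedulers in that file and puts the active one in square brackets.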

And to make it permanent, add ‘elevator=<your choice>’ in your grub.conf at the end of the kernel line.  Note that this changes the default scheduler for every block device on the system, not just one.
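
One way to confirm the grub change took effect after a reboot is to look for the elevator= parameter on the running kernel's command line.  The sketch below assumes a Linux system that exposes /proc/cmdline; the function name elevator_from_cmdline is made up for this example.

```shell
# Print the value of elevator= from a kernel command line string.
# Returns non-zero when the parameter is absent.
elevator_from_cmdline() {
    case " $1 " in
        *" elevator="*)
            rest=${1##*elevator=}   # text after the last elevator=
            echo "${rest%% *}"      # cut at the next space
            ;;
        *) return 1 ;;
    esac
}

# Check the command line the running kernel was booted with.
elevator_from_cmdline "$(cat /proc/cmdline 2>/dev/null)" \
    || echo "no elevator= parameter on the kernel command line"
```

If the parameter is missing after a reboot, check that you edited the kernel line of the grub entry that is actually being booted.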

 

This is a complicated topic, and I’ve tried to temper the complexity with what will provide the most benefit.  What has made the most improvement for you that could be added without much complexity?

About Jay Janssen

Jay joined Percona in 2011 after 7 years at Yahoo working in a variety of fields including High Availability architectures, MySQL training, tool building, global server load balancing, multi-datacenter environments, operationalization, and monitoring. He holds a B.S. of Computer Science from Rochester Institute of Technology.

Comments

  1. Dave Juntgen says:

    Jay – good summary, what are your thoughts on RAID stripe size? I had a sysadmin change the stripe size of the RAID controller from the default (64k) to 256k because he thought it would help reads, maybe for non-database workloads, and InnoDB does a good job of buffering reads already… So I’m skeptical about his change.

    Let’s say I have a RAID10 configuration with 4 drives. So, 4/2 = 2. If I have a data block that is < RAID stripe size, in this case 256k, does this mean that the RAID controller is not going to cut the data up into smaller chunks, say 4 64k chunks, and write them in parallel across the two drives? In my mind, and from what I have read, increasing the RAID stripe size typically doesn’t help with database workloads. Your thoughts?

  2. Nils says:

    I tend to create the xfs log with lazy-count=1.

  3. nate says:

    The tuning of the file system for the RAID stripe size is strange to me. What do you do when you’re working with a virtualized storage array which allows for non-disruptive addition of disks and re-striping of data over more disks as you add them to the array? From the array standpoint there is no application impact, but with XFS it sounds like you want to reformat each time you expand the array?

    Or does it not matter because the big intelligent caches on the array eliminate the need to tune the file system for the RAID config?

    My storage array for example spreads i/o across every spindle in the system(assuming they are of the same speed e.g. 15k rpm 7200 rpm) by default(I’ve never had a reason to not use the default).

  4. @Nils, the lazy-count is enabled by default on mkfs.xfs.

    @Nate, the idea behind stripe size is aligned I/O (it is not just XFS which allows sunit/swidth, ext* allows it too). If you have restriped and added or removed disks, you can specify a different sunit/swidth at mount time for XFS. However, this is not recommended, since AG/inode clusters and existing data will be unaligned while new data will be aligned. So the better option in this case is to back up your fs (xfsdump), mkfs it again with the new params, and then restore (xfsrestore). Some filesystems like GPFS (a cluster fs) handle restriping, but they also do heavy I/O underneath (basically the same thing).

    Regarding big ‘intelligent’ caches, if cache satisfies the entire data set then there is no problem to begin with(extremely large cache or extremely small data set).

  5. The term ‘stripe-size’ seems to be sort of non-portable.
    http://support.dell.com/support/edocs/software/svradmin/5.1/en/omss_ug/html/strcnpts.html defines both a strip size and a stripe size, whereas http://www.pcguide.com/ref/hdd/perf/raid/concepts/perfStripe-c.html and http://www.tomshardware.com/reviews/RAID-SCALING-CHARTS,1735-4.html use stripe size for what the Dell link calls strip size. Can anyone clarify these? There is also a ‘stride’, which refers to the single-disk one.

  6. I think that “#fdisk -ul” should be “fdisk -l /dev/sda”…

    The first scheduler setting is only for sda and the command to make it permanent changes the default scheduler for the whole system. I’ve used /etc/rc.local to change the scheduler only for a few devices.

  7. Thoughts on block aligning the end cylinder as well.

    Eg

    414538752 / 128 is not block aligned

    So thoughts on using fdisk to set the end cylinder to the biggest block aligned cylinder?

  8. Erm I mean

    856422399 / 128 = 6690799.9921875 – so that’s not block aligned. (I presume that there’s no point unless you preallocate the tablespace to the entire disk?)

  9. @Trent: I don’t see much to worry about regarding the end cylinder, our goal is to ensure our filesystem starts on a RAID stripe boundary, it doesn’t seem important to me if the end falls within a stripe. Indeed, if you have more partitions, you need to stop one block short so you can start the next partition on the stripe boundary.

  10. Ives Stoddard says:

    regarding XFS mount options…

    /var/lib/mysql xfs nobarrier,noatime,nodiratime

    nodiratime is a subset of noatime and never gets checked. so only “noatime” is necessary. details here…

    http://lwn.net/Articles/244941/

    for those curious about the use of nobarrier, keep in mind that some devices don’t support it and it’s disabled by default (such as networked devices)…

    [quote]

    http://xfs.org/index.php/XFS_FAQ

    Write barrier support.

    Write barrier support is enabled by default in XFS since kernel version 2.6.17. It is disabled by mounting the filesystem with “nobarrier”. Barrier support will flush the write back cache at the appropriate times (such as on XFS log writes). This is generally the recommended solution, however, you should check the system logs to ensure it was successful. Barriers will be disabled and reported in the log if any of the 3 scenarios occurs:

    “Disabling barriers, not supported with external log device”
    “Disabling barriers, not supported by the underlying device”
    “Disabling barriers, trial barrier write failed”

    If the filesystem is mounted with an external log device then we currently don’t support flushing to the data and log devices (this may change in the future). If the driver tells the block layer that the device does not support write cache flushing with the write cache enabled then it will report that the device doesn’t support it. And finally we will actually test out a barrier write on the superblock and test its error state afterwards, reporting if it fails.

    [/quote]

    sample output from an ubuntu 14.04 server running in AWS EC2…

    dmesg:[6622854.914716] blkfront: xvda1: barrier or flush: disabled; persistent grants: disabled; indirect descriptors: disabled;
    dmesg:[6622854.919286] blkfront: xvdb: barrier or flush: disabled; persistent grants: disabled; indirect descriptors: disabled;
    dmesg:[6622854.947641] blkfront: xvdf: barrier or flush: disabled; persistent grants: disabled; indirect descriptors: disabled;
    kern.log.1:Jun 24 23:52:01 x.x.x.x kernel: [6622854.914716] blkfront: xvda1: barrier or flush: disabled; persistent grants: disabled; indirect descriptors: disabled;
    kern.log.1:Jun 24 23:52:01 x.x.x.x kernel: [6622854.919286] blkfront: xvdb: barrier or flush: disabled; persistent grants: disabled; indirect descriptors: disabled;
    kern.log.1:Jun 24 23:52:01 x.x.x.x kernel: [6622854.947641] blkfront: xvdf: barrier or flush: disabled; persistent grants: disabled; indirect descriptors: disabled;

    my understanding here is that barrier writes are disabled in this case, but might be explicitly required for direct-attached-devices like battery-backed RAID.
