November 1, 2014

Spreading .ibd files across multiple disks; the optimization that isn’t

Inspired by Baron’s earlier post, here is one I hear quite frequently –

“If you enable innodb_file_per_table, each table is its own .ibd file.  You can then relocate the heavily hit tables to a different location and create symlinks to the original location.”

There are a few things wrong with this advice:

  1. InnoDB does not support these symlinks.  If you run an ALTER TABLE command, what you will find is that a new temporary table is created (in the original location!), the symlink is destroyed, and the temporary table is renamed.  Your “optimization” is lost.
  2. Striping (with RAID) is usually a far better optimization.  Striping a table across multiple disks effectively balances the ‘heavy hit’ access across many more disks.  With one disk per table, you are more likely to end up unbalanced: one disk overloaded and many idle.
  3. You restrict your backup methods.  You can’t take an LVM snapshot across logical volumes.
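For reference, the hack in question usually looks something like this (the paths and table name are hypothetical); the final ALTER TABLE illustrates point #1, because InnoDB rebuilds the table inside the original datadir:

```sql
-- In the shell, with the server stopped (or the table otherwise not in use):
--   mv /var/lib/mysql/mydb/big_table.ibd /mnt/fast_disk/big_table.ibd
--   ln -s /mnt/fast_disk/big_table.ibd /var/lib/mysql/mydb/big_table.ibd

-- Later, any table rebuild recreates the .ibd in the datadir,
-- replacing the symlink -- the relocated copy is silently abandoned:
ALTER TABLE mydb.big_table ENGINE=InnoDB;
```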

Another common claim with this recommendation is that it allows you to quickly add space when running out.  LVM actually allows you to add physical volumes and increase the size of logical volumes ;)  This is much easier than moving one large table around.
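As a rough sketch of that LVM workflow (the device, volume group, and volume names are hypothetical, and the commands require root):

```shell
# Add a new disk partition as a physical volume and fold it into the group
pvcreate /dev/sdc1
vgextend vg_mysql /dev/sdc1

# Grow the logical volume holding the datadir, then the filesystem on it
lvextend -L +200G /dev/vg_mysql/lv_data
resize2fs /dev/vg_mysql/lv_data   # ext3/ext4 support online resize
```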

About Morgan Tocker

Morgan is a former Percona employee.
He was the Director of Training at Percona. He was formerly a Technical Instructor for MySQL and Sun Microsystems. He has also previously worked in the MySQL Support Team, and provided DRBD support.

Comments

  1. Patrick Casey says:

    Actually, I’d very much love for innodb to properly support this configuration.
    My problem isn’t space, it’s “protecting” my transactional tables.

    I’d like to put my transaction tables (and log files) on fast high end storage.
    I’d like to put my (very large) log tables on slow SATA drives.
    I’d like to guarantee that if somebody decides to table scan an 80G log file that it doesn’t impact the IO performance of my transactional tables.

    A RAID set isn’t going to meet criterion #3, since if I saturate the disks ripping in a log file, there won’t be IOPS available for other transactions.

  2. Yes, unfortunately there is not much you can do in terms of QoS inside MySQL.

    I’d like to be able to pin certain indexes to memory, and I think it would be great if InnoDB supported real table spaces.

    Wishful thinking…
    It would be good to even separate the table spaces between data and indexes. Indexes could go on SSDs, data on SAS disks ;)

  3. See also http://www.sun.com/blueprints/1000/layout.pdf (Sun Performance Blueprint, anno 2000). States the same, with a bit more background.

  4. @Kristian: Thanks! Love the conclusion:

    “Nobody likes surprises in the performance characteristics of a system. Surprises make systems analysts work overtime, and users wait for results. By using large IOs and striping over many disks the performance of the system has been smoothed out. The peaks are not as high, but this is more than made up for because the valleys are not as low. The performance characteristics are more predictable, and therefore easier to get right.”

  5. Andy says:

    What about putting different tables into different databases?

    In that case can I run a single MySQL instance to access the various databases, and put the different databases on different drives?

  6. Patrick Casey says:

    Perhaps I have an unusual workload, but I really do disagree with what seems to be the consensus here, to wit: that a well-tuned wide RAID array is going to be the best way to deploy my IOPS to meet my business requirements.

    Fundamentally I have two classes of work in the same database.

    #1 I have transactional tables (and transactions against them) that have realtime requirements (ms matter)
    #2 I have sporadic archival searches against totally different tables which can take as long as they take (second, minutes, hours really)

    It is impossible at any practical price point for me to deploy a RAID array that gives me “enough” IOPS to satisfy group #1 if a couple of users happen to be running group #2 queries at the same time. Given the way INNODB and the underlying disk scheduler routines work, a sufficiently large surge in IOPS will always squeeze out other processes.

    My “workaround” if you want to call it that is to throw enough memory at the system so that my transactional tables are A) memory mapped and B) stay there even when INNODB starts ripping group #2 queries into the scheduler.

    This is really fairly old hat if you’ve been working with relational databases for a long time. For a “general case” workload with unpredictable characteristics, one big RAID 10 array is the best “default” choice, but once you understand your IO loading, hand-tuning your IO subsystem and putting the right tables/indexes on the right storage is a critical part of optimizing.

    If you have thousands of instances of the same application running on several hundred servers, then you are not looking for a “general case” solution, but rather a very highly optimized solution to a specific problem.

  7. Harrison says:

    At Facebook we’ve been working on adding symlink support for ibd files. I suspect it should be part of our next Facebook @ MySQL code push, but not 100% sure.

    Our use case is for mixed SSD and regular disk setups where we want to put some tables on SSD and others on rotational disks. With table_statistics it is relatively easy to figure out what tables can be put where, so I think this would be more common if mainline MySQL supported table_statistics or something similar.

  8. @Andy – per database works fine with symlinks.

    @Patrick – I see the potential for such optimizations – but they need more server support to be useful for most people (tablespace management, index pinning, etc). The current state is mostly a hack which causes maintainability issues when a new person makes the mistake of running ALTER TABLE :) There are bound to be exceptions (I apologize if you are one), but I really dislike the general recommendation.

    For a lot of users with mixed workloads, it is very difficult to limit the impact of the big, expensive queries.

  9. Patrick Casey says:

    Morgan,

    I agree that the symlink approach is sort of “hacky”, which is one reason we don’t use it.

    What I would like to see is for the mainline database to support different data directories. Something like:

    create DATA_DIRECTORY slow_directory /sata_array/mysql/data;
    alter table SOME_TABLE change DATA_DIRECTORY slow_directory;
    or
    create table ANOTHER_TABLE DATA_DIRECTORY slow_directory;

  10. @Patrick – MyISAM has this feature. Syntax is:

    CREATE TABLE my_myisam (a int) DATA DIRECTORY='/path/', INDEX DIRECTORY='/path/' ENGINE=MyISAM;

    InnoDB just ignores these options.

  11. Patrick Casey says:

    Yeah, but for a variety of reasons (table locks, crash recovery, lack of transactional support, just to name 3), myisam is an extreme non-starter in my environment. We actually ran a mixed myisam/innodb environment in the (somewhat) distant past and switched to full innodb. We ran stats on monitoring outages vs. data store and found that, for our workload at least, we had something like 3x the outages on myisam compared to innodb, so it was an easy decision to make.

    Note that I’m not saying myisam is the wrong choice all of the time or will inevitably lead to more downtime in your environment, just saying that in mine we get statistically better stability off of innodb. Editorially I’d almost always look at innodb first on a new project and somebody would have to *prove* to me that myisam was the right choice, but that’s me talking off my gut, not what the data proves.

    Anyway, what I want is that support in INNODB. Ideally I would prefer it to work either with file_per_table or w/o file_per_table (you could imagine two big innodb data files, one in fast_directory and one in slow_directory), but if I had to choose I’d choose file_per_table.

  12. Justin Swanhart says:

    As noted above in the reply to @Andy, while it is unsafe to directly symlink an .ibd file, it is supported to symlink an entire database directory. This can be combined with views to move physical access to other devices while logically still accessing the tables in the original database.

    For example:

    -- Assume this database is located on RAID10 and already has tables in it. Also assume that we just added SSD
    -- and only want to move certain tables there. Last, assume the application can’t be rewritten to access another
    -- database schema or database server.
    create database the_db;

    -- This database is going to be created on SSD
    create database fast_disk;
    -- Then, in the unix shell, move the directory to SSD and create a symlink to the directory:
    --   cd /var/lib/mysql
    --   mv fast_disk /mnt/path/to/ssd/
    --   ln -s /mnt/path/to/ssd/fast_disk .
    -- Now the `fast_disk` database resides on SSD

    All tables created in `fast_disk` will be created on the SSD. ALTER TABLE will work correctly, etc.

    Now consider that there are two sets of tables in the_db, transactional tables and log tables, and the application thinks the log tables are in `the_db`. Moving them to `fast_disk` would cause a ‘table not found’ error in the application. What we need is a way to make each table appear to be in `the_db` while actually living in `fast_disk`, without using an .ibd symlink.

    To trick the application we can use a VIEW (using the MERGE algorithm) which simulates a database ‘SYNONYM’:

    create table the_db.trx1(c1 int auto_increment, c2 int, c3 int, primary key(c1))engine=innodb;
    create table the_db.trx2(c4 int auto_increment, c1 int, c5 int, primary key(c4), key(c1)) engine=innodb; -- c1 is FK to trx1

    create table the_db.log(log_id int auto_increment primary key, log1 int, log2 char(10), log3 int) engine=innodb;

    To move the_db.log to `fast_disk`:

    RENAME TABLE the_db.log to `fast_disk`.log;
    CREATE ALGORITHM=MERGE VIEW the_db.log as select * from `fast_disk`.log;

    The database will treat the the_db.log VIEW as a table with respect to SQL. The MERGE view algorithm ensures that this doesn’t create performance problems: since there is no aggregation in the view, MySQL simply accesses the base table (`fast_disk`.log) through it.
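As a quick illustration of that transparency (reusing the hypothetical schema above), ordinary DML can keep targeting the old name and MySQL resolves the view straight to the relocated base table:

```sql
-- The application still writes to and reads from the_db.log,
-- unaware that the rows physically live in the database on SSD.
INSERT INTO the_db.log (log1, log2, log3) VALUES (1, 'startup', 0);
SELECT log_id, log2 FROM the_db.log WHERE log1 = 1;
```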

  13. Patrick, I think we’re all lined up behind you. We want InnoDB to support these kinds of options. Someday I hope it will.

  14. Justin Swanhart says:

    Yes, I didn’t want to leave the impression that I think DATA_DIRECTORY would be a bad idea for an InnoDB table – quite the opposite. I just wanted to point out a workaround that allows normal DML on the table. ALTER won’t work, but at least it won’t put a copy of the table on the incorrect device.

  16. Yzmir Ramirez says:

    @Patrick Casey

    You could have an environment with both InnoDB and MyISAM by having read-only slaves that are MyISAM versions of your InnoDB tables; this would mean you could use DATA DIRECTORY and INDEX DIRECTORY.

  17. Céd says:

    It seems like the Facebook patch can preserve symlinks: http://www.facebook.com/note.php?note_id=478421335932

  18. Oh,

    This is a fun topic from so many angles. First, not every RAID is as good as you would imagine, and because of poor RAID (or its configuration) people are able to get a lot more by manually creating many volumes and placing tables on them.

    Also, this is often premature optimization – doing a lot of labor-intensive tossing of data around when RAID would work just fine. Having said that, there are reasons to want different tables on different devices, and doing it per database is often the better choice until InnoDB supports DATA DIRECTORY/INDEX DIRECTORY or preserves symlinks. Different kinds of storage are the main reason – having some very large, rarely accessed tables on SATA, or very hot tables on flash, can be good.

  19. jackbillow says:

    Splitting .ibd files across multiple disks will increase the cost of maintenance.

  20. Patrick Casey says:

    Optimizations almost always involve a tradeoff between simplicity and performance.

    If configuration A is always easier to maintain and runs faster than configuration B, then everyone will run A (it’ll be the default).

    The nature of hand optimization is almost always that the “default” isn’t behaving appropriately in a particular use case, so we’re willing to implement something more complex/harder to maintain in the name of performance, fault tolerance, or some other business-critical metric.

    The challenge in my experience is that performance tuning is sort of sexy, at least more so than normal DBA/sysadmin duties. So there’s a lot of unnecessary performance tuning done out there by people because it’s “fun”, resulting in frail systems that don’t perform materially better than a default deployment but still manage to be excessively complex. The same thing, for what it’s worth, is true in software design, where there’s an axiom that runs something like “premature optimization is the root of all evil”.

    Anyway, my point I suppose is that there is a time and a place for hand optimization (and clearly I think I’m at that point in my largest deployment), but I also agree with the consensus here which seems to be “be very, very, sure you A) really need it and B) can maintain it before you start doing exotic performance tuning”.

  21. I 100% agree with this: “there’s a lot of unnecessary performance tuning done out there by people because it’s “fun”, resulting in frail systems”

  22. Baron, seconded. I think it would be interesting to see a post on premature optimization :)

    I chose a somewhat-controversial title, but premature optimization is really the underlying theme here.

