InnoDB: look after fragmentation


One problem made me puzzled for couple hours, but it was really interesting to figure out what’s going on.

So let me introduce problem at first. The table is

Table has 11864696 rows and takes Data_length: 698,351,616 bytes on disk

The problem is that after restoring table from mysqldump, the query that scans data by primary key was slow. How slow ? Let me show.

The query in question is (Q1):

On cold buffer_pool, it took:

However the query (again on cold buffer_pool) (Q2)

Difference is impressive. 4 min 13.61 sec vs 18.81 sec

If you want EXPLAIN plain, here it is:

For Q1:

For Q2:

Query Q1 is executed using Primary Key, and Query Q2 is using block_id key.

To get more details I ran both queries with our extended stats in slow.log (available in 5.0-percona releases)

So for query Q1:

And for query Q2:

As you see for Q1 IO read took 236.149003 sec vs 0.165124 for Q2. But Q1 is scan by primary key, which supposed to be

Let’s see on another statistic, which available in innodb_check_fragmentation patch:

for Q1:

and for Q2:

So you see for Q1 it was not sequential scan, even it is primary key, but it is sequential for Q2.

So what’s the answer ? It’s fragmentation of primary key (and whole data table, as InnoDB data == primary key). But how it could happen with
primary key after mysqldump ? The answer here if we look on


We see that dump is taken in key “block_id” order, not in primary key order. And later when we load this table, INSERTS into primary key happens in random order, and that gives us the fragmentation we see here.

How to fix it in our case. It’s easy:

, it will force InnoDB to rebuild table in primary key order.

After that Q1:

You see that time returned to appropriate 17.72 sec.

You may ask what happens now with Q2 ? yes, it’s getting slow now, as we made key “block_id” inserted not in order.

As for mysqldump you may use

options to force dump in primary key order.

So notes to highlight:

  • InnoDB fragmentation may hurt your query significantly, especially when data is not in buffer_pool and execution goes to read from disk
  • Fragmentation by secondary key is much more likely than by primary key, and you cannot really control it (tough it is possible in XtraDB / InnoDB-plugin with FAST INDEX creation) so be careful with queries scan many records by secondary key
  • To check if you query affected by fragmentation you can usestatistics in 5.0-percona releases (coming to 5.1-XtraDB soon)

Insight for DBAs, MySQL


  • What happened to these variables in later versions? In version Server version: 5.1.55-rel12.6-log Percona Server (GPL), 12.6, Revision 200 the “innodb_scan%” status variables do not exist, but in Server version: 5.0.92-50-log Percona SQL Server, Revision 85 (GPL) they do exist.

    Furthermore, the patch that contains these is documented https://www.percona.com/docs/wiki/patches:innodb_check_fragmentation but there’s no indication they are deprecated…..or what replaces them….and sadly, https://www.percona.com/docs/wiki/percona-server:features:indexes:variable_reference does not reference those variables or the patch itself.

    What happened? Is there another place I can find this information?

  • Vadim,

    I would mention it looks like it is caused by the bug to start with.

    SELECT * FROM table

    should not be using index but rather primary key. The data is clustered by primary key so it is generally faster.
    Also SELECT * FROM table can be only “using index” if there is a key which (together with primary key) covers all columns which means it is not going to be shorter than clustered (primary key) and so even in case they are both fragmented it should not be slower.

    I suspect the problem here may come with costs – because Innodb stats are inexact it may look like there are a lot less records to scan using one index than other, while of course if you’re doing full table scan or full index scan number of records is going to be same.

    Great fine BTW and nice to see our stats are working :)

  • Interesting case.
    Technically it was an equal choice in terms of index selection, since the indexed column (block_id) plus the composite primary key covers all the columns. So it was just as valid to pick that index as it was to just scan the primary key. And it would be just as fast.

    Of course, when using the output from that query to insert into another table (via a dump or just INSERT … SELECT), and the destination table is InnoDB, then it’s more beneficial to have the select in primary key order. But hey, we all know that unless you specify an ORDER BY, result set order is undefined.
    I don’t think this is a optimiser bug, really. If you want the select ordered, use an ORDER BY clause.

    And indeed, mysqldump offers the –order-by-primary option, and using it with InnoDB is good. For other engines, the situation can be quite different. For MyISAM, ordering by primary may in fact trigger a filesort (even go to disk) since a full table scan is likely not in primary key order. It’s arbitrary (well, based on insert order + gap filling).

  • Arjen,

    Right. From the size prospective the indexes are about the same. Primary key is even a bit larger as it has more information in it than primary key. However it is still better to scan in primary key order in most cases because it is less fragmented. It is very frequent to see auto increment primary key or other sequential insert patterns.

  • Can you say more about the nature of the fragmentation?

    Even if the dump occurred in non-pk order, the pk order should have been restored during the import, no?
    I thought that the price for the out-of-order restore would be excessive i/o during the restore as the
    b-tree is continually re-organized to cope with the out-of-order inserts. After the restore the table must
    be in pk order, no?

  • One thing that was great about mysisam was ALTER TABLE ORDER BY, so you could decide how to order your data…

  • Migael, in InnoDB’s architecture, it works out faster to insert in primary key order. Yes the B+Tree gets rebalanced either way and a B+tree is by definition sorted, but there’s just less work in this particular scenario. That’s why it’s desirable that a dump for InnoDB tables has the rows in PK order.

  • Miguel,

    Indeed Innodb clusters data by primary key. This clustering is however per page. For example in case we would use MyISAM we could get a “disk seek” and IO for each row in worse case scenario – in Innodb it does not happen. The data is always going to be clustered on per page basics and hence at least 16K worth of data would be read each time. The number of rows of course depends on row length and page fill factor – in this case we had few hundred rows per page.

    Now note in the worse case scenario no read-aheads will trigger and all IO will be done by single thread in 16K blocks. Considering 200 IOS/sec for legacy (non Flash) hard drive you will be looking at about 3MB/sec read speed which is 30-50 times slower than sequential read speed of the same drive.

  • Vadim
    in this case / with that special purpose in mind why do you compare the runtimes of different queries?
    SELECT count(distinct username) FROM tracker where TIME_ID >= ‘2009-07-20 00:00:00′ AND TIME_ID = ‘2009-07-20 00:00:00′ AND TIME_ID <= '2009-10-21 00:00:00'

    At first glance one would use identical queries and FORCE INDEX usage, wouldn't one?

  • Frank,

    I compared different queries as they use different indexes.

    Q1 uses primary key and Q2 uses block_id

    The idea behind that is if you run query that you think should be executed by PRIMARY KEY you will not
    use FORCE KEY (secondary_key) on it. By PRIMARY KEY is usually much faster for InnoDB.

  • I just ran ‘ALTER TABLE tracker ENGINE=InnoDB'; it takes forever with 20 million rows. I wonder if there’s a better way to defrag on production! :)

  • After several trials I am doing “defragmentation” of innodb 10M+ rows table by exporting to csv file and then loading to new table. Then I rename them, copy all changes which occured during export/import and delete old table. quite fast compared to other methods.

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code class="" title="" data-url=""> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong> <pre class="" title="" data-url=""> <span class="" title="" data-url="">