September 30, 2014

Benchmarks of new innodb_flush_neighbor_pages

In our recent release of Percona Server 5.5.19 we introduced a new value, cont, for innodb_flush_neighbor_pages.
With it we are trying to deal with the problem of InnoDB flushing.

There is also a second fix, for what we think is a bug in InnoDB: it blocks queries when blocking is not needed (I will refer to it as the “sync fix”). In this post, however, I will focus on innodb_flush_neighbor_pages.

By default InnoDB flushes so-called neighbor pages, which are not really neighbors.
Say we want to flush page P. InnoDB looks at an area of 128 pages around page P and flushes all the pages in that area that are dirty. To illustrate, say we have an area of memory like this: ...D...D...D....P....D....D...D....D where each dot is a page that does not need flushing, each “D” is a dirty page that InnoDB will flush, and P is our page.
As a result, instead of performing 1 random write, InnoDB will perform 8 random writes (P plus the 7 dirty pages around it).
This is quite far from the original intention of flushing as many pages as possible in a single sequential write.

So we added a new method, innodb_flush_neighbor_pages=cont. With it, only truly sequential writes are performed: only the dirty pages contiguous with P are flushed.
That is, in the case ...D...D...D..DDDPD....D....D...D....D only the following pages will be flushed:
...D...D...D..FFFFF....D....D...D....D (marked as “F”)
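The two diagrams above can be replayed with a small Python sketch. This is only a toy model, not the InnoDB code: the function names (flush_area, flush_cont, render) are made up for illustration, and the whole string is assumed to be one 128-page area.

```python
# Toy model of the two flushing modes. Each character is one page:
# '.' = clean page, 'D' = dirty page, 'P' = the page we want to flush.
# Assumption: the whole string stands for one flush area around P.

def flush_area(layout):
    """Pages written in 'area' mode: P plus every dirty page in the area."""
    return [i for i, page in enumerate(layout) if page in "DP"]

def flush_cont(layout):
    """Pages written in 'cont' mode: P plus the dirty pages contiguous with it."""
    p = layout.index("P")
    lo = p
    while lo > 0 and layout[lo - 1] == "D":
        lo -= 1
    hi = p
    while hi < len(layout) - 1 and layout[hi + 1] == "D":
        hi += 1
    return list(range(lo, hi + 1))

def render(layout, flushed):
    """Redraw the layout with every flushed page marked as 'F'."""
    marked = set(flushed)
    return "".join("F" if i in marked else page for i, page in enumerate(layout))

scattered = "...D...D...D....P....D....D...D....D"
print(len(flush_area(scattered)))   # 8 separate writes
print(len(flush_cont(scattered)))   # 1 write, just P

clustered = "...D...D...D..DDDPD....D....D...D....D"
print(render(clustered, flush_cont(clustered)))
# ...D...D...D..FFFFF....D....D...D....D
```

In the model, “cont” always produces one unbroken run of pages, which is what makes a single sequential write possible.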

Besides “cont”, in Percona Server 5.5.19 innodb_flush_neighbor_pages also accepts the values “area” (the default) and “none” (recommended for SSD).

What kind of effect does it have? Let’s run some benchmarks.

We repeated the same benchmark I ran in Disaster MySQL 5.5 flushing, but this time we used two servers: a Cisco UCS C250 and an HP ProLiant DL380 G6.

First, the results from the HP ProLiant.

Throughput graph:

Response time graph (the y axis has a logarithmic scale):

As you can see, with “cont” we are able to get a stable line. And even with the default innodb_flush_neighbor_pages, Percona Server has smaller dips than MySQL.

To show the effect of the “sync fix”, let’s compare Percona Server 5.5.18 (without the fix) and 5.5.19 (with the fix).

You can see that the fix keeps queries running in cases where before there was a “hard” stop and no transactions were processed.

The previous result may give you the impression that “cont” guarantees a stable line, but unfortunately this is not always the case.

Here are the results (throughput and response time) from the Cisco UCS C250 server:

As you can see, on this server there are longer and deeper periods when MySQL is stuck in flushing. In such cases innodb_flush_neighbor_pages=cont only relieves the problem; it does not solve it completely.
That, I believe, is still better than a complete stop for a significant amount of time.

The raw results, scripts, and various CPU/IO metrics are available from our Benchmarks Launchpad.

About Vadim Tkachenko

Vadim leads Percona's development group, which produces Percona Cloud Tools, Percona Server, Percona XtraDB Cluster and Percona XtraBackup. He is an expert in solid-state storage, and has helped many hardware and software providers succeed in the MySQL market.

Comments

  1. Olivier Doucet says:

    Hi Vadim,
    Interesting tests as always. Just to share my own benchmarks: this problem can also be seen when storage is fully in memory (using /dev/shm on a Linux server with 128GB of memory). So stalls are definitely not due to the storage itself, but to the InnoDB engine. To my mind, this is proof that something can be done about this problem :)
    What was the CPU usage you observed? Is it the same when stalls happen?

  2. Vadim,

    Great results! I assume this means the change fixes the problem for some workloads but does not fix it completely (though it makes it better) for others. Indeed, it is possible to create a case where the server cannot keep up with flushing dirty pages, because they can be dirtied at a faster rate than they are flushed. The only way to keep performance uniform would be to throttle the rate at which pages can be made dirty to match the speed at which the disk can write.

    You also call the “D….D….D…D” flushes random writes, which is not quite the case: the writes are not completely random. They are likely to be physically close, so no disk seek is likely to be required (for a single drive). Yet it is still a far cry from sequential write performance. From what I tested on conventional drives, performing 4 such writes with holes between them seems to make the hard drive do only one write per rotation, which is very expensive.

  3. Vadim: So it is the same benchmark and two different servers? Could you then summarize what is different in the 2 servers that might cause this?

  4. Henrik,

    Yes, this is the same benchmark for two different servers.

    The difference is CPU power and the number of CPU cores;
    you can see that in one case it is 3500 tps, in the other over 6000 tps.

    There are also different RAID controllers (LSI MegaRAID vs HP SmartArray); I am not sure how much
    that contributes to performance.

  5. Ok, so you have fixed the problem for the HP gear, but insert a server that is twice as fast (let’s assume disk performance is roughly the same) and the same problem comes back. Makes sense.

    So it’s a battle you can’t win without giving up on disks… or at least having a really powerful disk subsystem.

  6. Henrik,

    Yes, with the current flushing algorithm it is a never-ending battle.
    I am sure that even if we have SSD storage and a lot of memory, say ~2TB,
    we may see such dips in that configuration. Such a configuration is, however, far in the future for most MySQL deployments,
    so I hope by that time we will have a decent flushing algorithm in InnoDB.

  7. Ryan S says:

    Mr Tkachenko,

    This might be of interest
    http://blogs.innodb.com/wp/2012/04/new-flushing-algorithm-in-innodb/

    Info on O_DSYNC being faster than O_DIRECT
    http://dba.stackexchange.com/questions/1568/clarification-on-mysql-innodb-flush-method-variable/1575#1575

    Even though MySQL 5.6 is still a development release, I went here to check out the improvements in MySQL 5.6
    http://dev.mysql.com/tech-resources/articles/whats-new-in-mysql-5.6.html

    Do these improvements sound interesting for the page flushing and I/O problems?

    Explicit Partition Selection, Split Kernel Mutex, Multi-Threaded Purge, Separate Flush Thread, Pruning the InnoDB Table Cache

    This article was about scaling inserts, so I don’t know if it will help for selects and updates
    http://mysqlopt.blogspot.com/2012/06/mysql-how-to-scale-inserts.html

    [Quote]
    “However, you can reach better through put (inserts per second) with less threads. So after dropping number of threads on both clusters by 50% initially – taking it to 20-20 sessions. The problem almost disappeared and when we further reduced number of threads to 10-10 sessions, the problem disappeared!”

    “Also writing the redo logs, binary logs, data files in different physical disks is a good practice with a bigger gain than server configuration” [End Quote]

    This is another article on log file sizes and stalls/lockups. Even though the lock issue might be solved in this case, I still
    think the idea can be used to solve other problems with flushing

    http://www.mysqlperformanceblog.com/2012/05/24/binary-log-file-size-matters/

    [Quote]
    “Here’s what we did: we have reduced the size of binary log file from default 1GB (some systems have it set to 100MB in their default my.cnf) down to 50MB and we never saw this problem ever again. Now the files were removed much more often, they were smaller and didn’t take that long to remove.”

    “Note that on ext4 and xfs you should not have this problem as they would remove such files much faster”
    [End Quote]

    innodb_flush_log_at_trx_commit = 0 (this seems to help a lot with write performance)

    The flush problem looks like a chasing-our-tails problem.

    Use more eager writing in parallel instead of lazy writing to disk, set a lower dirty ratio at which to start flushing,
    and terminate worker flushing threads at a faster rate.

    There should be better ways of allowing reads and writes without locking, i.e. a special algorithm.

    I think something like the Deadline scheduler would help, because it imposes a time limit on I/O requests, so flushing would not have to wait on stalled reads/writes.
    I believe the Deadline scheduler is not the default scheduler on Linux, but it was recommended for ext4.

    If they made an SSD as fast as RAM, or if one were to RAID 10 a bunch of SSDs, could we put InnoDB straight on SSD without using any RAM, and would that remove this problem?

    Including or improving some sort of cache on disk rather than in RAM for MySQL, like Varnish for an HTTP web server, should speed up MySQL since the data is already technically on the disk.

    Also, by the way, does anyone know when Percona 5.6 is going to be released, or whether I should switch to MySQL 5.6 for the flushing and other benefits?

  8. Ryan S says:

    I forgot to add this

    The fractal tree idea sounds like it would help increase throughput.

    Use an LSM tree for inserts/merges and a B-tree for search queries; this would be a hybrid setup
    with a superindex between the two to balance cheap inserts in the LSM tree against B-tree searches.

    The block sizes can be increased, which would make fragment cleanup on the block easier
    or prevent fragmentation from happening at all.

  9. Alfie John says:

    Was innodb_flush_neighbor_pages removed from 5.6? 5.6.12-60.3 says it’s an unknown variable.

  10. Hrvoje Matijakovic says:

    Hi Alfie,

    Yes, that variable has been removed. Improved InnoDB I/O Scalability patches have been replaced by improvements and changes in MySQL 5.6, although Percona may make improvements in the future.

    This variable has been replaced by the upstream variable innodb_flush_neighbors (http://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_flush_neighbors).
