November 20, 2014

Adaptive flushing in MySQL 5.6 – cont

This continues my previous experiments on adaptive flushing in MySQL 5.6.6. This time I am running Ubuntu 12.04, which seems to provide better throughput than the previous system (CentOS 6.3); it also changes the profile of the results.

As before, I run tpcc-mysql with 2500 warehouses against MySQL 5.6.6 with innodb_buffer_pool_size=150GB, and this time I vary innodb_buffer_pool_instances, as was advised in the comments to the previous post. I also tried varying innodb_flushing_avg_loops, but it does not affect the results significantly.
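For reference, the relevant part of the server configuration would look roughly like the fragment below. Only the buffer pool size, the ~8GB of redo logs mentioned further down, and the two variables I varied are actual settings from this run; the split into two log files and everything else in the fragment is just a plausible placeholder.

```ini
# my.cnf fragment (a sketch; values not stated in the post are placeholders)
[mysqld]
innodb_buffer_pool_size        = 150G
innodb_buffer_pool_instances   = 1      # varied; 1 gave the best and most stable result
innodb_log_file_size           = 4G     # assuming 2 x 4G files = the ~8GB of redo discussed below
innodb_log_files_in_group      = 2
# innodb_flushing_avg_loops    = 30     # varying this did not change the results much
```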

So, let's look at throughput with 10 sec averages.

Obviously with innodb_buffer_pool_instances=1 the result is better and more stable.

In fact, if we take 60 sec averages, the picture is as follows:

At this point we could wrap up and be satisfied: we have an almost stable line for transactions per 60 sec, and we could claim that the adaptive flushing algorithm in MySQL 5.6.6 works as expected.

The devil, as always, is in the details, and these details become visible if we go to 1 sec resolution.

This graph is for 1 second averages:

Throughput per 1 sec is scattered across the whole range of values, including 0 throughput for 1-4 seconds at a time. What we observe here are "micro-stalls", which you cannot see when the measurement is per 10 sec.
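If you want to capture this kind of 1-second data yourself, a minimal sketch is below: it polls the cumulative Com_commit counter once per second and prints the delta. This is only an illustration, the graphs above come from tpcc-mysql's own reporting, and the mysql client options are placeholders for your setup.

```python
#!/usr/bin/env python
# Minimal 1-second throughput sampler (a sketch, not the harness used for the graphs).
# It shells out to the mysql client; user/host options below are placeholders.
import subprocess
import time

MYSQL = ["mysql", "-uroot", "-N", "-B", "-e"]  # adjust credentials/socket as needed

def commits():
    """Read the cumulative Com_commit counter from SHOW GLOBAL STATUS."""
    out = subprocess.check_output(MYSQL + ["SHOW GLOBAL STATUS LIKE 'Com_commit'"])
    return int(out.decode().split()[1])

prev = commits()
while True:
    time.sleep(1)
    cur = commits()
    # Per-second delta; several consecutive values near 0 are the "micro-stalls".
    print("%s  commits/sec=%d" % (time.strftime("%H:%M:%S"), cur - prev))
    prev = cur
```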

To see it better, let's zoom in to a 200 sec interval:

In the comments to the previous post I was asked for more details.
Here is the graph for checkpoint age:

The checkpoint age reaches its maximal value (~75% of the 8GB of redo logs) and stays at that level. As I understand it, the checkpoint age balances on this limit the whole time, and that is why transactions get stalled.
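If you want to watch this on your own server, the checkpoint age can be computed as the difference between "Log sequence number" and "Last checkpoint at" in SHOW ENGINE INNODB STATUS. The sketch below is just an illustration: the 8GB redo total and the ~75% mark mirror this particular setup, and the mysql credentials are placeholders.

```python
#!/usr/bin/env python
# Sketch: track InnoDB checkpoint age relative to the total redo log size.
# REDO_TOTAL and the mysql credentials are placeholders for this setup.
import re
import subprocess
import time

REDO_TOTAL = 8 * 1024 ** 3            # innodb_log_file_size * innodb_log_files_in_group
FLAT_LINE = int(REDO_TOTAL * 0.75)    # roughly where the age stops growing in the graph above

def checkpoint_age():
    out = subprocess.check_output(
        ["mysql", "-uroot", "-e", "SHOW ENGINE INNODB STATUS\\G"]).decode()
    lsn = int(re.search(r"Log sequence number\s+(\d+)", out).group(1))
    ckpt = int(re.search(r"Last checkpoint at\s+(\d+)", out).group(1))
    return lsn - ckpt

while True:
    age = checkpoint_age()
    print("checkpoint age = %d bytes (%.1f%% of redo, flat-line ~%d)"
          % (age, 100.0 * age / REDO_TOTAL, FLAT_LINE))
    time.sleep(10)
```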

More metrics and raw results are available from Benchmark Launchpad.

I do not think the current flushing algorithm handles this workload well, and I will continue this research, as I am interested in finding a solution
for cases where we use boxes with a lot of memory (100GB+). Percona Server 5.5 also does not handle it particularly well, so there is further work to be done.

About Vadim Tkachenko

Vadim leads Percona's development group, which produces Percona Cloud Tools, Percona Server, Percona XtraDB Cluster and Percona XtraBackup. He is an expert in solid-state storage, and has helped many hardware and software providers succeed in the MySQL market.

Comments

  1. Vadim,

Very interesting data. I wonder what the difference is between the two graphs for 1 sec averages. Did you just zoom in and connect the dots in the graph so you can clearly see the dips?

I think it is very good of you to start looking at 1 sec averages. I think we really need to go even smaller than that – for modern systems even a 100ms stall can be quite bad. I remember Facebook mentioning in one of their talks that even 10ms micro-stalls can cause problems for them, with a lot of queries backing up. This makes sense – if you have, say, 100,000 queries/sec coming from 1000 connections or so, you might have only a few active at any given time, while a 10ms stall gets you a 1000-query "backlog", which can push MySQL to a level of concurrency it does not enjoy.

  2. Peter,

That's correct. I did zoom in and connect the dots, so we can see the pattern: several seconds of top results and then a drop.

  3. James Day says:

Vadim, interesting. Getting to 75% isn't really supposed to happen. Adaptive flushing is supposed to vary up to innodb_max_io_capacity (which is being renamed to innodb_io_capacity_max) to prevent it from getting there, so I wonder whether a higher value for that is needed to avoid getting to async flushing with this configuration. innodb_flushing_avg_loops does limit how fast it can adjust, so you might find that some more tweaking of that helps; can't tell from here.

Looking at the dips in the short-scale averages, what's the variance like during those times? Are many transactions getting through quickly but some picking up a delay, or is it an I/O burst stalling everything that hits the storage, or even those that don't?

It might require something like deciding what work to do relatively infrequently but spacing it out more uniformly than the smoothing algorithm does – as I understand it, that smooths the calculation of how much work to do, but not when that work is done.

While this is usually a capacity planning failure or an overload if it happens in production, we do want to handle it gracefully. Glad to see it's looking much smoother on the larger timescales than 5.5 has been; at least the longish drops are gone for this combination in 5.6. So maybe a celebration of a big improvement, while noting there is more work to do, is on the cards for 5.6.

Peter, I've been suggesting a hundred microseconds as a useful maximum target for stalling for some mutex-related purposes – things like scanning a buffer pool during the dump part of dump/reload. There is still lots of older code around, but we know about the micro-stall issue and how higher throughput makes its effect worse. It'll be a work in progress for a while, but just like the slow-response-to-KILL work we did over a few releases, we're interested in bug reports of stalling that takes long enough to cause trouble.

    Views are my own, for an official Oracle view consult a PR person.

    James Day, MySQL Senior Principal Support Engineer, Oracle.

  4. Dimitri says:

    Vadim,

thanks for sharing your results – it's great to see more varied tests around.

Regarding the adaptive flushing – I think the algorithm itself has no problem, while I'm pretty sure there are some issues in the flushing itself ;-)

I'm also investigating this topic in depth, but I'm missing the time to blog about it enough – I'm still writing an article which already covers the problems you've observed here. Hope to publish it this week (I started writing it back in July ;-))

    Rgds,
    -Dimitri

  5. Dimitri,

Thanks, so is there a quick solution I can try in the meantime?

  6. James,

Please note this is all on very fast storage, a Fusion-io ioDrive.
I am sure that when we come to regular RAID on spinning disks it will be worse. There are more experiments to run.

  7. James Day says:

Vadim, yes, noted. I'd be surprised if you can't get spinning disks to have more work created by the foreground than they can flush, which would end up leading to sync flushing eventually. In production, with a typically hilly load, I'd look at increasing the log file size if the dirty page percentage wasn't excessive, looking to survive the peak time and catch up later. Maybe also buy some margin by doing extra flushing before peak time. Not always going to be viable, though. And it doesn't make for easy benchmarking the way that's usually done, with the send-as-much-load-as-possible-until-it-breaks approach – that end of peak time never arrives. :)

Knowing that production DBAs try to solve capacity problems, I tend to end up more interested in finding how close we can get to fully utilising the most bottlenecked part of the system. So for disks and flushing it'd be how close we can get to 100% I/O utilisation without really undesirable response time patterns.

What I really want to see is more realistic workloads used in tests, so we can see how good flushing is at achieving its twin goals of decoupling flushing from foreground load peaks and minimising total I/O. The sort of hilly load that many systems tend to have, though not all. Pushing it until it breaks has value, but if I'm managing a production server it's a failure to meet capacity demand if the load ends up flat instead of hilly, and I don't like failure and the loss of money that comes with it. So I'm interested in tools and techniques to raise the maximum capability for the busiest fifteen minutes of the day, or the 5-60 seconds after being featured on the TV news. That'll buy better hardware utilisation and lower costs, particularly for big farms that have peaky load. But that's not an area where I see much benchmarking interest today.

You might find the discussion at http://blog.wl0.org/2012/04/initial-reactions-to-mysql-5-6/ of interest. Not too many people today are using a pair of 16GB log files with 192GB of RAM, but I do expect to see those with load peaks doing that sort of thing to push the flushing away from their peak load time.

    Views are my own. For an official Oracle opinion seek out a PR person.

    James Day, MySQL Senior Principal Support Engineer, Oracle.

  8. James,

I would note that log file size impacts recovery time, so you can't just advise larger and larger log files to improve stability. I believe the right algorithm should show stable performance with _any_ log file size, while performance should be better with larger ones.

  9. James Day says:

Yes, there's a crash recovery time impact. It's a lot faster in 5.5 and later than before, but it can still be an issue. We can probably make it faster, and the initial indications about the use of larger log file sizes suggest to me that we might think that work is worth doing. No guarantees; just the way I'm thinking at the moment, and it's not my decision.

    I agree that stable, but with a foreground slowed down to what the disks can handle, is a good objective for any log file size when the disks just can’t keep up with the level of work that the foreground can submit. Not the highest priority one – that’s things like reduced writes to increase SSD lifetimes or free up I/O operations on spinning disks for other things – but we should handle the case gracefully and better than we do now. That’s just nice handling of exceptional conditions, always a good thing.

Yes, larger logs should allow more deferring and better performance at peak times, when the disks can't handle the peak load but can handle it over a whole day, or whatever shorter period makes sense given recovery time constraints and a sensible amount of buffer pool used for dirty pages.

    James Day, MySQL Senior Principal Support Engineer, Oracle

  10. Dimitri says:

@Peter: I agree that an adaptive flushing solution should be able to keep a workload stable with any redo log size – but the timeframe for 5.6 features was too short here, so we've planned to get that working in 5.7. In 5.6 it should provide you a solution to keep performance stable if you're not hitting any HW limits (writes are fast enough on your storage, and you now have room with the REDO log size to manage it according to your workload).

@Vadim: the more I look at the stats you've provided about your test, the more questions I have – well, it's clear you're hitting sync flushing (and the adaptive code path is probably not even involved). On the other hand, during the "stable" period (once the I/O reads are nearly finished) you're getting only 6K write operations/sec, which is pretty small for what is going on on your server. If you want, we can look at this more in depth – I can prepare a script to collect the stats I'm interested in during your test, and then I'll be able to analyze them.

    Rgds,
    -Dimitri

  11. Dimitri,

I know that MySQL pushed only 6K ops/sec, while the storage can do much more. I think the problem is inside MySQL: it is able to push through only that many pages/sec.

    So in this workload, even with 5.6, we are not hitting HW limits using Fusion-io.

Sure, I can run the workload with your scripts.

  12. Vadim,

If I understand you correctly, you're saying that the larger the buffer pool, the less I/O MySQL seems to be able to drive to the I/O system, so some testing is due now not only with a large number of cores but also with large buffer pool sizes.

  13. Dimitri says:

    Vadim,

sorry for the delay – I've just posted instructions for the script that collects the stats I need:
    http://dimitrik.free.fr/blog/archives/2012/09/mysql-performance-collecting-stats-from-your-workload.html

    Rgds,
    -Dimitri

  14. vo says:

Interesting that nobody has commented on the difference between Ubuntu and CentOS from the earlier test.
It seems Ubuntu provides over 100% more transactions per second (both 95th percentile and average) than CentOS. Note the different Y scales in the graphs – "tx per 10 s / 60 s / 1 s".

    Any ideas on this large performance difference?
