November 28, 2014

Disaster: MySQL 5.5 Flushing

We raised topic of problems with flushing in InnoDB several times, some links:

InnoDB Flushing theory and solutions
MySQL 5.5.8 in search of stability

This was not often recurring problem so far, however in my recent experiments, I observe it in very simple sysbench workload on hardware which can be considered as typical nowadays.


Hardware: HP ProLiant DL380 G6, with 72GB of RAM and RAID10 on 8 disks.

I took sysbench multi-tables workload, with 20 tables, 10,000,000 rows each. Total database size ~58GB.
MySQL version: 5.5.16

Initial benchmark, which InnoDB configured for this hardware

innodb_flush_log_at_trx_commit = 2
innodb_flush_method = O_DIRECT
innodb_log_buffer_size = 16M
innodb_buffer_pool_size = 52G
innodb_log_file_size = 1900M
innodb_log_files_in_group = 2

In my benchmark I measure results each 10 sec, and when we graph it, it looks like:

Basically there two stages: MySQL handles traffic and MySQL stuck. As you see stalls are very severe.
There are 4 mins periods, when MySQL is not able to process query.

Why this happens: referring to previous InnoDB Flushing theory and solutions post, InnoDB during this time is “sync” flushing mode. It allows itself to go in this state,
as we have a lot of memory, and we can do changes in memory much faster than on disk. InnoDB is not able
to catch up with flushing changed data.

With widely adoption of MySQL 5.5 and servers with big memory configuration I expect we will see this problem more and more often on production systems.

How to diagnose it?
This is question I am asked, how do we know that stall we see is related to InnoDB flushing and “sync” state.

Vanilla MySQL does not provide much diagnostic there, but there couple numbers to look into.
If you take “SHOW ENGINE INNODB STATUS”, look into following sections:

Log flushed up to 135550126707
Last checkpoint at 132524978607
...
Pending writes: LRU 0, flush list 105, single page 0

First, Pending writes “flush list” > 0 says that InnoDB does perform a lot of flushing,
and if this number grow, that means InnoDB flushes more than your IO system can afford.
Second, you need to a little math. (“Log flushed up to” – “Last checkpoint at”) this is our checkpoint age. In our case it is 3025148100 or 2885M . Our summary log size is 3800M. InnoDB usually takes 75% mark as limit for “sync” stage ( 3800M * 0.75 = 2850M ). That is checkpoint age exceeds 75% sync mark, that is signal that InnoDB performs in “sync” flushing mode.
So math formula for this: if (“Log flushed up to” – “Last checkpoint at”) > “total log size” * 0.75 , then InnoDB is in “sync” mode.

What to do?
I whish I could say you should use Percona Server with XtraDB.
If we were using SSD as storage, then I would recommend it. Vanilla MySQL performs equally bad on SSD and
HDD, while for SSD in Percona Server we have “innodb_adaptive_flushing_method = keep_average”.

Unfortunately on spinning disks (as in this case), Percona Server may not show significant improvement. I am going to followup with results on Percona Server.

So first recommendation you may hear in this case from the Oracle/MySQL engineers is to decrease “innodb_max_dirty_pages_pct”, this way InnoDB will try to keep less dirty pages in buffer pool,
and hopefully it will spend less time in “sync” flushing.

Let’s see what we can get if we set

innodb_max_dirty_pages_pct=30

Although maximal throughput decreased and stall period is somewhat shorter, I cannot see this is helpful.

Second action what you may try, is to decrease innodb_log_file_size. How this may help ? InnoDB flushing may kick-in faster, and do not allow to have a lot of modified pages in buffer pool.
Let’s try it:

Well, stability and absolute value of throughput are far from perfect, but, at least, we do not have 4 min stalls.

Will it help if we decrease innodb_buffer_pool (effectively killing idea that more memory is better), the same way InnoDB will not be able to change a lot of data :).

The following results are with innodb_buffer_pool_size=39G. There I keep Y scale as first graph to show impact on performance by this action.

Finally we got somewhat stable result without stalls, but by loosing about 20x throughput.

These facts give me idea that existing InnoDB flushing algorithm is not suitable for modern hardware with a lot of memory. I hope there is work in progress for MySQL 5.6.

Scripts and raw results you can find there.


About Vadim Tkachenko

Vadim leads Percona's development group, which produces Percona Clould Tools, the Percona Server, Percona XraDB Cluster and Percona XtraBackup. He is an expert in solid-state storage, and has helped many hardware and software providers succeed in the MySQL market.

Comments

  1. tobi says:

    How nasty. It is hard to believe that a nicer flushing algorithm is so damn hard to implement. The effort would be well-spent.

  2. Rob Wultsch says:

    The normal way of setting up innodb for a long time has been to use flush method as o_direct. With RAM not allocated to the buffer pool, would it make sense to change the flush method?

  3. Mark Callaghan says:

    Is this the same problem as http://bugs.mysql.com/bug.php?id=59291

  4. Mark,

    It seems it is the same problem, however I do not think this is matter of bug fix.

    We did some research, and I believe it should be different algorithm than InnoDB uses right now in 5.5 and in 5.1-plugin.
    The current algorithm is not able to cover all cases.

    And I think it is possible to come up with really good algo and following should be taken into account:
    1. Algorithm should not depend on io_capacity. This way we cover such cases as RAID disk or battery failure, or periods when we take backups
    2. It is quite obvious that we can do changes in memory much faster than on disk, so algorithm mush provide some kind of throttling on changes in buffer_pool, and by this protecting flushing from saturation and avoiding stalls. That is it should insert delay into transactions.

  5. Kes says:

    Out of curiosity, what is the performance when you alter some of the InnoDB disk parameters. The default assumptions for innodb_io_capacity , for example, are based on a system with only one physical hdd running at 5.4k-7.2k rpm. The default for innodb_write_io_threads additionally seems out of place in a modern system which has several physical and virtual cores that can process threads. Personally, this seems like an issue where InnoDB is entirely too pessimistic about what the physical disks look like, and defaults to options that are good if you’re running a 5 year old desktop, but not for the hardware modern servers of any substantial size have.

  6. David Jiang says:

    Hi, Vadim

    I think this not about to InnoDB’s flushing algorithms, but the current limit size of redo log.
    In MySQL 5.6, the redo log can set to 2T, which can help InnoDB to reduce stuck situation with large buffer pool.

  7. David,

    Actually I think if you increase size of redo log – it will give only bigger period of stalls. I.e. instead of 4 min stalls, you will have 8 mins. Because the bigger log size is – the longer InnoDB is “being” lazy and postpone active flushing.

    I may try to run it with 5.6 to show it.

  8. The algorithm is one thing, but has anyone checked that this isn’t another one of those cases where InnoDB uses linked lists or some other data structures that simply become too slow to work with when the buffer pool and log size grow to modern sizes?

    Just to give a different perspective on how this could be solved, I tried to think how NDB behaves. Let’s compare:

    NDB has database in memory. InnoDB has buffer pool in memory.

    NDB writes REDO log. InnoDB does too.

    NDB has a background thread that periodically dumps contents of full in-memory database to disk, known as local checkpoint. (At a constant interval, or if the dump takes longer, as soon as previous dump is finished.) InnoDB does not do anything exactly like this for the buffer pool, evicting pages from cache is not the same thing.

    When NDB redo log file is full, it refuses to accept more transactions (returning error) until there is room again. When InnoDB log file is full, transactions get stalled until they can be completed. (This means NDB is hard real-time database and InnoDB is not. InnoDB is more newbie friendly because you don’t see the problem :-)

    When NDB redo log file is full, the fix is to wait until the next/ongoing local checkpoint is finished. At this point it is possible to discard the contents of the redo log file and start from the beginning again (I think it actually is a circular buffer). When InnoDB log file is full it has to start reading its contents from beginning, propagating changes back into buffer pool (ok, they already are in buffer pool) and flushing those buffer pool pages back to disk.

    The question is, if I have a 100G or 200G buffer pool and heavy write workload, would it be advantageous to just periodically dump the buffer pool to disk instead of trying to come up with a better algorithm for how to clean up the redo log? The dump of course has to be done in some kind of MVCC manner so that it is consistent, but other than that, writing 100G to disk is quite straightforward and doesn’t require elaborate algorithms.

    Note that in the NDB case, if I run out of redo log space, I can just increase the size of the redo log file. There is no penalty for doing so, because there is no need to go back and work through or “cleanup” stuff that is written into the redo log. The time it takes to dump contents of the in-memory database to disk is constant, or rather it depends only on size of in-memory database and things like throughput of your disk device. With InnoDB on the other hand, increasing log file size doesn’t really help, it just makes the problem worse.

  9. jordani says:

    It seems to me that too much data are written in sync mode at once. (How much in fact?)

    Is this number calculated as percentage of log? If this is true can you make patch to write some number of MB (as parameter) instead percentage of log?

    There was similar problem in linux – dirty_ratio / dirty_bytes.

  10. dengcheng he says:

    Hi, Vadim
    Excellent test. Thanks for sharing.

    If we set innodb_flush_log_at_trx_commit to 1, can we avoid this situation?

  11. Dimitri says:

    Hi Vadim,

    any chance to replay the same test:
    – without O_DIRECT ?
    – on MySQL 5.6 with redo logs of 64GB ?

    also, you did not mention how many concurrent sessions were used, as well is it some special Sysbench? (if so, where can I get it?) and probably exactly the command line you’ve used.. – it’ll be interesting to reproduce it and analyze the issue..

    Thank you for sharing!!!

    Rgds,
    -Dimitri

  12. inaam says:

    Vadim,

    Theoretically larger redo log size should help (the reality, though, is often different from theory). Lets assume that whatever flushing algorithm we are using takes n seconds to flush the entire buffer pool to disk. The workload, unconstrained by IO as the all data is cached in the memory, can generate a specific amount of redo in n seconds, say x GB. Note that time taken to generate x GB redo won’t be a function of buffer pool size as once all the pages are dirty the redo will still be generated without adding to dirty page count as we keep on revisiting the already dirty pages.

    Now lets assume we set the total size of redo to 2x GB. As the workload starts and touches each page in the buffer pool lets assume for the sake of simplicity that no flushing is happening. Once x GB redo is generated all pages will be dirty. In the next n seconds the workload will keep touching the dirty pages and the flushing algorithm will keep flushing the pages in the background. Most probably any page flushed by the page_cleaner will be redirtied by the workload but it’s oldest_modification LSN will be higher. After 2n seconds all changes belonging to the first n seconds must have been flushed to the disk and many newer changes must have also been flushed due to write combining. As such we’ll have x GB of reusable redo space available come the time of checkpoint.

    Makes sense?

  13. Dimitri,

    I will repeat with parameters you suggest.

    My parameters and tests are in scripts on Launchpad,
    https://code.launchpad.net/~vadim-tk/percona-benchmark-result/mysql-flushing-problem
    it was 32 concurrent sessions.

  14. inaam says:

    Henrik,

    A minor correction. You said:

    “When InnoDB log file is full it has to start reading its contents from beginning, propagating changes back into buffer pool (ok, they already are in buffer pool) and flushing those buffer pool pages back to disk.”

    We don’t need to read the contents of redo when doing a checkpoint. Redo log is flushed to disk at transaction commit and is only ever read during crash recovery. At checkpoint time we only flush the dirty pages to the disk.

  15. Henrik,

    We have looked into, I do not think there linked lists or other bubble sorts or other funny O(2^n) problems in this case, it is just speed problem – changes in memory are faster than changes on disks.

    You proposal to dump 100-200GB at once or something some simple than current implementation may work on SSD.
    On spinning disks I feel it will be much problems, it will still take long time to write 200GB of data.

  16. Inaam,

    You say
    ” .. After 2n seconds all changes belonging to the first n seconds must have been flushed to the disk and many newer changes must have also been flushed due to write combining … “,
    It may be not case, it totally depends on write distribution into buffer pool pages. Say we have so nasty workload, so it changes
    only 1 row from each page, but it touches many pages. So with this we can produce a lot of work for page_cleaner.

    And I feel that we will need *GIANT* log files, like 64GB+, if we have giant buffer_pool, to avoid sync state, which may be not healthy for recovery time.

  17. James Day says:

    I agree that we should be able to throttle foreground load, much as we eventually do now if innodb_max_purge_lag is set.

    The page cleaner thread and tweaks in 5.6, as well as the better monitoring that’s been added, may be useful, since it’ll say how the flushing is being done.

    I don’t see a problem having 64GB or more of log files for 64GB of buffer pool. My desktop box if it had 64GB of RAM would have that much for a hibernate file and swap file. Disk space is just a tool to use. Agree that recovery time could become an issue, though we made it perhaps thirty times as fast in 5.1 plugin and later without parallel recovery, so there’s still room to make it faster.

    I don’t want us to be limiting the throughput for the busiest two or three hours of the day if bigger log files can defer some work and allow the server to catch up at a quieter time, provided it doesn’t reach the point where recovery time becomes unacceptable. A trade of disk space for higher peak throughput is a good one to make, within sensible limits that might vary for different customers and applications. The transition from peak to sustainable also needs to be a polite one.

    James Day, MySQL Senior Principal Support Engineer, Oracle UK

  18. Dimitri says:

    Vadim,

    from your “iostat” files looks like your storage is doing easily 4000+ writes/sec.. – any reason you’ve used io_capacity=600 instead of 5000 for ex.?..

    alson may you, please, point me exactly to the sysbench distro you’ve used during the test? (sorry, did not find anything from the link you sent)..

    Thank you!

    Rgds,
    -Dimitri

  19. Dimitri,

    I used different innodb_io_capacity settings, included 4000.
    sysbench you need to create from source code lp:sysbench.

  20. I do not think that running without O_DIRECT is a solution. I’m not sure if anyone is suggesting that, but if so, then it’s not a realistic fix. InnoDB needs to either be able to manage its own memory, or let the operating system do it as Postgres does, but double buffering is way too expensive to run in production. If the problem doesn’t manifest without O_DIRECT then that could be useful for understanding the nature of the problem, though.

  21. To update on provided suggestions:

    – Tried without O_DIRECT, it does not show significant difference.
    – I am trying 5.6.3 with innodb_log_file_size with 64GB total size.
    I gave up on log creating procedure as after 2 hours, InnoDB was on about ~50GB mark and progress seemed slowing down with each GB…
    Now I am trying with 48GB log size.

    The problem with such log sizes is that logs do not respect O_DIRECT, and takes OS cache, which in our
    case takes 48GB, which affects general performance.

  22. Dimitri says:

    @Baron: I did not say that not using O_DIRECT is a kind of tuning, it’s just to better understand the problem is O_DIRECT matters here..

    @Vadim: it’s good to know creating the huge REDO log file may take so long time.. – will see what is wrong here. And the overall issue is quite interesting.. – thanks to point on it! :-)

    BTW, you’re site is not sending emails anymore to notify about new comments – I’d say it’s quite painful to keep in mind to come back again and check if there any answers.. May you fix it?

    Rgds,
    -Dimitri

  23. Mark Callaghan says:

    Why, why, why when doing a flush_list flush of the buffer pool to write all dirty pages that have been dirty for too long must InnoDB hold the buffer pool mutex for the entire operation? It can do hundreds of thousands of page writes in that time. In addition to figuring out how to balance the creation of dirty pages with the writing of them, there are two simple changes:

    1) release the buffer pool mutex now and then when buf_flush_batch(BUF_FLUSH_LIST, …) has been called with lsn_limit==IB_ULONLONG_MAX.
    2) don’t write dirty neighbor pages in this case (or don’t write them once too many pages have been written) as that can multiply the number of writes by a factor of 64 — all while the buffer pool mutex is locked

  24. Mark,

    We are testing new mode
    innodb_flush_neighbor_pages=”seq”
    which will flush only real sequential neighbors.
    It shows promising results.

  25. inaam: ok, good point. But conceptually what I was referring to is just the flushing of dirty pages, whether they are found by reading the redo-log or some other data structure, it needs to be done, and the benefit of doing so is that you can discard the older end of the redo log. So actually, what innodb does is already close to what I proposed then, but for some reason either the data structure keeping track of dirty pages, or the algorithm doing the flushing, become more complex.

    Vadim: Note that even if flushing all of the buffer pool sounds scary, you can decide the duration of that operation yourself (ie the bandwidth it takes towards the disk). You only need to make your redo log file(s) big enough that they can contain redo-operations for the whole time of such a full dump. But as pointed out, increasing the redo log size has no cost other than cost of disk space.

    Also, you could naturally skip non-dirty pages, should there be any.

    I’m of course talking here of a code base I have never even looked at, but just trying to present some simple thinking…

  26. inaam says:

    Mark,

    We do NOT hold buf_pool->mutex over the IO operation. For each dirty page that we flush, we do release the buf_pool->mutex. It is held continuously only when we are iterating over the list.

  27. Mark Callaghan says:

    Inaam,

    I am pretty sure we discussed this before. Domas has a support group for people who tell me things that I then forget. But this stall is really bad regardless. And suggestion #2 above (don’t flush dirty neighbor pages when too many pages must be flushed) is probably worth doing.

    Another thing we discussed is what buf_flush_batch in MySQL 5.1 or 5.5 does when it finds a dirty page. I don’t remember the details of that discussion, but AFAIK the code in 5.1 is prone to holding the buffer pool mutex for too long. Searches for dirty pages start at the end of the flush_list or LRU. After finding a dirty page and scheduling it and its neighbors to be written (and briefly releasing/relocking the buffer pool mutex) the search for dirty pages starts again from the end of the LRU or flush_list. Each time this search restarts it has to scan more of the list to find the next dirty page.

    Comments in buf_flush_batch match my understanding:

    /* Note that after finding a single flushable page, we try to
    flush also all its neighbors, and after that start from the
    END of the LRU list or flush list again: the list may change
    during the flushing and we cannot safely preserve within this
    function a pointer to a block in the list! */

    So does the code …

    if (ready) {
    space = buf_page_get_space(bpage);
    offset = buf_page_get_page_no(bpage);

    buf_pool_mutex_exit();

    /* Try to flush also all the neighbors */
    page_count += buf_flush_try_neighbors(
    space, offset, flush_type);

    buf_pool_mutex_enter();
    goto flush_next;

Speak Your Mind

*