EmergencyEMERGENCY? Get 24/7 Help Now!

How small changes impact complex systems – MySQL example

 | December 31, 2014 |  Posted In: Benchmarks, MySQL, Percona Cloud Tools, Percona Server for MySQL


If you’ve been studying complex systems you know what minor changes might cause consequences of much greater proportions, sometimes causing some effects that are not easily explained at first. I recently ran across a great illustration of such behavior while doing MySQL benchmarks which I thought would be interesting to share.

I’m using a very simple benchmark – Sysbench 0.5 on Percona Server 5.6.21-70.1 just running update queries:

Some people frown upon such benchmarks due to their triviality and being irrelevant to workloads. I like them because they often allow you to study already complex system behavior in a much more controlled environment than “real” workloads – and so you can understand internal dynamics better.

For this test I set innodb_purge_threads=8, innodb_max_purge_lag=1000000, innodb_max_purge_lag_delay=5000 as otherwise chances are that the history will just be growing forever not allowing the system to reach the steady state. I wrote about InnoDB history problems in detail recently in a post titled “InnoDB transaction history often hides dangerous ‘debt’.

The only change I was looking for is changing innodb_flush_log_at_trx_commit from 1 to 2. I then went back to 1 a few hours later, after the system had reached the steady state. What I would normally expect in such a case is a performance increase because there is less work to do on commit and that is pretty much it. Looking at the graphs however I can see much more changes in the system. Let me go through them and explain the ones which I can:

The throughput graph is probably the most interesting here. We can see that indeed the throughput has substantially improved when set innodb_flush_log_at_trx_commit=2 and then went back to the old one after it was set back to 1. We see however in 2 some interesting effects here – a spike of a few minutes and later a dip below the original performance level for a few hours, until finally recovery back to the baseline. I’ll explain this with different graphs.

CPU usage is interesting, too. We can see there is user CPU increase correlating to the throughput increase but also a drastic IOWait increase. Why would it increase when we’re actually making workload less IO bound by not flushing log as much? The reason is because there are more threads now that can perform IO concurrently, hence increasing what is reporting IO wait and deceasing idle time.

InnoDB writing much more data is expected and it very well matches throughput. Not much of anything unexpected.

As we see setting innodb_flush_log_at_trx_commit=2 causes the amount of dirty pages in the buffer pool to go down significantly. It recovers when it is put back. The reason I suspect in this case is the undo space – as we will see in later graph history length spikes quickly – obviously requiring some space in the buffer pool.

This is perhaps one of the most interesting graphs here. You can see for some reason when log flush is enabled that purge threads are able to keep up. But if you disable it they cannot, which makes history grow very quickly until the max allowed amount. This is by the way the time we see the spike in the throughput as there is no throttling being done to keep history in check. This also explains why after I set innodb_flush_log_at_trx_commit back to 1 that performance first drops well below original levels – InnoDB needs to go through and purge the history, which takes resources. Another interesting thing you can notice is how uneven history cleanup is. Specifically you can note some 75% percent of the history has been cleaned up almost instantly while initial 25% took a lot of time to work out. I do not have an exact explanation here yet I believe it has to do with the fact initially history purging requires a significant amount of IO while at a certain point in time a purge happens completely in the buffer pool and hence being much faster.

Why would purge threads have a hard time keeping up when log flush is disabled? I suspect there are 2 reasons. First, the system can provide much more throughput in such configuration (not keeping purging into account) while purge subsystem is limited and can’t keep up with such a high load. The other reason is – eliminating serialization on the log flush exposes other parts of InnoDB to contention which further suppresses purge system progress.

This is another relevant graph to show the throttling done because purging can’t keep up. As you see we got about 2.5ms delay injected for updates to give purging some time to keep up. It is good to watch it to understand if the max delay you specified is enough to prevent purge lag while still avoid having the system grind to a halt. Had I set innodb_max_purge_lag_delay to 1000 it would not be enough to prevent history length growing to infinity, having I set it to 100000 the system could grind to a halt in case there is some long-running transaction causing history length to grow to about 1 million.

Lets look at Physical IO – it is also quite interesting. As you can see in the first and last parts of the graph there are more pages flushed than pages read, while when we disable log flushing we get much more reads than writes which only stops after history purging catch up is complete. I find this very interesting – this means that working with even modest history – 1M in this case causes a lot of IO – I assume because due to access pattern to the undo space it is constantly being washed out from the buffer pool. This graph is a very good illustration why large history might be a death spiral, as once it is there the system needs to do much more work.

This IO utilization graph might be a bit hard to understand but what it generally shows is that during the time the hard drive has been 100% utilized, there was still at least one outstanding IO to serve. However the concurrency was different with more average IO requests outstanding when we had log flushing disabled. This makes sense. We can also see it was highest when we enabled log flushing back but there was history to go through and catch up.

Changes to contention are very curious – we can see there is a lot more spin rounds and specifically shared lock spin rounds when we have log flush disabled. This is what I believe comes from less serialization on the log flushing causes a lot more internal contention.

It might not be very easy to see but this graph shows that innodb_row_lock_times spikes 1000x when log flushing is disabled. Why would that be? I suspect the reason is injected delay for purging. I also suspect the design of this throttling so it only is done when an actual change is attempted to be made, this is where update statement already holds some locks, which causes more row lock contention for other statements.

InnoDB has something called Server Activity count. As you can see from this graph it is not really a good indicator – for some reason it shows a lower value when the server was doing much more work. I do not understand other than this variable is incremented enough to explain it but find it curious.

LRU flushing also is drastically different between enabled and disabled log flushing. We get a lot more calls to search free pages and a lot more scans. I’m not sure why exactly this is happening but I think this has something to do with undo space caching which is constantly washed out from buffer pool which increased IO seems to indicate.

Finally lets look at perhaps the most counter-intuitive result – the amount of pending pwrites to the log file – we allowed to cache more log files in the OS cache but the amount of waits on writes to the log files actually increased? I’m not very sure why this is happening – perhaps there is a lot of IO pressure on OS so it does not delay writes to the log files for long time?

This is the same graph showing log fsyncs and log pwrites together. We can see when we set innodb_flush_log_at_trx_commit=2 that we see a lot more waits on pwrites, while less waits on fsync (which now should only be happening once per second or so). In general these two graphs tell me in some cases assuming log writes with innodb_flush_log_at_trx_commit=2 will be “free” because the operating system will instantly acknowledge them and put in the cache is not always correct.

Summary: The main point I’m looking to illustrate here is what changes you’re doing to the MySQL configuration (same can be said about your application OS configuration, hardware, etc.) can cause much larger changes to system behavior than you would guess. If you have the time it is good to validate whenever there are any other changes to system behavior beyond what you would think. If not at least keep an open mind and assume what any change, even a small one, can cause dramatic system behavior changes.

If you want to explore similar graphs check out Percona Cloud Tools – it’s easy to get going and allows you to build graphs like these absolutely free and perform a lot of in depth analyses. Now in free beta, Percona Cloud Tools is a hosted service providing access to query performance insights for all MySQL uses.

Peter Zaitsev

Peter managed the High Performance Group within MySQL until 2006, when he founded Percona. Peter has a Master's Degree in Computer Science and is an expert in database kernels, computer hardware, and application scaling.


  • A possible explanation for the increase of innodb_row_lock_times is that there exist secondary indexes in the table. The check for implicit record locks on secondary indexes is computationally expensive, because it often involves a lookup in the clustered index. It is analogous to how MVCC works in the secondary index. I blogged about it some years ago; search for DB_TRX_ID in https://blogs.oracle.com/mysqlinnodb/entry/mysql_5_5_innodb_change

    This might be helped by introducing DB_TRX_ID and DB_ROLL_PTR fields in secondary index records.

  • https://pbs.twimg.com/media/BwwxR0-IIAA8fso.png:large

    can see the difference on a slave (with study on thread replication)

  • Marko,
    Thanks for comment. I think it is fun to do some math here. We have about 30K worth of milliseconds or 30 seconds for row locks for every second. There is throughput of about 3QPS of updates at this time, which means we have approximately 10ms lock wait per update… thats a lot. The purge sleep delay is only some 2.5ms so we wither have in average more than one wait for purge delay wait per update query (which is possible with hotspot) or there is also other impact.

  • Peter, I forgot to mention that the MVCC and checks for implicit record locks on secondary indexes get slow when there are lots of secondary index leaf pages waiting for records to be purged. We actually got a bug report for this:
    We only addressed a usability aspect of the bug: the time-consuming code was not interruptible by KILL QUERY.

    I think that the logic for purge sleeps (innodb_max_purge_lag) is a hack, and the real solution is making the purge subsystem faster. This happened in MySQL 5.5 (separate purge thread) and 5.6 (multiple purge threads).

    In its original form (implemented by me according to directions from Heikki) the innodb_max_purge_lag does not take the purge view into account. That is, if nothing is purgeable yet, it would not make sense to delay the DMLs. I do not remember if we fixed this yet.

    Your reasoning that undo pages are being evicted from the buffer pool, only to be brought back soon later by the purge subsystem sounds plausible. I wonder if you could add some instrumentation to collect some evidence that this is actually the case.

    There have been suggestions of using dedicated buffer pools for certain stuff (say, avoid evicting undo pages, or avoid evicting other pages when doing a bulk operation on certain table).

  • Marko,
    Thanks for your explanation. This might explain behavior I saw here https://www.percona.com/blog/2014/12/17/innodbs-multi-versioning-handling-can-be-achilles-heel/ with secondary index scans seems to be never ending as there are more versions of index entries can be generated than are being scanned.

    Now to your other points

    1) I agree purge should be made faster, however I believe you either have to have algorithm which guarantees out of bound history growth is impossible with any workload or system do it artificially with slowing down modifications. I do not think it is good to have database which in some rare condition will go into the vicious cycle slow itself down use all disk space and crash. Purging is probably the most important issue right now though there is also log flushing – currently behavior is very bad if you imagine very high write rate for in-memory data set with very slow IO subsystem which has hard time flushing with speed needed.

    2) In terms of undo space I had done similar benchmarks with completely fits in memory where there is nothing but undo space becomes larger than buffer pool. I can see what else we can capture. Are you looking for more information how to make purging better ?

  • Peter,

    Since you are using a single table with 8 purge threads, the dict_index_t::lock is most likely the problem. Especially with flush@commit setting = 2. In 5.7 Yasufumi Kinoshita did some great work to mitigate this problem.

    The UNDO log access has a very complex access pattern. It is a file based list with 3 main orderings:

    1. Ordered on row changes (the UNDO roll ptr) – Used by MVCC

    2. Ordered on oldest to latest change – Used by purge for garbage collection

    3. Ordered on transaction ID for rollback – Transactions can UNDO their changes if required

    Purge is currently not clever enough to schedule the purging in a way to reduce IO nor potential lock contention. Also, purge threads do the read synchronously, which slows them down further. With the UNDO log we can’t leverage any kind of read ahead currently. The access patterns are just to random, however I think there is room to improve it, but not easy 🙂

  • Peter, I guess your use case is nothing like below, but I decided to mention this for completeness.

    The worst case is that the undo logs can grow without limit.

    In MySQL, it is very easy to stop the purge activity. Anyone with the privilege to start a read-only transaction with a MVCC read view can do it. All it takes is BEGIN;SELECT … FROM some_innodb_table … or just the single statement: START TRANSACTION WITH CONSISTENT SNAPSHOT. After that, leave the session open, and purge wil have to stop once it reaches this read view. You can also block purge across server restarts by moving a transaction into XA PREPARE state before killing the server or initiating shutdown.

    An analogy is that the undo tablespace is a kitchen sink. New DML operations are a flow of water from the water faucet. No matter how much you throttle the water flow (slow down the undo log generation), the kitchen sink (undo tablespace) will not become any emptier (purge) unless you remove the drain plug (commit or roll back the transaction with the oldest read view).

    This might perhaps be addressed with some undo log quota that (using my analogy) would essentially close the main water valve until the plug is removed. Or we could have some kind of idle timeout or emergency rollback for forgotten transactions that would essentially open the drain plug.

  • Sunny,
    Thanks for explanation. I recognize it is not the easy problem. In this particular blog post I just wanted to point out how something you would not expect to affect it a lot – log flushing seems to affect it significantly. There is also seems like a lot of possibilities for improvement and I would be very interested to see how 5.7 fairs in the similar tests. There seems to be huge set of problems from my tests

    – There seems to be some serve mutex contention when you have large number of versions for some workloads
    – Prefetch probably can be employed some better way as I see some low concurrency IO for undo space
    – Some better preference for undo space caching might also help – right now I seems to see it is being pushed to the disk to quickly which is a waste as we know with very high probability we’ll need it again.
    – I’ve seen performance regressions after undo space has allowed to grow to high number – ie if it ever grows to be 100GB the performance never is going to be the same as with very small one. This specific is probably fixed in 5.7

  • Marco,

    I agree with you the unlimited undo space is a serious DOS opportunity especially with high update volume systems. It always puzzled me why even basic limit say innodb_max_undo_space_size was not implemented in MySQL. The thing is you can cause problems both in terms of making system to run out of disk space as well as you can cause system to have serve performance problems with history size grown unrestricted.

    In terms of implementing limit I think it is good to consider what limit is best to implement – should it be around max undo space size so system essentially becomes read only after certain point, or should it be rather based on selects – ie very old transaction which is holding old snapshot which prevents purging will be killed. In my opinion former is better as it kills one transaction which was probably started by error instead of making system essentially unavailable for writes.

Leave a Reply