Looking deeper into InnoDB’s problem with many row versions

A few days ago I wrote about the MySQL performance implications of InnoDB isolation modes, and I touched briefly on the bizarre performance regression I found in InnoDB's handling of a large number of versions for a single row. Today I wanted to look a bit deeper into the problem, which I have also filed as a bug.

First I validated the conditions under which the problem happens. It seems to occur only in REPEATABLE-READ isolation mode, and only when there are some hot rows that accumulate many row versions during a benchmark run. For example, the problem does NOT happen if I run sysbench with a “uniform” distribution.

In terms of concurrent selects it also seems to require some rather special conditions: the connection needs to hold a read snapshot open while some history accumulates, and then open another snapshot while that history is still large. The exact queries used to do this do not seem to matter much.

Contrary to what I expected, this problem also does not require multiple connections – I can reproduce it with just one read and one write connection, which means it can happen in many more workloads.

Here is a simplified case that reproduces it. A minimal sketch follows; the table definition is an assumed cut-down stand-in for sysbench's sbtest1, and the UPDATE is essentially what sysbench's update_index workload issues in a loop against a one-row table:
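
-- Hypothetical minimal setup: a single-row table that becomes very hot.
CREATE TABLE sbtest1 (
  id INT NOT NULL PRIMARY KEY,
  k  INT NOT NULL,
  KEY k_idx (k)
) ENGINE=InnoDB;

INSERT INTO sbtest1 VALUES (1, 0);

-- Write connection: run this in a tight loop (effectively what
-- sysbench's update_index workload does on a one-row table).
UPDATE sbtest1 SET k = k + 1 WHERE id = 1;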

It is reproducible on a table with just one row, where that row is very hot!

Select query sequence:
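The timings below are illustrative rather than exact; the key ingredients are an open read view while history accumulates, and then a second read view opened while that history is still long:

-- Read connection, first transaction: open a read view and let history build up.
START TRANSACTION WITH CONSISTENT SNAPSHOT;
SELECT * FROM sbtest1 WHERE id = 1;  -- touch the hot table
SELECT SLEEP(20);
COMMIT;

-- Second transaction: a new snapshot opened with history already long.
START TRANSACTION WITH CONSISTENT SNAPSHOT;
SELECT * FROM sbtest1 WHERE id = 1;
SELECT SLEEP(300);
COMMIT;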

Interestingly though, in this case it does not look like it is enough just to open a consistent snapshot – the issue only happens when I run the query against the same table the update workload is running on (I assume it has to touch the same data).

Let’s look at some graphs:

The transaction rate indeed suffers dramatically – in this case dropping more than 100x, from close to 15K down to around 60 at the lowest point.

It is interesting that CPU consumption is also reduced, but not by nearly as much as the throughput.

This is what I find most enlightening about what seems to happen: the number of buffer pool read requests increases dramatically at the same time as the number of transactions goes down. If we do the math, in the normal case we have 940K buffer pool reads against 14.5K updates, giving us about 65 buffer pool reads per update; this rises to some 37,000 per update while the regression is happening. That is a dramatic change, and it points to something going badly wrong.
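If you want to put rough numbers on this yourself, one way (a sketch, assuming a recent MySQL 5.6) is to sample the two counters a few seconds apart and divide the deltas:

-- Cumulative since startup: sample twice and work with the deltas.
SHOW GLOBAL STATUS WHERE Variable_name IN
  ('Innodb_buffer_pool_read_requests', 'Com_update');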

The history size is also of interest. We can see it grow quickly at first – this is during the first transaction and its sleep(20) – and then grow very slowly while the second transaction is open, as the update rate is low. Finally it drops back to zero very quickly once the transaction is committed. What is worth noting is that the history length here peaks at just 280K, which is a really quite trivial size.
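The history length itself can be watched in SHOW ENGINE INNODB STATUS or, assuming MySQL 5.6 or later, polled from the metrics table:

-- History list length as a queryable counter (enabled by default in 5.6):
SELECT name, `count`
  FROM information_schema.innodb_metrics
 WHERE name = 'trx_rseg_history_len';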

The InnoDB contention graph is interesting from two viewpoints. First, it shows that the contention picture changes both while the transaction is held open and right afterward, while the history is being worked off.

Second, if you look at this graph focusing only on the os_wait metrics, we can see that mutex os_waits go down during the problem while x-lock os_waits increase significantly. This means some very long exclusive-lock os_waits are happening, which is indicative of long waits. We can't actually get the wait times from status variables – we would need data from performance_schema for that, which unfortunately does not cover the key mutexes, and which I unfortunately can't graph easily yet.
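For what it is worth, the per-instrument wait summaries can be pulled along these lines (a sketch, assuming the wait/synch instruments are enabled; as noted above, the key mutexes may not be instrumented):

-- Top synchronization waits by accumulated wait time.
SELECT event_name, count_star, sum_timer_wait
  FROM performance_schema.events_waits_summary_global_by_event_name
 WHERE event_name LIKE 'wait/synch/%'
 ORDER BY sum_timer_wait DESC
 LIMIT 10;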

Let's look at some more data.

From the great processlist view in the sys schema we can get:
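Assuming the sys schema is installed, the query is along these lines:

-- Foreground connections only; \G for vertical output in the mysql client.
SELECT * FROM sys.processlist WHERE conn_id IS NOT NULL\G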

This is unfortunately not very helpful, as it does not show the actual lock being waited on – it just says the wait is somewhere inside a storage engine call, which is obvious.

We can get some good summary information from another sys schema view:
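Assuming again the sys schema, the summary I mean would be pulled like this:

-- Global waits aggregated by total latency.
SELECT * FROM sys.waits_global_by_latency LIMIT 10;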