InnoDB Flushing: Theory and solutionsVadim Tkachenko
I mentioned problems with InnoDB flushing in a previous post. Before getting to ideas on a solution, let’s define some terms and take a look into theory.
The two most important parameters for InnoDB performance are innodb_buffer_pool_size and innodb_log_file_size. InnoDB works with data in memory, and all changes to data are performed in memory. In order to survive a crash or system failure, InnoDB is logging changes into InnoDB transaction logs. The size of the InnoDB transaction log defines how many changed blocks we can have in memory for a given period of time. The obvious question is: Why can’t we simply have a gigantic InnoDB transaction log? The answer is that the size of the transaction log affects recovery time after a crash. The bigger the log, the longer the recovery time.
Our current state is checkpoint_age, which is the age of the oldest modified non-flushed page. Checkpoint_age is located somewhere between 0 and innodb_log_file_size. Point 0 means there are no modified pages. Checkpoint_age can’t grow past innodb_log_file_size, as that would mean we would not be able to recover after a crash.
In fact, InnoDB has two protection points: “async” and “sync”. When checkpoint_age reaches “async”, InnoDB tries to flush as many pages as possible, while still allowing other queries. You can see the effect from this on the graph in my previous post, when throughput drops down to the floor. The “sync” stage is even worse. When we reach “sync”, InnoDB blocks other queries while trying to flush pages and return checkpoint_age to a point before “async”. This is done to prevent checkpoint_age from exceeding innodb_log_file_size. These are both abnormal operational stages for InnoDB and should be avoided at all cost. In current versions of InnoDB, the “sync” point is at about 7/8 of innodb_log_file_size, and the “async” point is at about 6/8 = 3/4 of innodb_log_file_size.
So, there is one critically important balancing act: On the one hand you want “checkpoint_age” as large as possible, as it defines performance. But, on the other hand, you should not reach the “async” point.
Our idea is to define another point T (target), which is before “async”, in order to have a gap for flexibility, and we will try at all cost to keep checkpoint_age from going past T. We assume that if we can keep “checkpoint_age” in the range 0 – T, we will achieve stable throughput.
Now, what factors affect checkpoint_age? When we execute queries that change pages (i.e., INSERT/UPDATE/DELETE), we perform writes to the log, we change pages, and checkpoint_age is growing. When we perform flushing of changed pages, checkpoint_age is going down.
So that means the main way we have to keep checkpoint_age within point T is to change the number of pages being flushed per second. That way, we can keep checkpoint_age down.
If this doesn’t help–and checkpoint_age keeps growing beyone T toward “async”–we have a second control mechanism: We can add a delay into INSERT/UPDATE/DELETE operations. This way we prevent checkpoint_age from growing and reaching “async”.
To summarize, the idea of our algorithm is : We keep checkpoint_age within point T by increasing or decreasing the number of pages flushed per second. If checkpoint_age continues to grow, we add “throttling” to prevent it. The throttling depends on the position of checkpoint_age — the closer to “async”, the bigger the throttling needed.
As mentioned in the previous post, the current implementation of flushing neighbors is problematic. When we want to flush N pages in order to control checkpoint_age, we expect that flushing pages will have an effect. In the current implementation, however, we may ask to flush 8 pages, and really only 1 of the pages flushed was useful (see previous post). So, we propose a fix for that: If we want to flush page P, we also flush only real neighbors, i.e.:
"......DDDDPDDDD....". In this way, not so many additional pages will be flushed, and it will be real sequential I/O.
We have some initial results which will follow, and they are promising. Here is graph we have showing our current experimental implementation. (If you want to look at the source code, it is here.)
You can see some jumps after the initial warmup stage, but the later throughput is tending toward a stable line.