It has been a while since I have looked at InnoDB crash recovery. A lot has change in the last few years – we have serious crash recovery performance improvements in MySQL 5.5 and MySQL 5.6, we have solid state drives raising as typical high performance IO subsystem and we also have the ability to set much larger log files and often have a much larger InnoDB Buffer Pool to work with.
First let me revisit the challenge with have with InnoDB configuration. For write-intensive workloads it is extremely important to size innodb_log_file_size for good performance, however the longer log file size you have the longer you might have to wait for InnoDB to complete crash recovery, which impacts your recovery strategy.
How much can innodb_log_file_size impact performance? Massively! Doing intensive writes to a database that well fits in memory, I’d say there’s a 10x difference between having combined size of log files of 32GB vs 64MB.
Before we look at some performance numbers let’s talk about what InnoDB Crash Recovery time depends on:
Combined Innodb Log File Size – innodb_log_file_size*innodb_log_files_in_group is what really matters. It does not really matter which of those two you change. I prefer to keep innodb_log_files_in_group as default and only work with innodb_log_file_size. The larger size you have allocated the longer recovery will take.
innodb_checkpoint_age – Combined size of InnoDB log files defines how many changes not reflected in the tablespace we may have where innodb_checkpoint_age shows how much changes we actually have at the current moment, being an actual driving factor of recovery time. If you have very large log files allocated but for your workload innodb_checkpoint_age stays low chances are recovery will be quick. Be careful however – intensive writes can cause innodb_checkpoint_age to go much higher than the average for your workload causing recovery time from crashes at that time to be much longer.
Innodb Buffer Pool Size – This is another very important factor. During recovery, InnoDB has to redo changes to the unflushed/dirty pages from buffer pool, which is obviously limited by buffer pool size. This also means innodb_max_dirty_pages_pct can be used to impact recovery speed. This is the number of dirty pages being the true driving factor. With small buffer pool, a limited number of dirty pages based on the workload you might not have innodb_checkpoint_age to go high even if you have allowed for large log space.
Data Structure matters a lot for recovery speed. Generally shorter rows being updated will mean longer recovery time for the same log file size. This should make sense as shorter row changes means there is less log space produced for the same amount of page changes. If you do a lot of blob writes InnoDB crash recovery can be short even with relatively large log files.
Access Pattern is another key factor – the more “random” access is the more distinct pages you will have touched during the same innodb_checkpoint_age the longer recovery can take.
Hardware – Better hardware means recovery goes faster, as much is obvious. More specifically you will be looking for storage performance at low concurrency (both reads and writes are important) as well as fast CPU cores – crash recovery at this point is not able to use multiple cores effectively.
Let’s now look at the test….
I am running Sysbench on an 11GB table, designed to fit in the 12GB buffer pool. Here is the exact command:
sysbench --tx-rate=4000 --num-threads=64 --report-interval=10 --max-time=0 --max-requests=0 --rand-type=uniform --oltp-table-size=40000000 --mysql-user=root --mysql-password=password --test=/usr/share/doc/sysbench/tests/db/update_index.lua run
The box is rather low end i3-4010U (4 threads) CPU with a Samsung EVO 840GB SSD, so numbers are expected to be higher on real server hardware.
In my first test I’m injecting 4000 updates/sec which is about half of what the box can do at sustained load. I do this to illustrate more common load scenario as we rarely run systems at their saturation point in real world. The uniform distribution should mean worse case scenarios for in-memory workloads through I think recovery speed would be slower if u use random writes to the database much larger than the amount of memory.
At this workload I’m getting innodb_checkpoint_age of 15GB even though total log file size is 32GB. Crashing the system makes for about 40 minutes recovery time so the log was processed at the 6.25MB/sec
Here are some interesting graphs:
As you can see recovery is essentially close to single core. It also can be CPU bound at times (and will be more so with faster storage) – at certain times of recovery when logs are being scanned it can be completely CPU bound (see how IO wait essentially goes to zero at some times)
Over time as recovery progresses more and more blocks become cached, so they do not have to be read from the disk for log records to be applied, meaning the workload becomes more and more write bound.
This is an unweighted IO utilization graph where 1000 corresponds to 100% of time where at least one IO request was outstanding. As you can see from this and the previous drive, InnoDB does not keep the IO well saturated all the time during crash recovery.
Additionally to the first sysbench crash test I did two more – one running a system completely saturated with updates. This made innodb_checkpoint_age to go as high as 23.2GB and crash recovery took 1h 20 minutes, showing some 4.8MB/sec The thing to consider in this case is that MySQL was not able to keep up with purging the history so it was growing quickly meaning crash recovery had to cover a lot of undo space modifications.
Finally I also did a run with a more skewed pareto distribution which resulted in 9.8G innodb_checkpoint_age 33min crash recovery time and 4.94MB/sec of log processing speed.
As I explained above there are a lot of moving parts so your numbers are likely to be quite different, yet I hope this can provide some reasonable baseline you can use for calculation.
Note also waiting for the server to recover from the crash is only one way to deal with recovery. Even if you size log files to be very small you will likely need to deal with Operating System boot and when warmup which will take a few minutes. It is often much better to use a different primary method of crash recovery, such as failover to the MySQL Replication Slave or using Percona XtraDB Cluster. If you use these methods you can often use quite a high combined InnoDB log file size to optimize for performance.
Final Thoughts: Even though InnoDB Crash Recovery has improved in MySQL 5.5 and MySQL 5.6 there is still room to improve it even more. As we see from the resource usage graphs during recovery there is an opportunity to both use multiple CPU cores more effectively as well as drive IO subsystem with higher concurrency and in more sustained fashion.