In this post, we’ll discuss what constitutes a big innodb_log_file_size, and how it can affect performance.
In the comments for our post on Percona Server 5.7 performance improvements, someone asked why we use innodb_log_file_size=10G with an indication that it might be too big?
In my previous post (https://www.percona.com/blog/2016/05/17/mysql-5-7-read-write-benchmarks/), the example used innodb_log_file_size=15G. Is that too big? Let’s take a more detailed look at this.
Looking at innodb_log_file_size
First, let me start by rephrasing my warning: the log file size should be set as big as possible, but not bigger than necessary. A bigger log file size is better for performance, but it has a drawback (a significant one) that you need to worry about: the recovery time after a crash. You need to balance recovery time in the rare event of a crash recovery versus maximizing throughput during peak operations. This limitation can translate to a 20x longer crash recovery process!
But how big is “big enough”? Is it 48MB (the default value), 1-2GB (which I often see in production), or 10-15GB (like we use for benchmarks)?
I wrote about how the innodb_log_file_size is related to background flushing five years ago, and I recommend this post if you are interested in details:
Since that time many improvements have been made both in Percona Server and MySQL, but a small innodb_log_file_size still affects the throughput.
How? Let’s review how writes happen in InnoDB. Practically all data page writes happen in the background. It seems as background writes shouldn’t affect user query performance, but it does. The more intense background writes are, the more resources are taken away from the user foreground workload. There are three big forces that rule background writes:
- How close checkpoint age is to the async point (again, see previous material https://www.percona.com/blog/2011/04/04/innodb-flushing-theory-and-solutions/). This is adaptive flushing.
- How close is innodb_max_dirty_pages_pct to the percentage of actual dirty pages? You can see this in the LRU flushing metrics.
- What amount of free pages are defined by innodb_lru_scan_depth. This is also in LRU flushing metrics.
So in this equation innodb_log_file_size defines the async point, and how big checkpoint age can be.
To show a practical application of these forces, I’ve provided some chart data. I will use charts from the Percona Monitoring and Management tool and data from Percona Server 5.7.
Before jumping to graphs, let me remind you that the max checkpoint age is defined not only by innodb_log_file_size, but also innodb_log_files_in_group (which is usually “2” by default). So innodb_log_file_size=2GB will have 4GB of log space, from which MySQL will use about 3.24GB (MySQL makes extra reservations to avoid a situation when we fully run out of log space).
Below are graphs from a tpcc-mysql benchmark with 1500 warehouses, which provides about 150GB of data. I used innodb_buffer_pool_size=64GB, and I made two runs:
- with innodb_log_file_size=2GB
- with innodb_log_file_size=15GB
Other details about my setup:
- CPU: 56 logical CPU threads servers Intel(R) Xeon(R) CPU E5-2683 v3 @ 2.00GHz
- OS: Ubuntu 16.04
- Kernel 4.4.0-21-generic
- The storage device is Samsung SM863 SATA SSD, single device, with ext4 filesystem
- MySQL versions: Percona Server 5.7.11
- innodb_io_capacity=5000 / innodb_io_capacity_max=7500
On the first chart, let’s look at the max checkpoint age, current checkpoint age and amount of flushed pages per second:
. . . and also a related graph of how many pages are flushed by different forces (LRU flushing and adaptive flushing). You can receive this data by enabling innodb_monitor_enable = '%'.
From these charts, we can see that with 2GB innodb_log_file_size InnoDB is forced by adaptive flushing to flush (write) more pages, because the current checkpoint age (uncheckpointed bytes) is very close to Max Checkpoint Age. To see the checkpoint age in MySQL, you can use the innodb_metrics table and metrics recovery_log_lsn_checkpoint_age and recovery_log_max_modified_age_sync.
In the case using innodb_log_file_size=15GB, the main flushing is done via LRU flushing (to keep 5000 pages ( innodb_lru_scan_depth) free per buffer pool instance). From the first graph we can figure that uncheckpointed bytes never reach 12GB, so in this case using innodb_log_file_size=15GB is overkill. We might be fine with innodb_log_file_size=8GB – but we wouldn’t know unless we set the innodb_log_file_size big enough. MySQL 5.7 comes with a very convenient improvement: now it is much easier to change the innodb_log_file_size, but it still requires a server restart. I wish we could change it online as we can for innodb_buffer_pool_size (I do not see technical barriers for this).
Let’s also look into the InnoDB buffer pool content:
We can see that there are more modified pages in the case with 15GB log files (which is good, as more modified pages mean less work done in the background).
And the most interesting question: how does it affect throughput?
With innodb_log_file_size=2GB, the throughput is about 20% worse. With a 2GB log size, you can see that often zero transactions are processed within one second – this is bad and says that the flushing algorithm still needs improvements in cases when the checkpoint age is close to or at the async point.
This should make a convincing case that using big innodb_log_file_size is beneficial. In this particular case, probably 8GB (with innodb_log_files_in_group=2) would be enough.
What about the recovery time? To measure this, I killed mysqld when the checkpoint age (uncheckpointed bytes) was about 10GB. It appeared to take 20 mins to start mysqld after the crash. In another experiment with 25GB of uncheckpointed bytes, it took 45 mins. Unfortunately, crash recovery in MySQL is still singlethreaded, so it takes a long time to read and apply 10GB worth of changes (even from the fast storage).
We can see that recovery is single-threaded from the CPU usage chart during recovery:
The system uses 2% of the CPU (which corresponds to a single CPU).
In many cases, crash recovery is not a huge concern. People don’t always have to wait for MySQL recovery – since even one minute of downtime can be too long, often the instance fails over to a slave (especially with async replication), or the crashed node just leaves the cluster (if you use Percona XtraDB Cluster).
I would still like to see improvements in this area. Crash recovery is the biggest showstopper for using a big innodb_log_file_size, and I think it is possible to add parallelism similar to multithreaded slaves into the crash recovery process.