Index maintenance can be a real challenge for anyone managing databases, and what makes it even trickier is that open source databases each handle it differently. In this post, we’ll take a closer look at how those differences show up in practice, and what they mean for you.

When rows are added, updated, or deleted from a table, the database must adjust the indexes, much like libraries operated before the computer era. Imagine a number of indexes, such as Author, Title, and Subject, stored on paper cards in cabinets full of small drawers. When a book is added (an insert), someone has to retrieve the relevant cards from all the indexes, write the new entry on them, and put them back in place. Databases perform similar tasks: retrieving a card becomes a read IO, and putting it back in place becomes a write IO. In the absence of any caching, each index will require at least one read and one write operation.

Instead of immediately inserting into all the secondary indexes, you could keep a pending list of entries to process at the end of the day. The hope is that you may get two or more books from authors with similar names or on the same subject; merging these updates saves time searching through the cabinets. The obvious drawback is that a regular search must also check the pending list to see whether the entry you are looking for is up to date. This is the index maintenance approach used by the MySQL InnoDB change buffer (CB).

With this introduction behind us, let’s examine the number of IOPs required to insert 10M rows with six large secondary indexes. The schema of the test table (here MySQL/InnoDB syntax) is:
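The original CREATE TABLE statement is not reproduced here; the following is a hypothetical reconstruction consistent with the description below (the table name, column types, and index layout are my assumptions, not the exact DDL used in the test):

```sql
-- Hypothetical sketch: three random UUIDv4 columns and
-- six large secondary indexes (names and layout assumed).
CREATE TABLE test_table (
  id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  a CHAR(36) NOT NULL,  -- random UUIDv4
  b CHAR(36) NOT NULL,  -- random UUIDv4
  c CHAR(36) NOT NULL,  -- random UUIDv4
  PRIMARY KEY (id),
  KEY idx_a (a),
  KEY idx_b (b),
  KEY idx_c (c),
  KEY idx_ab (a, b),
  KEY idx_bc (b, c),
  KEY idx_ca (c, a)
) ENGINE=InnoDB;
```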

The columns a, b, and c are random UUID values (v4). As I wrote in my blog MySQL UUIDs – Bad For Performance, I am not a fan of UUID values in a database, especially when there are indexes on them. In the present context, however, since we want to exacerbate ugly behaviors, they are awesome. All the details regarding the configurations and the test procedure can be found here.

My test environment is a KVM virtual machine with 4GB of memory and two vCPUs. The limited memory is on purpose, since the goal is to highlight the IO aspects. Though it does not matter much to the present discussion, the data storage is throttled to 1,000 IOPS with a maximum IO size of 16 KB. The throttling is somewhat similar to how AWS throttles EBS volumes and is there just to make sure the bottleneck is known and controlled. Since the goal of this post is to highlight the IO efficiency of maintaining the data structures, durability was relaxed.

For this little experiment, I chose commonly used releases of common open source databases instead of testing with bleeding-edge versions with little adoption. These are:

  • Percona Server for MySQL 8.0.40-31
  • Percona Server for PostgreSQL 15.10
  • Percona Server for MongoDB 6.0.20

Insertion results

The number of read and write IOPs required to insert the 10M rows is presented below.

Read and write IOPs for 10M rows inserted

The labels are explained in the next section, so let’s review the results in more detail in order of total number of IOPs used.

MySQL InnoDB

The most efficient technology in terms of total IOPs is InnoDB with the change buffer enabled. To be honest, I was expecting good performance, but not at that level. The impact is significant, especially when you compare with the feature disabled (InnoDB No CB). In this specific benchmark, enabling the change buffer reduced the number of write IOPs by a factor of five and reads by a factor of 10! Without the change buffer, InnoDB index maintenance performance is poor, especially in terms of write IOPs. That's something to keep in mind, since the change buffer is disabled by default in MySQL 8.4.x.
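For reference, the change buffer is controlled by the innodb_change_buffering variable. On releases where the default is none, it can still be enabled explicitly; treat this as a conscious trade-off, since the feature is deprecated in recent MySQL releases:

```sql
-- Check the current setting
SHOW VARIABLES LIKE 'innodb_change_buffering';

-- Buffer all secondary index change types
-- (other accepted values: none, inserts, deletes, changes, purges)
SET GLOBAL innodb_change_buffering = 'all';
```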

Of course, there is a cost to pay for those low IOPs numbers. The indexes are in a "degraded" state at the end of the inserts, as there are missing entries in their B-trees. InnoDB works in the background to update the indexes, but this can take hours, depending on server load. Forcing immediate scans of all secondary indexes took around 33 minutes and added nearly 1.4M write IOPs; subsequent scans took only around seven minutes and caused no additional writes. The change buffer is also a complex and delicate piece of software that has had a number of issues over the years, which is likely the main reason it was disabled by default in MySQL 8.4.x.

MySQL MyRocks/RocksDB

MySQL using RocksDB (aka MyRocks engine) is the second most efficient technology in our little experiment. While InnoDB is B+Tree-based, MyRocks uses LSM trees as its underlying data structure. Although LSM tree storage engines are great for writes, I wasn’t expecting the inserts to be that good when executed in random order. There were only about 28% more write IOPs, but more than four times the read IOPs. Those read IOPs are solely from the compaction operations.

MySQL InnoDB compression

I naively assumed adding InnoDB compression (CMP) would lower the amount of data written, but this is not the case. While there are fewer writes to the data file, writes to the doublewrite buffer and the redo log files increased significantly. Part of that is caused by innodb_log_compressed_page. This variable causes recompressed pages to be written to the redo log file, to prevent corruption issues if the zlib library is upgraded. With this variable enabled, the redo log becomes a kind of triple write buffer. If you use InnoDB compression and are not planning to upgrade MySQL (and thus zlib), you could consider setting this variable to OFF. Without this behavior (NOLOG), the number of write IOPs is cut by nearly half. This is still, however, an increase of more than 60% compared to the original InnoDB numbers.
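If that trade-off makes sense for your environment, the variable is dynamic and can be changed at runtime; make sure the zlib-upgrade caveat above does not apply to you first:

```sql
-- Stop logging recompressed page images to the redo log
SET GLOBAL innodb_log_compressed_page = OFF;

-- To persist across restarts, also set it in my.cnf:
-- [mysqld]
-- innodb_log_compressed_page = OFF
```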

PostgreSQL

PostgreSQL does not cheat with index maintenance: for every insert, all the indexes are updated, and the results are still very good, significantly better than InnoDB without the change buffer (No CB). To achieve those low write IOPs numbers, I had to set checkpoint_timeout to 30 minutes and max_wal_size to 12GB. PostgreSQL uses the WAL not only as a redo log but also as a kind of double write buffer. Because of that, when coming from a MySQL background, one has to set max_wal_size to a much higher value than innodb_redo_log_capacity.
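The relevant postgresql.conf settings used for the test were along these lines:

```ini
# Spread checkpoints out so full-page writes to the WAL happen less often
checkpoint_timeout = 30min
max_wal_size = 12GB
```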

MongoDB

Finally, MongoDB/WiredTiger comes last, though more as a reference point since I didn't spend much time tuning it. In terms of write IOPs, the results are still slightly better than InnoDB without the change buffer (No CB); it really appears that InnoDB is not well-tuned for operations without the change buffer. I wonder why the read IOPs are so high, higher than any of the MySQL results. MongoDB relies partly on the operating system file cache for reads, which seems inefficient, as two caching systems stacked on top of each other are likely caching the same data.

Insert rate stability

While the focus was on index maintenance, I couldn’t avoid looking into the stability of the insert rate. Database performance stability is very important because a sudden drop in performance can stress the stability of the whole infrastructure.

MySQL/InnoDB of the 5.0 era used to be pretty bad at performance stability under high writes. This was caused by a crude checkpointing algorithm that caused frequent flush sync events. The introduction of the adaptive flushing feature in InnoDB in 5.1 was a huge step toward solving this problem. Now, InnoDB delivers fairly stable write performance, at least under a steady write load.

InnoDB insert rate stability

There are some small fluctuations, but nothing major. This is impressive. If you want to learn more, my colleague Francisco and I wrote about how InnoDB flushing works a few years ago.

In the test environment, where the IOPs are limited, an LSM storage engine like RocksDB will struggle to maintain stable performance. There will be fast insert periods when only the WAL is being written to and slow periods when the write buffer is flushed or when compaction is triggered. The RocksDB results are shown below and indeed vary a lot.

MyRocks/RocksDB insert rate stability

While the above results were expected from the storage engine design, my initial results for PostgreSQL were disappointing, as shown here:

PostgreSQL mysterious insert stalls

While most of the time the insert rate is decently stable, there are long drops lasting 20 to 40 seconds. That instability puzzled me for a while: such an issue would have major implications for anyone using PostgreSQL, and it is obviously not a widespread problem. Thanks to my colleague Jorge Torralba, one of our PostgreSQL experts, I found the culprit:

By default, PostgreSQL triggers a statistics refresh when about 10% of the tuples have been modified. With MySQL, InnoDB does something similar when innodb_stats_auto_recalc is enabled (the default). Where things differ, though, is the size of the sample used for the statistics. If I read the log message correctly, PostgreSQL accessed in excess of 30k blocks to refresh the statistics, more than 4k per B-tree. In contrast, by default, InnoDB does 20 16KB page dives per B-tree, which means around 100 blocks accessed, depending on the number of B-tree levels. Maybe that's one of the reasons the InnoDB statistics are so poor. Anyway, after disabling automatic statistics refreshes with:
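The exact statement is not reproduced here; one way to achieve this for a single table is to disable autovacuum (and thus auto-analyze) at the table level, along these lines (the table name is hypothetical):

```sql
-- Per-table: stop autovacuum, including the automatic ANALYZE
ALTER TABLE insert_test SET (autovacuum_enabled = off);

-- Alternatively, globally in postgresql.conf: autovacuum = off
```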

I got a much improved behavior:

PostgreSQL insert rate with statistics refresh disabled.

This behavior is very predictable. If you suffer from similar performance dips caused by the autovacuum analyze process, consider disabling the automatic refresh. Make sure, however, to set up a cron job to refresh the statistics at night or when the load is low; otherwise, your statistics will quickly degrade.

Conclusion

Database engines are the result of years of development and design choices, and, as shown in this post, even for very simple tasks such as index maintenance, their behaviors are quite different. If I can summarize a few of the things we learned with this insert experiment in an IO-limited environment:

  • InnoDB change buffer is quite efficient, but you contract an IO debt
  • InnoDB is the worst for writes without the change buffer
  • InnoDB compression increases the amount of writes
  • MyRocks is very write-efficient IO-wise, but its insert rate fluctuates a lot
  • PostgreSQL is decently efficient, given that it is fully updating the secondary indexes
  • MongoDB ranks last, but this is from an untuned configuration
  • PostgreSQL statistics collection is no joke; it is quite impactful in an IO-limited environment
  • Compared to PostgreSQL, the default InnoDB statistics look almost unserious. That is part of the reason why InnoDB may suffer from unstable execution plans for large tables.

I am very surprised by the amount of knowledge gathered from such a simple experiment, and I intend to test other scenarios and see what else can be learned.
