The Road Story of a MyRocks/MariaDB Migration

This post has been written in collaboration with Nicolas Modalvsky of e-planning. Nicolas and I recently worked together on a tuning engagement involving MyRocks on MariaDB. While it is easy to find online articles and posts about InnoDB performance, information about MyRocks tuning is much harder to come by. Both storage engines are well documented, but what is really missing are tuning road stories for MyRocks. Those stories are important because they help explain the underlying behavior of MyRocks and how to address potential issues.

In this post, we’ll present one such road story, one that allowed MyRocks to really shine. We’ll start by providing some context explaining why MyRocks was considered, and then you’ll be invited to follow our tuning iterations.

Context

Nicolas’s company, e-planning, is an online advertising company participating both as seller and buyer in real-time auctions of online ads through the OpenRTB protocol. For the data used in the auction process, e-planning has servers in eight data centers across the world and uses a mesh-like replication topology with MariaDB GTIDs. The servers are split into three regions: the US, Europe, and Asia. Within each region, a full mesh is implemented between all servers in the region. For instance, in the US region, each database server has 11 replicas and is itself the replica of 11 other servers. The workload is upsert-heavy and almost all queries are idempotent. There are tens of millions of inserts every day, and the captured data is valid for only two months; older data must be pruned. The replicas run on bare metal servers, usually dual Xeon E5 v3 or v4, with 192 GB to 384 GB of RAM.

Until recently, InnoDB was used and the tables were compressed with the MariaDB option page_compressed=1 using zlib. Although the servers were able to process the workload, there were a number of drawbacks:

  • The servers were close to their limits and were slow to catch up with replication after a maintenance period
  • Older rows needed to be deleted daily, in small batches, a resource-intensive process
  • The dataset size was quite large, at close to 540 GB
  • The storage write bandwidth was high, at about 12 MB/s for most of a regular day, shortening the life span of the SSDs

In order to give some real numbers, here are the metadata of three active tables:

The combined size of these three tables is approximately 126 GB.

First Attempt with MyRocks

Prior to the engagement with Percona, one of the servers was converted to MyRocks and configured based on the Facebook MyRocks wiki and other sources found on the web. With MariaDB, the bundled RocksDB does not support LZ4 and Zstd, so Snappy is used by default. The resulting sizes of the three sample tables were:

The combined size of the three tables is now about 98 GB, a 22% reduction compared to InnoDB. That’s a modest improvement and a bit disappointing; we’ll have to work on that front too. However, replication started to lag on nearly all the channels. A number of configuration tweaks were attempted, mostly to various buffer sizes, but nothing worked. At this point, Percona was called to the rescue.

The First Round of Tuning

A rapid survey of the MyRocks server found low CPU usage but quite busy storage. The database was generating a rather modest rate of small write operations, around 400/s, averaging between 2 and 3 KB in size. Such a write pattern, especially for an LSM engine like RocksDB which essentially just writes sequentially, is a signature of fsync calls. SSDs are awesome for reads, good for writes, but only OK for fsyncs.

There are only a few MySQL variables affecting fsyncs, and in the my.cnf, I found the following:
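Roughly, a sketch reconstructed from the description in the next paragraph (sync_binlog matches the 1500 figure given below; the innodb_flush_log_at_trx_commit value of 2 is an assumption, since a value of 0 would also flush about once per second):

    [mysqld]
    # binary log fsynced only once per 1500 transactions
    sync_binlog = 1500
    # InnoDB redo log flushed to disk about once per second, not at every commit
    innodb_flush_log_at_trx_commit = 2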

The binary log is fsynced only once per 1500 transactions and InnoDB flushes its redo log files only about once per second. But what about MyRocks? The variable rocksdb_flush_log_at_trx_commit was missing from the my.cnf, and its default value is 1, which means one fsync of the RocksDB log file per transaction. Setting rocksdb_flush_log_at_trx_commit to 2 (one flush per second) allowed replication to immediately start recovering. A great start, but the tuning effort was far from over.

Change #1: rocksdb_flush_log_at_trx_commit = 2

The next day, I was expecting the MyRocks server to be in sync with the others, but that wasn’t the case. Replication had slowed down during the night and the most active replication channels were again unable to cope with the incoming write load. The read load was also much higher. What was going on? Raising rocksdb_block_cache_size to 75% of the RAM helped, but only a little. What did help, for a few hours, was a full manual compaction using:
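In MyRocks, a manual compaction can be triggered by setting rocksdb_compact_cf to the name of a column family; the statement below is a sketch, assuming the data lives in the default column family:

    -- trigger a full manual compaction of the default column family
    SET GLOBAL rocksdb_compact_cf = 'default';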

Unfortunately, a few hours later, the issue was back. What was going on this time? We’ll come back to the read load issue a bit later because a parallel effort was in progress to improve data compression.

Improved Compression

For an unknown reason, MariaDB isn’t built with Zstd support, the best compressor in my book. The default compression algorithm on RPM-based distros is Snappy. Snappy is quite fast, but its compression ratio is not as good. Our goal was to see how LZ4 for the upper LSM levels and Zstd for the bottommost level would compare. In order to do that, we either had to migrate to Percona Server for MySQL (which includes LZ4 and Zstd compression) or recompile MariaDB. The complex replication topology and the use of the MariaDB GTID implementation led us to favor the second solution, so we recompiled MariaDB with the Zstd and LZ4 development libraries installed. The compilation went smoothly and then we updated the column family with:
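One way to apply such a change dynamically is through the rocksdb_update_cf_options variable; the statement below is a sketch assuming the default column family, with LZ4 on the upper levels and Zstd on the bottommost level (the same options can also be set at startup with rocksdb_override_cf_options in the my.cnf):

    -- compression_opts is window_bits:level:strategy, so Zstd level 1 here
    SET GLOBAL rocksdb_update_cf_options =
      'default={compression=kLZ4Compression;bottommost_compression=kZSTD;compression_opts=-14:1:0}';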

and we forced a full compaction:
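As before, a sketch using the manual compaction trigger on the default column family:

    SET GLOBAL rocksdb_compact_cf = 'default';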

Once done, the results were much better:

This time, the total size of the three tables is down to 54 GB; that’s a reduction of 57% compared to InnoDB. For now, the results are good enough, but we could have tried a few more things. For example, the Zstd compression level was only set to ‘1’, the lowest; this is the middle number in compression_opts in the column family definition.

Change #2: LZ4 and Zstd

The impact of these changes on the increasing read load issue was marginal, but we were done for the day.

High Memory Usage

The next morning, we noticed MariaDB had been killed during the night by the OOM killer. This appears to be a known issue with the way RocksDB manages its block cache and the glibc malloc. We switched the memory allocator to jemalloc, essentially adding the following section to the my.cnf file:
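A sketch of such a section, relying on the malloc-lib option of mysqld_safe (the jemalloc library path shown here is an assumption and varies by distribution):

    [mysqld_safe]
    # preload jemalloc instead of the glibc malloc
    malloc-lib = /usr/lib64/libjemalloc.so.2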

and the issue was gone.

Change #3: jemalloc

Increased Read Load Over Time

So we were back to the increasing read load issue. After some research and analysis, the issue became obvious. The RocksDB block cache (similar to the InnoDB buffer pool) is there to lower the number of reads, but its content is uncompressed. Making the block cache bigger consumes memory that would normally be used by the OS file cache to cache the compressed files. However, we noticed the file cache was not fully used; there was quite a lot of unallocated memory. The culprit was the variable rocksdb_use_direct_reads which… as its name indicates, makes RocksDB use O_DIRECT for reads. O_DIRECT operations bypass the OS file cache. Setting this variable to OFF and lowering the block cache to about 25% of the memory made a big difference.

The read load was still rising a bit, but at a much slower pace: the degradation now took days instead of hours. That’s expected given the workload, and we were already planning for periodic manual compactions.

Change #4: rocksdb_use_direct_reads = OFF and rocksdb_block_cache_size = 48G (~25% of RAM)
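In my.cnf form, this change looks roughly like the lines below; rocksdb_use_direct_reads is applied when the engine starts, so it requires a restart, and 48G corresponds to roughly 25% of the RAM of these servers:

    [mysqld]
    # let the OS file cache hold the compressed SST files
    rocksdb_use_direct_reads = OFF
    # uncompressed block cache, about 25% of RAM
    rocksdb_block_cache_size = 48G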

Lazy Deletions with TTL

As mentioned earlier, the recorded data is short-lived; it needs to be deleted after 60 days. MyRocks has an outstanding TTL feature: you can specify how long a record is kept, and expired records are simply dropped during compaction. This is an extraordinarily good fit for the purpose. In order to use the TTL feature, you first need to enable it with the variable:
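A sketch, assuming the rocksdb_enable_ttl variable (an assumption here; it is typically on by default):

    [mysqld]
    # enable the MyRocks TTL compaction filter so expired rows are dropped
    rocksdb_enable_ttl = 1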

and then add a comment to each table whose records should have a limited lifespan, like here:
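A sketch with a hypothetical table, expressing the 60-day retention in seconds (60 days = 5,184,000 seconds) in the table comment:

    -- hypothetical table: rows expire 60 days after insertion
    CREATE TABLE auction_data (
      id BIGINT UNSIGNED NOT NULL PRIMARY KEY,
      payload VARBINARY(255)
    ) ENGINE=ROCKSDB
      COMMENT = 'ttl_duration=5184000;';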