Scaling TokuDB Performance with Binlog Group Commit
Rich.Prohaska
TokuDB offers high throughput for write-intensive applications, and the throughput scales with the number of concurrent clients. However, when the binary log is turned on, TokuDB 7.5.2 throughput suffers and no longer scales. The scaling problem is caused by a poor interaction between the binary log group commit algorithm in MySQL 5.6 and the way TokuDB commits transactions. TokuDB 7.5.4 for Percona Server 5.6 fixes this problem, and the result is roughly an order of magnitude increase in SysBench throughput for in-memory workloads.
MySQL uses a two-phase commit protocol to synchronize the MySQL binary log with the recovery logs of the storage engines when a transaction commits. Because fsyncs are used to make the data in the various logs durable, and fsyncs can be very slow, the fsync can easily become a bottleneck. A group commit algorithm amortizes the fsync cost over many log writes. The binary log group commit algorithm, in particular, is intended to amortize the cost of binary log fsyncs over many transactions.
When a transaction commits, it runs through a prepare phase and a commit phase. Hey, it is called two-phase commit for a reason.
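To make the amortization concrete, here is a minimal sketch of the idea. It is a toy model, not TokuDB or MySQL code: the `GroupCommitLog` class and both helper functions are hypothetical names, and an fsync is simulated by a counter rather than a real disk flush.

```python
class GroupCommitLog:
    """Toy in-memory log that counts fsyncs instead of touching a disk."""

    def __init__(self):
        self.pending = []     # records written but not yet durable
        self.durable = []     # records covered by a completed fsync
        self.fsync_count = 0

    def write(self, record):
        self.pending.append(record)

    def fsync(self):
        # A single fsync makes every pending record durable at once.
        if self.pending:
            self.durable.extend(self.pending)
            self.pending.clear()
            self.fsync_count += 1


def commit_one_by_one(log, records):
    # No grouping: every log write pays for its own fsync.
    for record in records:
        log.write(record)
        log.fsync()


def commit_in_groups(log, records, group_size):
    # Group commit: several log writes share one fsync.
    for start in range(0, len(records), group_size):
        for record in records[start:start + group_size]:
            log.write(record)
        log.fsync()


records = [f"txn-{i}" for i in range(8)]

serial_log = GroupCommitLog()
commit_one_by_one(serial_log, records)

grouped_log = GroupCommitLog()
commit_in_groups(grouped_log, records, group_size=4)

print(serial_log.fsync_count, grouped_log.fsync_count)  # 8 2
```

Both logs end up with the same durable records. Only the number of fsyncs differs, and that difference is what group commit buys when fsync is the slow step.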
During the prepare phase, TokuDB writes a prepare event to its recovery log and uses a group commit algorithm to fsync its recovery log. Since there can be many transactions in the prepare phase concurrently, the transaction prepare throughput scales with the number of transactions.
During the commit phase, the transaction’s write events are written to the binary log and the binary log is fsynced. MySQL 5.6 uses a group commit algorithm to fsync the binary log. Also during the commit phase, TokuDB writes a commit event to its recovery log and uses a group commit algorithm to fsync its recovery log. Since the transaction has already been prepared and the binlog has already been written, this fsync of the TokuDB recovery log is not necessary: after a crash, XA crash recovery will commit all of the prepared transactions that the binary log knows about and abort the others.
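The recovery rule described above can be sketched in a few lines. This is an illustration of the decision only, under the assumption that the engine reports its prepared transaction ids and the binary log reports the ids it contains. The function `xa_recover` and the integer ids are hypothetical and are not the MySQL or TokuDB recovery interface.

```python
def xa_recover(prepared_xids, binlog_xids):
    """Split the engine's prepared transactions into commit and abort lists.

    A prepared transaction is committed only if the binary log knows
    about it. Every other prepared transaction is aborted.
    """
    in_binlog = set(binlog_xids)
    to_commit = [xid for xid in prepared_xids if xid in in_binlog]
    to_abort = [xid for xid in prepared_xids if xid not in in_binlog]
    return to_commit, to_abort


# Transaction 102 was prepared in the engine but never reached the binlog.
to_commit, to_abort = xa_recover(
    prepared_xids=[101, 102, 103],
    binlog_xids=[100, 101, 103],
)
print(to_commit, to_abort)  # [101, 103] [102]
```

Because the outcome depends only on the prepare records and the binary log, a commit event lost from the recovery log in a crash does not change the result.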
Unfortunately, MySQL 5.6 serializes the commit phase so that the commit order is the same as the write order in the binary log. Since the commit phase is serialized, only one transaction at a time reaches TokuDB's commit, so there is nothing to group and TokuDB’s group commit algorithm is ineffective. Luckily, MySQL 5.6 tells TokuDB to ignore durability in the commit phase (the HA_IGNORE_DURABILITY property is set), so TokuDB 7.5.4 does not fsync its recovery log in that phase. This fixes the throughput bottleneck caused by serialized fsyncs of the TokuDB recovery log during the commit phase of the two-phase commit.
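A toy model of the serialized commit phase shows why skipping the fsync matters. It rests on stated assumptions: `ToyEngine` and `run_commit_phase` are hypothetical names, the `ignore_durability` argument only mirrors the meaning of HA_IGNORE_DURABILITY and is not the real handler interface, and fsyncs are counted rather than performed.

```python
class ToyEngine:
    """Toy storage engine that counts recovery log fsyncs at commit."""

    def __init__(self):
        self.recovery_log = []
        self.fsync_count = 0

    def commit(self, xid, ignore_durability):
        # The commit event is always written to the recovery log.
        self.recovery_log.append(("commit", xid))
        # The fsync is skipped when the server says durability is
        # already guaranteed by the prepare record and the binlog.
        if not ignore_durability:
            self.fsync_count += 1


def run_commit_phase(engine, xids, ignore_durability):
    # Serialized: transactions commit one at a time, in binlog order,
    # so no two commits can share an fsync.
    for xid in xids:
        engine.commit(xid, ignore_durability)


xids = list(range(16))

fsync_each = ToyEngine()
run_commit_phase(fsync_each, xids, ignore_durability=False)

skip_fsync = ToyEngine()
run_commit_phase(skip_fsync, xids, ignore_durability=True)

print(fsync_each.fsync_count, skip_fsync.fsync_count)  # 16 0
```

In this model the serialized commit phase costs one fsync per transaction unless the flag is honored, in which case it costs none. The prepare phase and the binary log still fsync, but both of those can be grouped.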
Since MariaDB uses a different binlog group commit algorithm, we have some additional work to ensure that TokuDB works nicely with it.
We used the SysBench update non-indexed test to measure throughput and will post a detailed blog post with the results later.