EmergencyEMERGENCY? Get 24/7 Help Now!

How TokuMX Secondaries Work in Replication

 | April 15, 2014 |  Posted In: Tokutek, TokuView


As I’ve mentioned in previous posts, TokuMX replication differs quite a bit from MongoDB’s replication. The differences are large enough such that we’ve completely redone some of MongoDB’s existing algorithms. One such area is how secondaries apply oplog data from a primary. In this post, I’ll explain how.

In designing how secondaries apply oplog data, we did not look closely at how MongoDB does it. In fact, I’ve currently forgotten all I’ve learned about MongoDB’s implementation, so I am not in a position to compare the two. I think I recall that MongoDB’s oplog idempotency was a key to their algorithms. Because we chose not to be idempotent (to avoid complexity elsewhere), we couldn’t use the same design. Instead, we looked to another non-idempotent implementation for inspiration: MySQL.

Stephane Combaudon writes a nice explanation of how MySQL’s replication works here:

“On a slave, replication involves 2 threads: the IO thread which copies the binary log of the master to a local copy called the relay log and the SQL thread which then executes the queries written in the relay log. The current position of each thread is stored in a file: master.info for the IO thread and relay-log.info for the SQL thread.”

What is not mentioned here is that if the binary log is enabled on the slave, the SQL thread will also replicate the queries that are written in the relay log to the binary log.

With TokuMX, we wanted a similar approach. We wanted one thread to be responsible for producing oplog data with a tailable cursor and writing it, and another thread to be responsible for replaying the oplog data and applying it to collections. But we did not want a separate relay log and binary log. This seemed to add unnecessary complexity. Instead, with TokuMX, the oplog is responsible for the work of the relay log and the binary log. To merge these functions, we added the “applied” bit to the oplog.

Here is how TokuMX secondaries apply oplog data. Hopefully, with this explanation, the use of the “applied” bit becomes clear:

  • The “producer” thread reads oplog data from the primary (or another secondary in the case of chained replication), and writes that data to the oplog. In doing so, it sets the applied bit to false. This work happens within a single transaction. That means, should there be a crash, the system will know upon startup that this entry has not yet been applied to collections
  • Then the “applier” thread, within a single transaction, applies the oplog data that has been written, and updates the oplog entry’s applied bit from false to true.

A nice property of this design is that upon recovering from a crash, the oplog is guaranteed to be up to date to a certain point in time, and that no gaps exist. That is, we don’t need to be worried about some oplog entry missing whose GTID is less than the GTID of the final entry.

However, because the applier may naturally be behind the producer, upon recovering from a crash, we need to find and apply all transactions whose applied bit is set to false. Here is how we do it. Once a second, another background thread learns what the minimum unapplied GTID is, and writes it to the collection “local.replInfo”. Because this value is updated only once a second, it is not accurate. However, it is a nice conservative estimate of what the minimum unapplied GTID actually is. Upon starting up a secondary that has already been initial synced, we read the oplog forward from this value saved in local.replInfo (which cannot be much more than a second behind the end), and apply any transaction whose applied bit is false.

A downside to this design is that data is written to the oplog twice for each transaction, once by the producer, and once by the applier to update the “applied” bit. In CPU-bound write-heavy workloads, this may present an issue (although we have no evidence). If necessary, we can likely improve upon this in the future, but that discussion is for another day.



  • To clarify, in MySQL, the only time an event originated from the master is written to the slave’s binary log is when log-slave-updates is enabled. This option is a requirement in 5.6, if you are using GTIDs.

    In some cases, it will make sense to have a slave with binary logging enabled and log-slave-updates disabled. This allows one to easily audit writes to the slave outside of replication, or have databases on the slave which are not created on the master and subsequently perform Point In Time Recovery for that database.

  • Since producer and applier are separate threads:
    1. Even if the applier thread is falling behind due to high write load, the producer thread is still likely to be up to date with the primary, right?
    2. Other than network latency, any other causes for producer thread to fall behind primary?
    3. When a secondary is voted as a new primary, and its applier thread is behind its producer thread, will it continue to apply oplog entries till all applied bit updated as true, or will it become primary immediately and discard the oplog entries not applied yet?

    • The producer is not allowed to get too far ahead from the applier, so if the applier starts to fall behind, the producer will as well. Other than network latency, the producer may fall behind for reasons that the applier may fall far behind. When a secondary is voted primary, it is done so based on what the producer has consumed. Upon election, the machine will NOT become primary immedietely, but will wait for the applier to catch up, and then become primary. Because we make sure the applier is never too far behind the producer, this should not be an issue.

  • Hi Zardosht,
    has there been any work done in this area since you posted your article?
    We’ve got a high write load application and we’re trying to increase our theoretical replication throughput hopefully before we really need it.
    During load testing with the latest version of TokuMX, we can currently insert about 90.000 records / second into our primary, but the secondary is only writing about 30.000 records / seconds according to the mongostat output, even through both machines are identically configured.

Leave a Reply