As I’ve mentioned in previous posts, TokuMX replication differs quite a bit from MongoDB’s replication. The differences are large enough such that we’ve completely redone some of MongoDB’s existing algorithms. One such area is how secondaries apply oplog data from a primary. In this post, I’ll explain how.
In designing how secondaries apply oplog data, we did not look closely at how MongoDB does it. In fact, I’ve currently forgotten all I’ve learned about MongoDB’s implementation, so I am not in a position to compare the two. I think I recall that MongoDB’s oplog idempotency was a key to their algorithms. Because we chose not to be idempotent (to avoid complexity elsewhere), we couldn’t use the same design. Instead, we looked to another non-idempotent implementation for inspiration: MySQL.
“On a slave, replication involves 2 threads: the IO thread which copies the binary log of the master to a local copy called the relay log and the SQL thread which then executes the queries written in the relay log. The current position of each thread is stored in a file: master.info for the IO thread and relay-log.info for the SQL thread.”
What is not mentioned here is that if the binary log is enabled on the slave, the SQL thread will also replicate the queries that are written in the relay log to the binary log.
With TokuMX, we wanted a similar approach. We wanted one thread to be responsible for producing oplog data with a tailable cursor and writing it, and another thread to be responsible for replaying the oplog data and applying it to collections. But we did not want a separate relay log and binary log. This seemed to add unnecessary complexity. Instead, with TokuMX, the oplog is responsible for the work of the relay log and the binary log. To merge these functions, we added the “applied” bit to the oplog.
Here is how TokuMX secondaries apply oplog data. Hopefully, with this explanation, the use of the “applied” bit becomes clear:
- The “producer” thread reads oplog data from the primary (or another secondary in the case of chained replication), and writes that data to the oplog. In doing so, it sets the applied bit to false. This work happens within a single transaction. That means, should there be a crash, the system will know upon startup that this entry has not yet been applied to collections
- Then the “applier” thread, within a single transaction, applies the oplog data that has been written, and updates the oplog entry’s applied bit from false to true.
A nice property of this design is that upon recovering from a crash, the oplog is guaranteed to be up to date to a certain point in time, and that no gaps exist. That is, we don’t need to be worried about some oplog entry missing whose GTID is less than the GTID of the final entry.
However, because the applier may naturally be behind the producer, upon recovering from a crash, we need to find and apply all transactions whose applied bit is set to false. Here is how we do it. Once a second, another background thread learns what the minimum unapplied GTID is, and writes it to the collection “local.replInfo”. Because this value is updated only once a second, it is not accurate. However, it is a nice conservative estimate of what the minimum unapplied GTID actually is. Upon starting up a secondary that has already been initial synced, we read the oplog forward from this value saved in local.replInfo (which cannot be much more than a second behind the end), and apply any transaction whose applied bit is false.
A downside to this design is that data is written to the oplog twice for each transaction, once by the producer, and once by the applier to update the “applied” bit. In CPU-bound write-heavy workloads, this may present an issue (although we have no evidence). If necessary, we can likely improve upon this in the future, but that discussion is for another day.