On TokuMX Oplog, Tailable Cursors, and Concurrency

In a post last week, I described the difference in concurrency behavior between MongoDB’s oplog and TokuMX’s oplog. In short, here are the key differences:

  • MongoDB protects access to the oplog with a database level reader/writer lock, whereas TokuMX does not.
  • TokuMX can write data to the oplog concurrently, whereas MongoDB cannot.
  • As a result, because a cursor holds the read lock when reading from the MongoDB oplog, the cursor may safely read any and all data available at the moment without risk of missing any data.
  • TokuMX, on the other hand, needs to be aware of possible “gaps” in the oplog. That is, suppose transactions A, B, and C want to write to the oplog concurrently in that order. However, A and C commit before B writes anything. If a cursor tries to read the oplog at that point in time, it will see transactions A and C, but miss B. At this point, there is a “gap” between A and C in the oplog.
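To make the gap concrete, here is a small sketch of the A, B, C scenario. It assumes, purely for illustration, that GTIDs are plain integers and that the oplog is a dictionary keyed by GTID. Neither is how TokuMX actually stores things, and `naive_read` is a made-up helper that models a cursor which ignores gaps.

```python
# Hypothetical model: integer GTIDs and a dict-based oplog (illustration only).
oplog = {}

A, B, C = 1, 2, 3          # order in which the transactions claimed their slots

oplog[A] = "ops of A"      # A commits
oplog[C] = "ops of C"      # C commits before B has written anything


def naive_read(oplog, last_seen):
    """Return every committed GTID after last_seen, ignoring possible gaps."""
    return sorted(g for g in oplog if g > last_seen)


first_batch = naive_read(oplog, last_seen=0)   # sees A and C, misses B
position = first_batch[-1]                     # the cursor now sits at C

oplog[B] = "ops of B"      # B finally commits, behind the cursor

# The cursor only looks past C from now on, so B is never returned.
second_batch = naive_read(oplog, last_seen=position)
```

A reader that simply advances to the largest entry it has seen will skip B permanently, which is exactly what a secondary must never do.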

In MongoDB and TokuMX, secondaries use tailable cursors to pull data from the primary to run replication. With MongoDB, the tailable cursor algorithm is simple:

  • Grab the read lock on the oplog.
  • Read any new data that exists in the oplog.
  • If data exists, return it; otherwise, sleep until new data is available.
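The three steps above can be sketched as follows. This is a toy model, not MongoDB's code: `LockedOplog`, `append`, and `tail` are hypothetical names, and a single exclusive lock (a `threading.Condition`) stands in for the database level reader/writer lock.

```python
import threading


class LockedOplog:
    """Toy oplog in which one lock serializes every reader and writer."""

    def __init__(self):
        self._cond = threading.Condition()   # the lock plus a "new data" signal
        self._entries = []

    def append(self, entry):
        with self._cond:                     # writers take the same lock
            self._entries.append(entry)
            self._cond.notify_all()          # wake any sleeping tailable cursors

    def tail(self, position, timeout=None):
        """Return (entries after position, new position)."""
        with self._cond:                     # step 1: grab the lock
            # Step 3: if nothing is new, sleep until a writer signals.
            # wait_for releases the lock while sleeping.
            self._cond.wait_for(lambda: len(self._entries) > position, timeout)
            batch = self._entries[position:]  # step 2: read whatever is new
            return batch, position + len(batch)


oplog = LockedOplog()
oplog.append("op1")
oplog.append("op2")
batch, pos = oplog.tail(0)
```

Because no writer can be in the middle of an append while the reader holds the lock, everything before the end of the list is complete, and the cursor never has to think about gaps.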

This algorithm works because MongoDB’s cursors do not need to be aware of gaps. This algorithm does not work for TokuMX.

So, how do tailable cursors on the oplog work in TokuMX? That is, how are secondaries able to pull data from the primary without skipping over any gaps? Here is how.

As explained here, each oplog entry has a GTID, which is the _id field in the oplog entry. We have an object called the GTIDManager, whose job on the primary is to do the following:

  • Provide GTIDs to transactions requesting them.
  • Keep track of what GTIDs have been handed out.

Transactions and the GTIDManager have the following protocol:

  • When a transaction is ready to commit and wants to write its operations to the oplog (a process described here), it requests a GTID from the GTIDManager.
  • The GTIDManager provides the GTID, call it GTID ‘A’, and stores ‘A’ in a map of live GTIDs. This map keeps track of which GTIDs have been handed out to transactions that have yet to commit. This step, done very quickly, is protected by a “GTID mutex” in the GTIDManager.
  • The transaction proceeds to write its data to the oplog and commit.
  • The transaction then notifies the GTIDManager that GTID ‘A’ has committed (or, if something went wrong, aborted).
  • The GTIDManager reacquires the GTID mutex and removes ‘A’ from its map of live GTIDs.
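Here is a minimal sketch of that protocol, under some loud assumptions: GTIDs are modeled as increasing integers, the live map is a plain set, and the method names (`request_gtid`, `note_done`, `live_gtids`) and the `commit_transaction` helper are invented for illustration. The real GTIDManager in TokuMX is not written this way.

```python
import threading


class GTIDManager:
    """Toy GTID manager: hands out GTIDs and tracks the ones still in flight."""

    def __init__(self):
        self._mutex = threading.Lock()   # the "GTID mutex"
        self._next = 1                   # next GTID to hand out (assumed integer)
        self._live = set()               # GTIDs handed out but not yet finished

    def request_gtid(self):
        with self._mutex:                # held only to assign and record
            gtid = self._next
            self._next += 1
            self._live.add(gtid)
            return gtid

    def note_done(self, gtid):
        """Called once the transaction has committed or aborted."""
        with self._mutex:
            self._live.discard(gtid)

    def live_gtids(self):
        with self._mutex:
            return sorted(self._live)


def commit_transaction(manager, oplog, ops):
    """Follow the protocol: get a GTID, write to the oplog, then report back."""
    gtid = manager.request_gtid()
    try:
        oplog[gtid] = ops                # the oplog write happens outside the mutex
    finally:
        manager.note_done(gtid)          # runs on commit and on failure alike
    return gtid
```

Note that the mutex is never held while the oplog write is in progress, so many transactions can be writing to the oplog at the same time.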

A key point is this: at all times, the GTIDManager knows which GTIDs are in the process of committing, and therefore knows where every possible gap in the oplog may exist.

Tailable cursors that read from the oplog have the following behavior:

  • When the cursor goes to read data, it first asks the GTIDManager, “what is the minimum live GTID?”
  • The GTIDManager acquires the GTID mutex, and returns the minimum GTID in its live GTIDs map. If no GTID is live, it returns the next GTID it would hand out. The GTID mutex is then released.
  • The cursor then proceeds to read all oplog entries whose GTID is less than this value.

The important invariant about the value that the GTIDManager gives the tailable cursor is that the cursor knows no gaps exist before this value. Therefore, the cursor can safely read up to this value. Also notice in each of these steps that the GTID mutex is held for a very short time. The mutex is held only to assign GTIDs, update a map, and read from the map. All of these operations are very fast.
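The read rule can be sketched in a few lines. As before, this assumes integer GTIDs and a dictionary oplog, and the function names `min_live_gtid` and `safe_read` are hypothetical. For brevity the live set and the next GTID are passed in directly; in the scheme described above they would be read under the GTID mutex.

```python
def min_live_gtid(live, next_gtid):
    """Smallest GTID still in flight, or the next one to be handed out."""
    return min(live) if live else next_gtid


def safe_read(oplog, last_seen, live, next_gtid):
    """Return the GTIDs a cursor may read: after last_seen, below the bound."""
    bound = min_live_gtid(live, next_gtid)
    return sorted(g for g in oplog if last_seen < g < bound)


# A (1) and C (3) have committed; B (2) is still live; 4 would be handed out next.
oplog = {1: "ops of A", 3: "ops of C"}
first = safe_read(oplog, last_seen=0, live={2}, next_gtid=4)

# B commits and leaves the live set, so the bound moves up to 4.
oplog[2] = "ops of B"
second = safe_read(oplog, last_seen=first[-1], live=set(), next_gtid=4)
```

In the first read the cursor stops just short of B, even though C is already in the oplog. Once B has committed, the second read returns B and C in order, and nothing is skipped.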

And that is how tailable cursors over the oplog were changed to support a more concurrent oplog.
