MongoDB replication has some great features. As I discussed in my last post, MongoDB’s crash safety design is very elegant. In addition to that, MongoDB has automatic failover, parallel slave replication, and prefetch threads on secondaries. The latter, as Mark Callaghan points out, is similar to “InnoDB fake changes”, a feature that has helped Facebook run MySQL at scale.
Despite these great features, TokuMX replication is completely incompatible with MongoDB’s replication. As the saying goes, this is a feature and not a bug. We made this decision consciously, fairly early on in our development cycle. What I would like to do in this blog post is explain why, and over the next series of posts, explain how TokuMX replication works.
Quickly, let’s first discuss what is the same: in short, anything that does not touch the oplog. Failover, the election protocol, heartbeats, and replica states were not changed. We tweaked some diagnostic commands like rs.status() to display things we’ve changed, but at a high level, these stayed the same.
What changed? In short, anything that had to do with the oplog.
We had three big reasons for changing replication:

- supporting multi-document transactions
- improving the concurrency of writes to the oplog
- fully leveraging Fractal Tree indexes on secondaries
Let’s discuss each of these.
Transactions

We considered two big things when thinking about transactions.
First, we knew we wanted to support multi-document transactions, and for good reasons. That meant if we have a batched insertion, we only want to write the inserts to the oplog if the transaction commits. I’ll elaborate on this more in a future post, but doing this with MongoDB’s existing replication protocol would be inefficient. Because MongoDB does not have these transactional semantics, document operations are written individually to the oplog. With a TokuMX batched operation, we found it more efficient to write one batched entry to the oplog as opposed to individual entries. So this design change was one motivation.
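To make the difference concrete, here is a toy sketch contrasting one-entry-per-operation logging with one batched entry per committed transaction. The entry formats are invented for illustration and are not TokuMX’s actual oplog schema.

```python
# Illustrative only: entry formats are invented, not TokuMX's real schema.

def log_individually(oplog, docs):
    # MongoDB-style: one oplog entry per document operation.
    for doc in docs:
        oplog.append({"op": "i", "o": doc})

def log_batched(oplog, docs):
    # TokuMX-style: the whole committed transaction becomes one entry,
    # written only when the transaction commits.
    oplog.append({"op": "txn", "ops": [{"op": "i", "o": d} for d in docs]})

docs = [{"_id": i} for i in range(3)]
mongo_oplog, tokumx_oplog = [], []
log_individually(mongo_oplog, docs)
log_batched(tokumx_oplog, docs)
print(len(mongo_oplog))   # 3 entries
print(len(tokumx_oplog))  # 1 entry
```

Beyond efficiency, the batched form also means an aborted transaction never touches the oplog at all.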
Second, MongoDB’s replication depends on having idempotent oplog operations, meaning every oplog entry can be applied multiple times without changing the resulting document. I think, although I have no proof and would love to be corrected if wrong, a big reason is that without MVCC snapshot transactions, idempotency was the best way to run an initial sync. The initial sync takes note of the oplog position when it begins, data is copied from various points in time to the secondary, and then the oplog is played forward from the recorded position to bring the machine to a known position in the oplog. Idempotency is relied upon to ensure that copying collections from various points in time is ok.
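A small illustration of why idempotency matters, using toy versions of MongoDB’s $set and $inc update operators. The apply function is invented; MongoDB’s real oplog sidesteps this problem by storing updates in an already-idempotent form.

```python
# Toy apply function: the operator names mirror MongoDB's update
# operators, but this is not MongoDB's actual oplog application code.

def apply_entry(doc, entry):
    op, field, value = entry
    if op == "$set":
        doc[field] = value                       # same result however often applied
    elif op == "$inc":
        doc[field] = doc.get(field, 0) + value   # result changes on every replay

doc = {"n": 0}
set_entry = ("$set", "n", 5)
inc_entry = ("$inc", "n", 5)

apply_entry(doc, set_entry)
apply_entry(doc, set_entry)
print(doc["n"])  # 5 -- replaying the $set entry is harmless

doc = {"n": 0}
apply_entry(doc, inc_entry)
apply_entry(doc, inc_entry)
print(doc["n"])  # 10 -- replaying the $inc entry corrupts the document
```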
We did not want to rely on idempotency to ensure TokuMX replication was correct. We felt requiring idempotency on all oplog entries, now and future, would hinder our ability to innovate. Given our multi-statement, MVCC transactions, we didn’t need idempotency. We use a single transaction to initially copy data for an initial sync, and then play the oplog forward. There are some subtleties to our algorithm which I will delve into in a future post.
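The snapshot-based initial sync described above can be sketched with a toy model. All names here are invented; the real implementation copies collections under a single MVCC snapshot transaction and replays a durable oplog.

```python
# Toy model of snapshot-based initial sync; names are illustrative.

class Primary:
    def __init__(self):
        self.data = {}      # committed documents
        self.oplog = []     # ("i", key, value) entries, in commit order

    def insert(self, key, value):
        self.data[key] = value
        self.oplog.append(("i", key, value))

    def snapshot(self):
        # MVCC snapshot: data and oplog position at a single point in time.
        return dict(self.data), len(self.oplog)

p = Primary()
p.insert("a", 1)
copy, pos = p.snapshot()   # begin sync: one consistent snapshot, one position
p.insert("b", 2)           # a write arrives while the copy is being loaded
secondary = dict(copy)
# Replay forward from the recorded position; each entry is applied exactly
# once, so nothing needs to be idempotent.
for op, key, value in p.oplog[pos:]:
    if op == "i":
        secondary[key] = value
print(secondary)  # {'a': 1, 'b': 2}
```

Because the copy is one consistent snapshot rather than “various points in time,” the replay phase never re-applies an operation the copy already contains.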
So, our desire to remove the idempotency requirement and support multi-document transactions were strong motivations.
Concurrency of Oplog Writes

MongoDB has a per-database reader/writer lock that restricts access to collections and the oplog. So, in a replica set, to write a document to a collection, the following steps are taken (disregarding journaling):

1. take the write lock on the collection’s database and on the local database, which holds the oplog
2. perform the write to the collection
3. write an entry recording the operation to the oplog
4. release the locks
This has the property that data is “committed” to the oplog in a sequential manner. Should there be a crash, upon recovery, the oplog and collections are exactly up to date to a certain point in time.
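The sequential-commit property can be sketched as follows, with the per-database lock modeled as a plain threading.Lock. The names are illustrative, not MongoDB internals.

```python
# Sketch of a MongoDB-style write path; the lock stands in for the
# per-database reader/writer lock.

import threading

db_lock = threading.Lock()
collection, oplog = {}, []

def replicated_insert(key, value):
    with db_lock:                        # take the database write lock
        collection[key] = value          # apply the write to the collection
        oplog.append(("i", key, value))  # append the oplog entry
    # Lock released here: oplog order matches write order by construction.

replicated_insert("x", 1)
replicated_insert("y", 2)
print([e[1] for e in oplog])  # ['x', 'y'] -- strictly sequential
```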
If we were to have the same behavior with TokuMX, here is what would happen:

1. take the write lock on the collection’s database and on the local database
2. perform the write to the collection, within a transaction
3. write an entry recording the transaction to the oplog
4. commit the transaction
5. release the locks
There may be several ways to do the above, but the key point is that the commit of the transaction, a non-trivial operation in TokuMX, would need to happen under some mutex to ensure data is written to the oplog sequentially. This would hurt performance.
Instead, this is what we do:

1. take the write lock on the collection’s database and a read lock on the local database
2. perform the write to the collection and write an entry to the oplog, within a transaction
3. release the locks
4. commit the transaction
Note that the commit now happens without a write lock held, and the local DB’s lock is now a read lock. This provides much better concurrency. But the consequence is that data may be committed to the oplog out of order. Suppose we have transactions A, B, and C, and the following sequence of events happens:

1. A, B, and C each write their entry to the oplog, in that order
2. A commits
3. C commits
4. a crash occurs before B commits
Now, upon recovery, there will be a gap in the oplog between transactions A and C that we need to be mindful of, and we are. But we cannot support this behavior with MongoDB’s replication. So, this is another reason why we chose to change the replication.
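Here is a toy model of that interleaving, using GTID-like sequence numbers. The names are invented; the point is only that an entry can exist in the oplog without its transaction having committed.

```python
# Toy model of out-of-order commit; names are illustrative.

oplog_entries = {}    # gtid -> payload, written while transactions run
committed = set()     # gtids whose transactions committed before the crash

def write_entry(gtid, payload):
    oplog_entries[gtid] = payload   # entry written, but not yet durable

def commit(gtid):
    committed.add(gtid)

# A, B, and C write their entries in order, but only A and C commit:
write_entry(1, "A")
write_entry(2, "B")
write_entry(3, "C")
commit(1)
commit(3)
# -- crash here, before B commits --

durable = sorted(g for g in oplog_entries if g in committed)
print(durable)  # [1, 3] -- a gap at GTID 2 that recovery must handle
```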
Fully Leveraging Fractal Tree indexes
As I mentioned in the first paragraph, MongoDB has some great replication features on secondaries to improve performance, namely prefetch threads to bring data into memory in order to parallelize the I/O required to keep the secondary up to date. With TokuMX, we did not want our secondaries to parallelize their I/O to keep up, we wanted them to use Fractal Tree indexes to drastically reduce the I/O.
Fractal Tree indexes provide wonderful performance on writes, but do nothing special for reads. After all, if data resides on disk, no matter what your data structure is, I/O will be required to bring that data into memory. So, to fully leverage Fractal Tree indexes, we really really really don’t want to perform a read in order to do a write.
Unfortunately, MongoDB’s oplog has entries that require reads before we can perform the writes. That is a fine design for MongoDB’s B-tree based storage engine: B-trees require I/O to do writes anyway, so the additional I/O to first perform a read costs relatively little. But with Fractal Tree indexes, this read requirement hurts, and brings our performance down greatly.
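To illustrate, here is a hedged contrast between an oplog entry that carries only the changed fields, which forces a read-modify-write on the secondary, and one that carries the full post-image of the document, which allows a blind write that a Fractal Tree index can buffer as a message. Both apply functions are invented for illustration.

```python
# Toy contrast: read-before-write vs. blind write on a secondary.

store = {"doc1": {"a": 1, "b": 2}}
reads = 0

def apply_partial_update(key, fields):
    # Entry carries only the changed fields: the secondary must read the
    # old document to produce the new one (read-before-write).
    global reads
    reads += 1
    doc = dict(store[key])
    doc.update(fields)
    store[key] = doc

def apply_full_image(key, new_doc):
    # Entry carries the complete new document: a blind write, which a
    # write-optimized structure can absorb with no read at all.
    store[key] = new_doc

apply_partial_update("doc1", {"b": 3})
apply_full_image("doc1", {"a": 1, "b": 4})
print(reads)          # 1 -- only the partial-format entry needed a read
print(store["doc1"])  # {'a': 1, 'b': 4}
```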
So, instead of leveraging the existing I/O parallelization algorithms of MongoDB, we decided to change the oplog format so that we could leverage Fractal Tree indexes and get great I/O reduction on secondaries.
Hopefully, this post has given insight on why we chose the path we did for implementing TokuMX replication. In future posts, I will describe in much greater detail how TokuMX replication works.