As I mentioned in my last post, TokuMX replication is completely incompatible with MongoDB replication. Replica sets (and sharded clusters, but that is for another post) must be either entirely TokuMX or entirely MongoDB. This is by design. While elections and failover are basically the same, we have completely changed the oplog protocol.
In this next series of posts, I will describe how TokuMX replication keeps data in sync between machines in the replica set. Along the way, I will cover the challenges we faced and the algorithms we developed to address them. With so much changed, I think the best way to understand how TokuMX replication now works is to first understand what the oplog looks like.
So, let’s peek at a sample TokuMX oplog entry, and compare it to MongoDB’s oplog entry.
Suppose I run the following update that modifies two documents:
    rs0:PRIMARY> db.foo.update({a:1},{$inc : {b:1}}, {multi:true})
In MongoDB, this generates the following two oplog entries:
    {
        "ts" : Timestamp(1395630045, 1),
        "h" : NumberLong("-5671976232760685793"),
        "v" : 2,
        "op" : "u",
        "ns" : "test.foo",
        "o2" : {
            "_id" : 1
        },
        "o" : {
            "$set" : {
                "b" : 11
            }
        }
    }
    {
        "ts" : Timestamp(1395630045, 2),
        "h" : NumberLong("-4250692499231572273"),
        "v" : 2,
        "op" : "u",
        "ns" : "test.foo",
        "o2" : {
            "_id" : 2
        },
        "o" : {
            "$set" : {
                "b" : 21
            }
        }
    }
With TokuMX, performing that update generates the following single oplog entry:
    {
        "_id" : BinData(0,"AAAAAAAAAAEAAAAAAAAABQ=="),
        "ts" : ISODate("2014-03-24T03:18:05.181Z"),
        "h" : NumberLong("2597381443224792352"),
        "a" : true,
        "ops" : [
            {
                "op" : "ur",
                "ns" : "test.foo",
                "pk" : {
                    "" : 1
                },
                "o" : {
                    "_id" : 1,
                    "a" : 1,
                    "b" : 10
                },
                "m" : {
                    "$inc" : {
                        "b" : 1
                    }
                }
            },
            {
                "op" : "ur",
                "ns" : "test.foo",
                "pk" : {
                    "" : 2
                },
                "o" : {
                    "_id" : 2,
                    "a" : 1,
                    "b" : 20
                },
                "m" : {
                    "$inc" : {
                        "b" : 1
                    }
                }
            }
        ]
    }
Right off the bat, you’ll notice several differences. What I’d like to do now is introduce what each piece of the TokuMX oplog entry means.
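One way to read the "ops" array in the sample entry is sketched below in plain JavaScript. This is only an illustration, and it rests on an assumption the sample suggests but does not state: that "o" holds the document as it was before the update and "m" holds the modifier to apply to it. The helper `applyInc` is hypothetical, not part of TokuMX, and handles only `$inc` because that is the only modifier in the sample.

```javascript
// Hypothetical helper: apply a {$inc: {...}} modifier to a copy of a document.
// Only $inc is handled, since it is the only modifier in the sample entry.
function applyInc(doc, mods) {
  const result = Object.assign({}, doc); // shallow copy; the sample documents are flat
  const incs = mods.$inc || {};
  for (const field of Object.keys(incs)) {
    result[field] = (result[field] || 0) + incs[field];
  }
  return result;
}

// The "ops" array copied from the sample TokuMX oplog entry.
const ops = [
  { op: "ur", ns: "test.foo", pk: { "": 1 }, o: { _id: 1, a: 1, b: 10 }, m: { $inc: { b: 1 } } },
  { op: "ur", ns: "test.foo", pk: { "": 2 }, o: { _id: 2, a: 1, b: 20 }, m: { $inc: { b: 1 } } },
];

// Assumption: "o" is the document before the update and "m" is the modifier.
const postImages = ops.map((entry) => applyInc(entry.o, entry.m));
console.log(postImages.map((d) => d.b)); // prints [ 11, 21 ]
```

Under that assumption, the resulting values of b are 11 and 21, the same values MongoDB records with `$set` in its two separate oplog entries.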
Not every TokuMX oplog entry will have an “ops” field. The reason is that sometimes a large transaction does more work than can fit (or than we would want to fit) in a single oplog entry. For such transactions, an oplog entry may look as follows:
    {
        "_id" : BinData(0,"AAAAAAAAAAEAAAAAAAAABQ=="),
        "ts" : ISODate("2014-03-24T03:18:05.181Z"),
        "h" : NumberLong("2597381443224792352"),
        "a" : true,
        "ref" : ObjectId("532fa922daaf6e2b4e0ceea5")
    }
Note we now have a field named “ref”. This field is a reference into the oplog.refs collection. In this synthetic example, the oplog.refs collection now has the following documents:
    {
        "_id" : {
            "oid" : ObjectId("532fa922daaf6e2b4e0ceea5"),
            "seq" : NumberLong(3)
        },
        "ops" : [
            {
                "op" : "ur",
                "ns" : "test.foo",
                "pk" : {
                    "" : 1
                },
                "o" : {
                    "_id" : 1,
                    "a" : 1,
                    "b" : 11
                },
                "m" : {
                    "$inc" : {
                        "b" : 1
                    }
                }
            }
        ]
    }
    {
        "_id" : {
            "oid" : ObjectId("532fa922daaf6e2b4e0ceea5"),
            "seq" : NumberLong(5)
        },
        "ops" : [
            {
                "op" : "ur",
                "ns" : "test.foo",
                "pk" : {
                    "" : 2
                },
                "o" : {
                    "_id" : 2,
                    "a" : 1,
                    "b" : 21
                },
                "m" : {
                    "$inc" : {
                        "b" : 1
                    }
                }
            }
        ]
    }
The _id field has two parts: the reference that was stored in the oplog.rs collection, and a sequence number that links the different documents sharing that reference. Please note that this example is synthetic and not realistic. In reality, the array of ops in these entries will be quite large (by default, the first one should be at least 1MB). This provides a mechanism for storing the operations of large transactions that cannot be stored in a single entry.
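To make the link between the two collections concrete, here is a minimal sketch in plain JavaScript of how the operations of such a transaction could be gathered back together. It is not TokuMX's actual code. The function `collectOps` is hypothetical, ObjectIds are represented as plain strings and sequence numbers as plain numbers, and it assumes that ascending "seq" gives the order in which the operations should be read.

```javascript
// Hypothetical helper: return all operations belonging to one oplog entry.
// If the entry carries "ops" inline, use them directly. Otherwise follow "ref"
// into the oplog.refs documents and concatenate their "ops" in "seq" order.
function collectOps(oplogEntry, refsDocs) {
  if (Array.isArray(oplogEntry.ops)) {
    return oplogEntry.ops;
  }
  return refsDocs
    .filter((doc) => doc._id.oid === oplogEntry.ref) // same reference only
    .sort((x, y) => x._id.seq - y._id.seq)           // assumed: seq gives the order
    .reduce((all, doc) => all.concat(doc.ops), []);
}

// The oplog entry with a "ref" field, reduced to the fields used here.
const entry = { a: true, ref: "532fa922daaf6e2b4e0ceea5" };

// The two oplog.refs documents from the example, deliberately listed out of order.
const refs = [
  { _id: { oid: "532fa922daaf6e2b4e0ceea5", seq: 5 },
    ops: [{ op: "ur", ns: "test.foo", pk: { "": 2 }, o: { _id: 2, a: 1, b: 21 }, m: { $inc: { b: 1 } } }] },
  { _id: { oid: "532fa922daaf6e2b4e0ceea5", seq: 3 },
    ops: [{ op: "ur", ns: "test.foo", pk: { "": 1 }, o: { _id: 1, a: 1, b: 11 }, m: { $inc: { b: 1 } } }] },
];

const allOps = collectOps(entry, refs);
console.log(allOps.map((op) => op.pk[""])); // prints [ 1, 2 ]
```

The same function returns the inline array unchanged for an entry that has an "ops" field, so a reader of the oplog can treat both kinds of entry uniformly.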
Hopefully, the layout of the TokuMX oplog is now clear. You may wonder why we made some of these choices, and I hope to address that as this series of posts progresses.