This is the third post in a series of posts that explains Ark, a consensus algorithm we’ve developed for TokuMX and MongoDB to fix known issues in elections and failover. The tech report we released last week describes the algorithm in full detail. These posts are a layman’s explanation.
In the first post, I covered the basics of replication: how it works and its important components: syncing data (or basic replication), elections, replication rollback, and write acknowledgement. In the second post, I explained the current election algorithm and discussed some high-level problems. In this post, I assume the reader is familiar with the first two posts and discuss why data that has been successfully acknowledged with majority write concern may be lost in a failover.
The best way to demonstrate how data may be lost is with examples. For each of our examples, suppose things start out as follows. We have a five node replica set: n1, n2, n3, n4, and n5, with the following setup:
Example 1
Suppose the following sequence of events happens:
Now we have two primaries, n1 and n3. One needs to step down. If n1 steps down, nothing bad happens, but if n3 steps down, a write that was acknowledged with majority write concern will be lost.
In MongoDB 2.4, this was quite possible. In dual-primary scenarios, which primary stepped down was essentially arbitrary. SERVER-9765 addressed this issue. Now, in 2.6, a primary uses the timestamp of the last election to determine which primary should step down. In the example above, because n3 was elected at a later time than n1, n3 ought to remain primary and n1 ought to step down. Because members may participate in an election at most once every thirty seconds, the minimum amount of time between successful elections ought to be thirty seconds. That being said, I don’t understand how clock skew between different members of the replica set impacts this algorithm.
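As a rough sketch of the step-down rule described above (this is my illustration, not MongoDB's actual implementation): when two primaries discover each other, the one with the earlier election time steps down. Note that, as mentioned, this comparison implicitly assumes the two nodes' clocks are comparable.

```python
# Hypothetical sketch of the SERVER-9765-style step-down rule: the
# primary elected earlier steps down, the later one remains. Names and
# structure are illustrative only.

from dataclasses import dataclass

@dataclass
class Primary:
    name: str
    election_time: int  # seconds since epoch, per that node's clock

def who_steps_down(a: Primary, b: Primary) -> Primary:
    """The primary with the earlier election time steps down."""
    return a if a.election_time < b.election_time else b

n1 = Primary("n1", election_time=1000)
n3 = Primary("n3", election_time=1031)  # elected at least 30s later

assert who_steps_down(n1, n3).name == "n1"  # n1 steps down, n3 remains
```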
Example 2
This example expands a bit on the first one:
Now we have a big problem. Both n1 and n3 are primary, and each has performed a different write that has been acknowledged by a majority of the replica set. One needs to step down and roll back its write. Whichever does will have rolled back a write that was acknowledged with majority write concern.
This example demonstrates an underlying problem of MongoDB’s replication: a timestamp is used to determine oplog position (i.e., as the GTID). The symptom this causes is that two different primaries may produce oplog entries whose positions interleave. Because write B happens after write A in wall-clock time, the timestamp stored in B’s oplog entry on n1 is greater than that of A’s oplog entry, which has been replicated to n4. As a result, n4 thinks that n1 is farther ahead, and thinks it’s OK to start replicating off n1. As soon as this decision is made, n4 will replicate and acknowledge B, which allows both writes to be acknowledged by a majority.
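The flawed comparison can be sketched as follows (a toy illustration, not MongoDB internals): if oplog position is just a timestamp, then n4, which already holds A, concludes that n1 is farther ahead simply because B carries a later timestamp, even though B came from a different primary.

```python
# Toy sketch of the sync-source decision when oplog position is just a
# timestamp. The oplog structure and field names here are illustrative.

def latest_position(oplog):
    """Oplog position here is simply the greatest timestamp in the log."""
    return max(entry["ts"] for entry in oplog)

n1_oplog = [{"ts": 105, "op": "B"}]  # B, written later in wall-clock time on n1
n4_oplog = [{"ts": 100, "op": "A"}]  # A, replicated to n4 from n3

# n4 compares positions: n1 looks "farther ahead", so n4 decides to sync
# from n1 and will replicate and acknowledge B.
assert latest_position(n1_oplog) > latest_position(n4_oplog)
```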
TokuMX does not use timestamps to determine oplog position in its current election algorithm (preceding Ark). As a result, the problem described above does not impact TokuMX. However, TokuMX does have the problem where two different primaries produce oplog entries whose positions interleave. The existing thirty-second timer between elections makes the problem harder to explain, but it does exist.
Nevertheless, solving these issues does not fix the election protocol. There is one more problem. Here is one final example.
Example 3
This example is similar to the preceding two, in that n1 is a primary, and after a partition, n3 becomes primary. Let’s suppose that the resolution of dual primaries is robust and works. That is, we can deterministically predict that the primary that was elected earlier will step down. Also, let’s suppose that the positions of oplog entries produced do not interleave. That is, any entry produced by n3 (the later election) is guaranteed to come after any entry produced by n1 (the earlier election). We still have the following problem.
The problem here is that the processes of write acknowledgement and voting for primaries do not communicate enough. n4 and n5 vote for n3, and then acknowledge a write that could be rolled back, in part because of the vote they just performed. This is the main underlying problem that TokuMX’s and MongoDB’s current election protocols do not take into account, and this is the problem Ark rectifies.
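To make the missing coupling concrete, here is a rough sketch of the kind of invariant needed. This is my illustration, not Ark’s actual protocol (that comes in the next posts): once a node has voted in a later election, it must stop acknowledging writes from any primary elected earlier.

```python
# Illustrative sketch only: each election has an increasing id, and a
# node that has voted in election e refuses to acknowledge writes from
# primaries elected before e. Names and structure are hypothetical.

class Node:
    def __init__(self):
        self.latest_vote = 0  # highest election id this node has voted in

    def vote(self, election_id: int):
        self.latest_vote = max(self.latest_vote, election_id)

    def can_acknowledge(self, primary_election_id: int) -> bool:
        # Refuse writes from primaries elected before our latest vote.
        return primary_election_id >= self.latest_vote

n4 = Node()
n4.vote(2)                        # n4 votes for n3 in election 2...
assert not n4.can_acknowledge(1)  # ...so it refuses writes from n1 (election 1)
assert n4.can_acknowledge(2)      # but still acknowledges writes from n3
```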
So, in summary, the underlying issues in the existing election protocol are as follows:
Making sure these properties hold is what Ark does. In my next post, I’ll start explaining Ark in detail.