This is the fourth post in a series of posts that explains Ark, a consensus algorithm we’ve developed for TokuMX and MongoDB to fix known issues in elections and failover. The tech report we released describes the algorithm in full detail. These posts are a layman’s explanation. This post assumes the reader is familiar with the first three parts. In this post, I describe how Ark fixes the existing problems.
In my last three posts, I’ve described the existing problems with elections and failover. Aside from miscellaneous behavioral improvements (such as SERVER-14382 or SERVER-14531), the problems that must be solved are:
- When a replica set has two primaries, the two primaries should never produce oplog entries whose positions interleave, and primary that produces smaller oplog positions should step down.
Threads that sync data and acknowledge writes need to communicate more with threads that vote for primaries:
- A member should not vote for a primary that will roll back a write it has acknowledged.
- A member should not acknowledge a write that may be rolled back due to an election it has already voted “yes” in.
In short, Ark’s key behavioral changes are to fix the two problems described above.
Fixing multiple primaries
The first thing Ark does is ensure that two primaries never produce oplog entries whose positions interleave. Here is how:
Unlike MongoDB, TokuMX does not use a timestamp as the oplog’s position identifier, also known as a global transaction identifier or GTID. TokuMX uses a pair of sequence numbers. Let’s call the GTIDs sequence numbers (term, opid). The GTID works as follows:
- Each oplog entry that a given primary produces will have a unique, increasing opid. So, member A may produce entries (5, 100), (5, 101), (5, 102)…
- Each successful election will have a unique term. When member A gets elected, it does so with the understanding that it will produce oplog entries with a term of 5, starting with (5,0).
- The GTID is compared lexicographically. For example, (5, 101) < (5, 102) < (6, 0).
If Ark can ensure that each successful election produces a unique term, two primaries never produce oplog entries whose positions interleave since GTID terms are compared first. Here is how Ark ensures that each successful election produces a unique term:
- Ark still has the two phases of elections referenced in part 2, the speculative and authoritative phase. In the authoritative phase, the member trying to vote itself primary does so with a proposed term. If member A has reached the authoritative phase and is trying to elect itself, it sends a message to all other members saying “I wish to become primary with a term of 5”.
- Each member voting in an election only votes yes for members with terms that are greater than any other terms it has voted for so far. So, if member B has already voted yes in an election for a member with a term of 13, it will automatically vote no in any subsequent election with a term <= 13, but is allowed to vote yes in any subsequent election with a term > 13.
These two simple rules ensure that no member votes yes in two separate elections with the same term. Because each successful election requires a majority, each pair of consecutive successful elections have at least one member in common. Therefore, no two successful elections will ever share the same term.
For those familiar with Raft, this term is similar to the election term described in Raft.
We’ve now described how Ark ensures that two primaries never produce oplog entries whose positions interleave. Now let’s explain how Ark ensures that the primary that produces smaller oplog entries steps down.
In MongoDB, a primary steps down under the following scenarios:
- It cannot reach a majority of the replica set.
- It sees another primary.
These conditions are not sufficient. As described in SERVER-9848, two primaries may see a majority of the set, but not each other. As a result, neither will step down. To fix this, Ark uses heartbeats to exchange election information between members. Thanks to heartbeats, members know the highest term that other members have voted for. If a primary sees that a member has voted for a higher term than its own, it steps down.
Now, dual primaries are guaranteed to resolve themselves properly. If two primaries exist, and both can reach a majority of the set, then both must have access to the highest term voted for in an election. This causes the primary with the smaller term to step down. If the primary with the smaller term doesn’t see a higher term, then it must not see a majority of the set, and will step down for that reason.
These changes cover how Ark resolves multiple primary situations.
Acknowledging Writes and Elections
The last problem is that of write acknowledgement. As I’ve mentioned in previous posts, the current problem is not that secondaries accept writes that they should not, because writes can roll back. The problem is that secondaries blindly acknowledge any write they receive, regardless of whether the write may later be rolled back. To fix this, we do the following:
A member never votes for a primary that will roll back a write it has acknowledged:
- When requesting a vote, the primary shares its oplog position.
- If the primary’s position is behind the voting member’s current position it doesn’t vote yes.
A member should not acknowledge a write that may be rolled back due to an election it has already voted “yes” in:
- Before acknowledging a write, a member checks to see if the write’s term is less than the highest term for which it has already voted. If so, the write is not acknowledged.
- For example, suppose a member wants to acknowledge (5,100), because it has just processed that oplog entry. If it notices that it has already voted yes in an election with term 6, the write is not acknowledged. This is because the member cannot acknowledge if its yes vote will cause this write to roll back.
How Majority Write Concern is Fixed
Let’s see if we can convince ourselves that writes that are acknowledged with a majority write concern never roll back.
Suppose a write has been acknowledged with majority write concern. Because each primary has its own term, we know that the only way this member loses its primary status is on some subsequent successful election with a higher term. That means the only way the write can roll back is because of a subsequent election. We know that any successful election must have a majority of “yes” votes, which must overlap with the majority that that acknowledged this write. As established above, each member in the overlap will only vote yes if it knows the potential primary will not roll back the write, meaning the potential new primary must also have the write and will not roll it back.
The original MongoDB election algorithm would allow a member to acknowledge a write but also vote “yes” for a new primary that would roll it back. Therefore, those members in the overlap did not actually protect the write as their acknowledgement should have guaranteed they would.