This is the second post in a series of posts that explains Ark, a consensus algorithm we’ve developed for TokuMX and MongoDB to fix known issues in elections and failover. The tech report we released last week describes the algorithm in full detail. These posts are a layman’s explanation.
In the first post, I described the various moving parts in MongoDB replication and failover:
- elections, failover, and heartbeats
- replication rollback
- write acknowledgement
In this post, I want to zero in on elections and describe how they currently work in detail. Kristina Chodorow has a really good explanation on elections here that really helped me while we were developing TokuMX. My explanation will focus on the threading model.
The threads that impact failover are the following:
- Data sync threads: These threads are responsible for copying changes from some other member in the replica set, and acknowledging their arrival. This is the central purpose of replication.
- Heartbeat threads: Every two seconds, a member requests a heartbeat from another member to get status information such as oplog position, whether they are a primary, and to verify that they are available. So, in a five node replica set, each member has four threads, each requesting a heartbeat from another member in the set every two seconds.
- Manager thread: This thread has several responsibilities, but as far as failovers are concerned, this is the thread that decides to run an election. Given reason to run an election, which we will dive into below, this thread runs an election in an attempt to transition itself from secondary to primary.
- Voting threads: These threads handle requests from other members in the replica set that aim to elect themselves as primary. A member may be handling a vote request from another member on a voting thread while simultaneously trying to elect itself on the manager thread.
An important thing to note: these threads work mostly independently from each other. While the following events are unlikely, they are possible:
- Member A may be replicating off of member B even though A’s manager thread is acting as though B is down.
- Member A may be running an election even though it thinks the primary is up.
- Member A may be voting to replace a primary even while it is syncing from that primary.
The threading model is both a blessing and a curse. It’s a blessing in that the responsibilities of the various threads are very simple to understand. For example, the data sync threads simply replicate data. That is it. They don’t concern themselves with whether an election is taking place, so there’s little complexity in their inner workings. The model is a curse in that there is very little synchronization or shared information between the threads. Elections, at a high level, depend on information in the oplog such as “how far along is the oplog”, and this may change during the election.
With that said, let’s examine the existing election protocol.
We’ll call the current primary member OP (for original primary), and what will become the new primary (because it is ahead of everyone else) member NP (for new primary).
Suppose there is a network partition such that OP is disconnected from the replica set. At a high level, there are two independent things that must happen:
- NP notices that it cannot reach OP, looks at what it knows of the state of the replica set (via heartbeat messages it has received), and says “I think I’ll make a good primary”. NP then proceeds to elect itself.
- Independently, OP notices that it cannot reach a majority of the set, and decides to transition from primary → secondary.
So, right off the bat, one thing we should think about is what happens if one of these steps doesn’t happen. Specifically, what happens if NP successfully elects itself because OP is temporarily disconnected, but OP never gets around transitioning to secondary before getting reconnected to the replica set. In short, we temporarily have two primaries.
Now let’s examine how NP goes about electing itself.
The current election process has two phases. The Ark tech report refers to these as a speculative election and an authoritative election.
In the first phase, NP asks every other member in the set, “I believe the primary is down and that I ought to step up and become the new primary. If I proceed to try to elect myself, will you vote yes?” This purpose of this speculative election is to give NP a good idea of whether an actual (authoritative) election will succeed. In the second phase, NP asks every other member in the set, “I want to become primary, please vote.”, and if a majority of votes come back saying “yes”, NP becomes primary.
Now let’s discuss how these phases work in more detail. In doing so, I will work backwards. I will discuss the second phase, the authoritative election, first. In doing so, I will show why these two phases exist.
In an authoritative election, NP has decided to try to become primary and sent messages to every other member requesting a vote. Other members process this vote on voting threads. When each member gets this message, it does the following:
- Each member has the choice of voting “yes”, “no”, or “veto”. If any member votes “veto”, the election fails, regardless of whether a majority voted “yes”. Reasons a member may choose to veto (e.g. NP’s config version is stale) are not interesting for this discussion.
- If the member participating in the election has voted “yes” for anyone in the last 30 seconds, then vote “no”. A member may vote “yes” only once every thirty seconds.
Otherwise, vote “yes”. Don’t bother looking at the oplog’s position (as I’ll point out below, this is done in the first phase). This last fact is strange, but true, which I pointed out in https://groups.google.com/forum/?hl=en-US&fromgroups#!topic/mongodb-dev/lH3hs8h7NrE.
- Note that by voting “yes”, this member will not vote “yes” for anyone else in 30 seconds. Although I cannot be sure, I suspect this is to make it difficult for two primaries to be elected in a short period of time.
If NP gets a majority of “yes” votes, NP becomes primary.
So why did I present the authoritative election first? It’s to demonstrate the following point: while successful elections are great, failed elections may have severe consequences. A failed election may have a non-negligible (but non-majority) subset of the replica set vote yes, and as a result disqualify themselves from voting in other elections for 30 seconds. This makes having successful elections in the next 30 seconds more difficult, and can lead to long periods of downtime while the set struggles to elect a new primary.
Because failed authoritative elections may have severe consequences, we only want to run them when we have good reason to believe that they will be successful. That is the purpose of the speculative election: to give NP good reason to believe its election will be successful, greatly reducing the probability of having a failed authoritative election.
Now let’s discuss some of the details of the first phase. NP sends a message to all other members asking “should I try to elect myself?”. With the replies NP learns:
- Whether anyone would veto the election, thereby making NP’s election impossible. If so, NP does not proceed to the second phase. One possible reason is that while NP does not see OP, the original primary, other members do. If that is the case, these other members will tell NP to not bother with an election.
- Whether any member has an oplog that is ahead of NP’s oplog. If so, NP does not try to elect itself. After all, we want elections to elect the secondary that is furthest ahead. At the moment, this is the only synchronization done between elections and data sync threads.
That’s the election protocol, as it stands today. So what can go wrong? Theoretically, quite a bit.
High level issues with the elections
Problem 1: Problems with the existing implementation.
Charity Majors, in a talk at MongoDB World (not the keynote), said that at times elections may take 5-7 minutes to elect a primary. This doesn’t sound like the intended design and sounds like a bug. While developing Ark, we found issues with the existing MongoDB and TokuMX implementation that could theoretically lead to this behavior.
A key property of the design is essentially “use speculative elections to avoid running failed authoritative elections, because a failed authoritative election may lead to nobody getting elected for at least 30 seconds.” We think the following bugs that we’ve filed (which are now fixed with Ark) may lead to failed authoritative elections that speculative elections could have avoided:
Problem 2: Too little synchronization with replication threads.
Currently, the only time when elections and data sync threads communicate is when a voting thread responding to a speculative election peeks at the oplog to determine if NP is far enough ahead. This is not sufficient. Minimally, there ought to be some communication in the authoritative election. Additionally, while elections are happening, this member’s replication threads may be replicating and acknowledging data. This replicated and acknowledged data may be later rolled back because of the election this member is participating in.
Problem 3: Dealing with multiple primaries
The election protocol is designed to make the probability of having multiple primaries very low. Nevertheless, there may be periods of time where multiple primaries exist. These situations need to be resolved predictably, and currently, they are not.
These latter two problems are what causes writes that get acknowledged with majority write concern to possibly get rolled back. In my next post, I’ll explain why. As I mentioned at the beginning, if you don’t want to wait until the next post, you can read the tech report that has all the details.