This is the second post in a series of posts that explains Ark, a consensus algorithm we’ve developed for TokuMX and MongoDB to fix known issues in elections and failover. The tech report we released last week describes the algorithm in full detail. These posts are a layman’s explanation.
In the first post, I described the various moving parts in MongoDB replication and failover:
In this post, I want to zero in on elections and describe how they currently work in detail. Kristina Chodorow has a really good explanation on elections here that really helped me while we were developing TokuMX. My explanation will focus on the threading model.
The threads that impact failover are the following:
An important thing to note: these threads work mostly independently from each other. While the following events are unlikely, they are possible:
The threading model is both a blessing and a curse. It’s a blessing in that the responsibilities of the various threads are very simple to understand. For example, the data sync threads simply replicate data. That is it. They don’t concern themselves with whether an election is taking place, so there’s little complexity in their inner workings. The model is a curse in that there is very little synchronization or shared information between the threads. Elections, at a high level, depend on information in the oplog such as “how far along is the oplog”, and this may change during the election.
With that said, let’s examine the existing election protocol.
We’ll call the current primary member OP (for original primary), and what will become the new primary (because it is ahead of everyone else) member NP (for new primary).
Suppose there is a network partition such that OP is disconnected from the replica set. At a high level, there are two independent things that must happen:
So, right off the bat, one thing we should think about is what happens if one of these steps doesn’t happen. Specifically, what happens if NP successfully elects itself because OP is temporarily disconnected, but OP never gets around transitioning to secondary before getting reconnected to the replica set. In short, we temporarily have two primaries.
Now let’s examine how NP goes about electing itself.
The current election process has two phases. The Ark tech report refers to these as a speculative election and an authoritative election.
In the first phase, NP asks every other member in the set, “I believe the primary is down and that I ought to step up and become the new primary. If I proceed to try to elect myself, will you vote yes?” This purpose of this speculative election is to give NP a good idea of whether an actual (authoritative) election will succeed. In the second phase, NP asks every other member in the set, “I want to become primary, please vote.”, and if a majority of votes come back saying “yes”, NP becomes primary.
Now let’s discuss how these phases work in more detail. In doing so, I will work backwards. I will discuss the second phase, the authoritative election, first. In doing so, I will show why these two phases exist.
In an authoritative election, NP has decided to try to become primary and sent messages to every other member requesting a vote. Other members process this vote on voting threads. When each member gets this message, it does the following:
If NP gets a majority of “yes” votes, NP becomes primary.
So why did I present the authoritative election first? It’s to demonstrate the following point: while successful elections are great, failed elections may have severe consequences. A failed election may have a non-negligible (but non-majority) subset of the replica set vote yes, and as a result disqualify themselves from voting in other elections for 30 seconds. This makes having successful elections in the next 30 seconds more difficult, and can lead to long periods of downtime while the set struggles to elect a new primary.
Because failed authoritative elections may have severe consequences, we only want to run them when we have good reason to believe that they will be successful. That is the purpose of the speculative election: to give NP good reason to believe its election will be successful, greatly reducing the probability of having a failed authoritative election.
Now let’s discuss some of the details of the first phase. NP sends a message to all other members asking “should I try to elect myself?”. With the replies NP learns:
That’s the election protocol, as it stands today. So what can go wrong? Theoretically, quite a bit.
High level issues with the elections
Problem 1: Problems with the existing implementation.
Charity Majors, in a talk at MongoDB World (not the keynote), said that at times elections may take 5-7 minutes to elect a primary. This doesn’t sound like the intended design and sounds like a bug. While developing Ark, we found issues with the existing MongoDB and TokuMX implementation that could theoretically lead to this behavior.
A key property of the design is essentially “use speculative elections to avoid running failed authoritative elections, because a failed authoritative election may lead to nobody getting elected for at least 30 seconds.” We think the following bugs that we’ve filed (which are now fixed with Ark) may lead to failed authoritative elections that speculative elections could have avoided:
Problem 2: Too little synchronization with replication threads.
Currently, the only time when elections and data sync threads communicate is when a voting thread responding to a speculative election peeks at the oplog to determine if NP is far enough ahead. This is not sufficient. Minimally, there ought to be some communication in the authoritative election. Additionally, while elections are happening, this member’s replication threads may be replicating and acknowledging data. This replicated and acknowledged data may be later rolled back because of the election this member is participating in.
Problem 3: Dealing with multiple primaries
The election protocol is designed to make the probability of having multiple primaries very low. Nevertheless, there may be periods of time where multiple primaries exist. These situations need to be resolved predictably, and currently, they are not.
These latter two problems are what causes writes that get acknowledged with majority write concern to possibly get rolled back. In my next post, I’ll explain why. As I mentioned at the beginning, if you don’t want to wait until the next post, you can read the tech report that has all the details.