Last week, we introduced Ark, a consensus algorithm similar to Raft and Paxos we’ve developed for TokuMX and MongoDB. The purpose of Ark is to fix known issues in elections and failover. While the tech report detailing Ark explains everything formally, over the next few blog posts, I will go over Ark in layman’s terms.
Note that everything I explain in these blog posts is covered in the tech report, so if you’ve read the tech report, you won’t find anything new here.
In this first post, I only want to set the scene, and describe what the various important replication components related to elections and failover are. Those familiar with MongoDB may already know this.
There are three important components we’ll discuss here:
First, for those unfamiliar with MongoDB, let’s start with the basics.
What is MongoDB’s replication model?
The purpose of replication is to have multiple machines, or members, store the exact same data. It works as follows:
The collection of members that make up the primary and secondaries is called a “replica set”. Additionally, note that MongoDB replication is asynchronous. Secondaries are responsible for pulling data off other members that are further ahead of them, and announcing their progress. Primaries are not responsible for pushing data to secondaries to ensure that secondaries are up to date.
Now let’s introduce the other three components.
Elections and failover
One design goal for MongoDB replication is to be highly available. That means should the primary become unavailable, be it due to a crash, getting disconnected from the network, or some other reason, the rest of the replica set ought to notice and work together to select a new primary, so that the system may resume accepting writes. The process by which they arrive at a consensus to select a new primary is called, naturally, a consensus algorithm.
If the primary becomes unreachable by other members of the replica set, the other members will try to hold an “election“, a process by which they choose a new primary to start accepting user writes. A majority of members in the replica set are required to elect a new primary. Note that majority means greater than half, so if we have four members in the replica set, we need three to elect a primary.
Every two seconds, all members exchange information with each other, in what are called heartbeats. The two most important things exchanged as of now are the member’s current oplog position and the fact that a member is still up and responding. In MongoDB, a replica set member (be it primary, secondary, or arbiter) is deemed unreachable if a heartbeat fails.
A typical manner in which an election may take place is as follows:
Replication Rollback
Another important concept is rollback. Take the following scenario:
Note that due to the design decision to make MongoDB replication asynchronous, rollback is always a possibility.
Write acknowledgement
Traditionally, in single node databases without replication, users consider data safe if a copy of the write operation is written to some recovery log (MongoDB calls this the journal), and the log is synced to disk. In a highly available system, this is not very effective, because if the primary becomes unavailable and a failover happens, whether the data is in the primary’s recovery log is irrelevant, since it still may be rolled back. What is relevant is whether the data is on the machine that takes over during failover.
MongoDB has a concept of “write concern” that gives the user an acknowledgement that a write has successfully been replicated to a member. So, if we have a replica set of five members, and a user requests a write acknowledgement from 2 members, or, “w : 2”, this means the user is waiting to know that his write has successfully made it to 2 members of the set. A write concern of “w : majority” means the user is waiting to know that his write has successfully made it to a majority of members.
While users can choose along a spectrum of choices for write concern (Eliot Horowitz, CTO of MongoDB, has a good presentation on this topic), the write concern that we are most interested in is “majority”. Theoretically, if a write has been acknowledged by a majority of the replica set, then any member that gets elected ought to include that write. Here is why:
As we will show in future blogs, this is not currently the case. Ensuring this property is Ark’s biggest goal.
With this, the basic components that make up MongoDB replication and failover have been described. In my next post, I will start digging into the existing consensus algorithm used by MongoDB and TokuMX, and describe the problems it causes.