Last week, we introduced Ark, a consensus algorithm similar to Raft and Paxos we’ve developed for TokuMX and MongoDB. The purpose of Ark is to fix known issues in elections and failover. While the tech report detailing Ark explains everything formally, over the next few blog posts, I will go over Ark in layman’s terms.
Note that everything I explain in these blog posts is covered in the tech report, so if you’ve read the tech report, you won’t find anything new here.
In this first post, I only want to set the scene, and describe what the various important replication components related to elections and failover are. Those familiar with MongoDB may already know this.
There are three important components we’ll discuss here:
- replication rollback
- write acknowledgement
First, for those unfamiliar with MongoDB, let’s start with the basics.
What is MongoDB’s replication model?
The purpose of replication is to have multiple machines, or members, store the exact same data. It works as follows:
- One member in the set is designated as primary.
- The primary is the only member that accepts writes from users. Details of the write are written to the oplog (similar to the binary log in other databases, this can also be thought of as the replication log).
- All other members in the set are secondaries.
- Entries in the oplog have a position associated with them, that defines the order of writes. Secondaries apply modifications in the order they appear in the oplog.
- Secondaries cannot be modified by users. Secondaries constantly pull data that originates from the primary’s oplog, save the data in its own oplog, and apply it locally.
- An important point: the secondary does not have to pull data directly from the primary. Any secondary is allowed to pull data from any other member in the replica set, as long as the that other member’s position is ahead of the secondary’s position.
The collection of members that make up the primary and secondaries is called a “replica set”. Additionally, note that MongoDB replication is asynchronous. Secondaries are responsible for pulling data off other members that are further ahead of them, and announcing their progress. Primaries are not responsible for pushing data to secondaries to ensure that secondaries are up to date.
Now let’s introduce the other three components.
Elections and failover
One design goal for MongoDB replication is to be highly available. That means should the primary become unavailable, be it due to a crash, getting disconnected from the network, or some other reason, the rest of the replica set ought to notice and work together to select a new primary, so that the system may resume accepting writes. The process by which they arrive at a consensus to select a new primary is called, naturally, a consensus algorithm.
If the primary becomes unreachable by other members of the replica set, the other members will try to hold an “election”, a process by which they choose a new primary to start accepting user writes. A majority of members in the replica set are required to elect a new primary. Note that majority means greater than half, so if we have four members in the replica set, we need three to elect a primary.
Every two seconds, all members exchange information with each other, in what are called heartbeats. The two most important things exchanged as of now are the member’s current oplog position and the fact that a member is still up and responding. In MongoDB, a replica set member (be it primary, secondary, or arbiter) is deemed unreachable if a heartbeat fails.
A typical manner in which an election may take place is as follows:
- The primary loses network connectivity and becomes disconnected from the replica set.
- A secondary sends out heartbeats and doesn’t receive a response from the primary.
- The secondary decides the primary is unavailable and works on getting itself elected.
Another important concept is rollback. Take the following scenario:
- A member P gets some write applied.
- Due to some network partition, a failover happens, and new primary is elected.
- At the time of failover, the write that was just applied to P had not been replicated to any secondary.
- At some point, when P gets reconnected to the set, P will “roll back”, or undo the write it just did, because that write is not part of the new replication stream.
Note that due to the design decision to make MongoDB replication asynchronous, rollback is always a possibility.
Traditionally, in single node databases without replication, users consider data safe if a copy of the write operation is written to some recovery log (MongoDB calls this the journal), and the log is synced to disk. In a highly available system, this is not very effective, because if the primary becomes unavailable and a failover happens, whether the data is in the primary’s recovery log is irrelevant, since it still may be rolled back. What is relevant is whether the data is on the machine that takes over during failover.
MongoDB has a concept of “write concern” that gives the user an acknowledgement that a write has successfully been replicated to a member. So, if we have a replica set of five members, and a user requests a write acknowledgement from 2 members, or, “w : 2”, this means the user is waiting to know that his write has successfully made it to 2 members of the set. A write concern of “w : majority” means the user is waiting to know that his write has successfully made it to a majority of members.
While users can choose along a spectrum of choices for write concern (Eliot Horowitz, CTO of MongoDB, has a good presentation on this topic), the write concern that we are most interested in is “majority”. Theoretically, if a write has been acknowledged by a majority of the replica set, then any member that gets elected ought to include that write. Here is why:
- Any successful election will be run by a majority of the replica set.
- Elections ought to elect the member that has replicated the most data.
- If a write has been acknowledged by a majority of the replica set, then SOME member participating in the successful election has replicated and acknowledged the write. Therefore, anyone who gets elected ought to have replicated the write.
As we will show in future blogs, this is not currently the case. Ensuring this property is Ark’s biggest goal.
With this, the basic components that make up MongoDB replication and failover have been described. In my next post, I will start digging into the existing consensus algorithm used by MongoDB and TokuMX, and describe the problems it causes.