MongoDB Replica Set Scenarios and Internals – Part II (Elections)

In this blog post, we will walk through the internals of the election process in MongoDB®, following on from a previous post on the internals of the replica set. You can read Part 1 here.

For this post, I refer to the same configurations we discussed before.

Elections: As the term suggests, the members of a MongoDB replica set “vote”: the individual nodes vote to select the primary member of that replica set.

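Why elections? Elections are how MongoDB maintains high availability: when the primary becomes unavailable, the remaining members elect a new one so the replica set can keep accepting writes.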

When do elections take place?

  1. When a node does not find a primary node within the election timeout. By default this value is 10 seconds, and from MongoDB version 3.2 it can be changed according to your needs. The parameter to set this value is settings.electionTimeoutMillis, and it can be seen in the logs as:

From the mongo shell, the value of electionTimeoutMillis can be found in the replica set configuration as:

More precisely, the value of electionTimeoutMillis can be found at:
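As a minimal sketch, the setting sits under `settings` in the document returned by `rs.conf()`. The object below is a hand-written stand-in for that document (the set name and values are assumptions for illustration, and `electionTimeoutMillis()` is a hypothetical helper), so the lookup can be tried outside a mongo shell.

```javascript
// Stand-in for the document returned by rs.conf() in the mongo shell.
// The set name and values are illustrative assumptions.
const conf = {
  _id: "rs0",
  version: 1,
  protocolVersion: 1,
  settings: { heartbeatIntervalMillis: 2000, electionTimeoutMillis: 10000 },
};

// Hypothetical helper: read the election timeout, falling back to the
// documented default of 10000 ms when the field is absent.
function electionTimeoutMillis(cfg) {
  const settings = cfg.settings || {};
  return typeof settings.electionTimeoutMillis === "number"
    ? settings.electionTimeoutMillis
    : 10000;
}

console.log(electionTimeoutMillis(conf)); // prints 10000

// In a live mongo shell the equivalent lookup and change would be:
//   rs.conf().settings.electionTimeoutMillis
//   cfg = rs.conf(); cfg.settings.electionTimeoutMillis = 12000; rs.reconfig(cfg)
```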

  2. When a node with a higher priority takes over from the existing primary, for example during planned maintenance when member priorities are changed through the replica set configuration. The priority of a member node can be changed as explained here.

The priority of all three members can be seen from the replica set configuration like this:
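As a sketch, each entry in `members` of the `rs.conf()` document carries a `priority` field. The stand-in config below uses assumed hosts and priorities, and `memberPriorities()` and `withPriority()` are hypothetical helpers. It lists the priorities and shows the kind of change a planned-maintenance reconfiguration might make before `rs.reconfig()` is called.

```javascript
// Stand-in for rs.conf(); hosts and priorities are illustrative assumptions.
const replConf = {
  _id: "rs0",
  version: 1,
  members: [
    { _id: 0, host: "localhost:25001", priority: 1 },
    { _id: 1, host: "localhost:25002", priority: 1 },
    { _id: 2, host: "localhost:25003", priority: 1 },
  ],
};

// List host and priority for every member.
function memberPriorities(cfg) {
  return cfg.members.map((m) => ({ host: m.host, priority: m.priority }));
}

// Return a copy of the config with one member's priority changed.
// The original config object is left untouched.
function withPriority(cfg, host, priority) {
  return {
    ...cfg,
    members: cfg.members.map((m) => (m.host === host ? { ...m, priority } : m)),
  };
}

const updated = withPriority(replConf, "localhost:25001", 2);
console.log(memberPriorities(updated));

// In a live mongo shell the same change would be:
//   cfg = rs.conf(); cfg.members[0].priority = 2; rs.reconfig(cfg)
```

A member with a higher priority than the current primary can then call an election to take over, which is the second trigger described above.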

How do elections work in a MongoDB replica set cluster?

Before a real election, the candidate node runs a dry election. Dry election? Yes, the node first runs a dry election, and only if it wins that does the actual election begin. Here’s how:

  1. The candidate node asks every other node whether it would vote for it, through replSetRequestVotes, without incrementing its own term.
  2. The primary node steps down if it sees that the candidate node's term is higher than its own. Otherwise the dry election fails, and the replica set continues to run as it did before.
  3. If the dry election succeeds, then an actual election begins.
  4. For the real election, the node increments its term and then votes for itself.
  5. The VoteRequester sends the replSetRequestVotes command through the ScatterGatherRunner, and each node responds with its vote.
  6. The candidate that receives votes from a majority of the voting nodes wins the election.
  7. Once the candidate wins, it transitions to primary. Through heartbeats it notifies all other nodes.
  8. Then the candidate node checks if it needs to catch up from the former primary node.
  9. The node that receives the replSetRequestVotes command checks its own term and then votes, but only after the ReplicationCoordinator receives confirmation from the TopologyCoordinator.
  10. The TopologyCoordinator grants the vote after the following considerations:
    1. The config version must match.
    2. The replica set name must match.
    3. An arbiter voter must not see any healthy primary of greater or equal priority.
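The steps above can be sketched as a toy model. This is not MongoDB's implementation: the node objects, field names, and the `grantVote`, `requestVotes` and `runElection` helpers are assumptions made for illustration, and the model ignores heartbeats, arbiters, priorities and catch-up. It only shows the dry run without a term bump, the term increment and self-vote in the real election, and the config version, set name and term checks a voter applies.

```javascript
// Toy model of the dry-run + real election flow. All names are hypothetical.
function makeNode(name, term) {
  return { name, setName: "rs0", configVersion: 1, term, votedInTerm: -1 };
}

// Voter side: decide whether to grant a vote for a request.
function grantVote(voter, req) {
  if (req.configVersion !== voter.configVersion) return false; // config version must match
  if (req.setName !== voter.setName) return false;             // replica set name must match
  if (req.term < voter.term) return false;                     // candidate term is stale
  if (!req.dryRun) {
    if (voter.votedInTerm === req.term) return false;          // at most one vote per term
    voter.term = req.term;                                     // adopt the new term
    voter.votedInTerm = req.term;                              // record the vote
  }
  return true;
}

// Candidate side: ask every other node for a vote and check for a majority.
function requestVotes(candidate, others, term, dryRun) {
  const req = {
    setName: candidate.setName,
    configVersion: candidate.configVersion,
    term,
    dryRun,
  };
  let votes = 1; // the candidate counts its own vote
  for (const voter of others) {
    if (grantVote(voter, req)) votes += 1;
  }
  const total = others.length + 1;
  return votes > total / 2; // strict majority of all voting members
}

function runElection(candidate, others) {
  // Dry run: uses the current term, and nothing is recorded on the voters.
  if (!requestVotes(candidate, others, candidate.term, true)) return "dry run failed";
  // Real election: increment the term, vote for self, then ask again.
  candidate.term += 1;
  candidate.votedInTerm = candidate.term;
  return requestVotes(candidate, others, candidate.term, false)
    ? "primary"
    : "election failed";
}

// Three members named after assumed ports, all starting in term 1.
const a = makeNode("25001", 1);
const b = makeNode("25002", 1);
const c = makeNode("25003", 1);
const outcome = runElection(a, [b, c]);
console.log(outcome, a.term); // prints "primary 2"
```

Running the dry run first means a candidate that cannot win never increments the term, so a failed attempt leaves the rest of the set undisturbed.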

An example

The primary (port:25002) transitions to secondary after receiving the rs.stepDown() command.

The dry election at the candidate node (port:25001) succeeds: no primary is found.