MongoDB high availability is essential to ensure reliability, customer satisfaction, and business resilience in an increasingly interconnected and always-on digital environment. Ensuring high availability for database systems introduces complexity, as databases are stateful applications. Adding a new operational node to a cluster can take hours or even days, depending on the dataset size. This is why having the correct topology is crucial.
In this blog post, we review the minimal topology needed to get as close as possible to five-nines availability. We also show how to deploy a multi-region MongoDB replica set that stays highly available even during a total outage in one region, which means distributing the nodes so the cluster can always maintain a majority.
Two-region topology
Let us start with a 3-node replica set with the default configuration:
(rs0:PRIMARY) admin > rs.status().members.forEach(member => {print('host: ' + member.name + ' stateStr: ' + member.stateStr )})
host: localhost:27000 stateStr: PRIMARY
host: localhost:27001 stateStr: SECONDARY
host: localhost:27002 stateStr: SECONDARY
The nodes will be distributed as follows:
Region A
localhost:27000
localhost:27001
Region B
localhost:27002
The current status of the cluster should show the following settings:
(rs0:PRIMARY) admin > rs.status()
{
  set: 'rs0',
  ...
  majorityVoteCount: 2,
  votingMembersCount: 3,
  ...
}
Now, let’s simulate Region A becoming unavailable. This can be done either by killing the processes or by rejecting network traffic to and from the nodes; the outcome is the same.
- Killing the processes
shell> ps ax -o pid,cmd | grep mongod
shell> 22073 /usr/bin/mongod --replSet rs0 --dbpath /home/vagrant/rs0/rs1/db --logpath /home/vagrant/rs0/rs1/mongod.log --port 27000 --fork --keyFile /home/vagrant/keyfile --wiredTigerCacheSizeGB 1
shell> 22132 /usr/bin/mongod --replSet rs0 --dbpath /home/vagrant/rs0/rs2/db --logpath /home/vagrant/rs0/rs2/mongod.log --port 27001 --fork --keyFile /home/vagrant/keyfile --wiredTigerCacheSizeGB 1
shell> 22191 /usr/bin/mongod --replSet rs0 --dbpath /home/vagrant/rs0/rs3/db --logpath /home/vagrant/rs0/rs3/mongod.log --port 27002 --fork --keyFile /home/vagrant/keyfile --wiredTigerCacheSizeGB 1
shell> kill 22073 22132
- Rejecting network traffic
sudo iptables -A INPUT -p tcp --dport 27000 -j DROP
sudo iptables -A INPUT -p tcp --sport 27000 -j DROP
sudo iptables -A OUTPUT -p tcp --dport 27000 -j DROP
sudo iptables -A OUTPUT -p tcp --sport 27000 -j DROP
sudo iptables -A INPUT -p tcp --dport 27001 -j DROP
sudo iptables -A INPUT -p tcp --sport 27001 -j DROP
sudo iptables -A OUTPUT -p tcp --dport 27001 -j DROP
sudo iptables -A OUTPUT -p tcp --sport 27001 -j DROP
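Once the simulated outage is over (the point we later refer to as Region A becoming available again), these rules must be removed. A minimal cleanup sketch, assuming no other firewall rules matter on the test machine:
# Delete the DROP rules added above (repeat for port 27001 as well)
sudo iptables -D INPUT -p tcp --dport 27000 -j DROP
sudo iptables -D INPUT -p tcp --sport 27000 -j DROP
sudo iptables -D OUTPUT -p tcp --dport 27000 -j DROP
sudo iptables -D OUTPUT -p tcp --sport 27000 -j DROP
# Or, on a disposable test VM only, flush all rules at once
sudo iptables -F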
Now, our cluster has two unreachable nodes:
(rs0:SECONDARY) admin > rs.status().members.forEach(member => {print('host: ' + member.name + ' stateStr: ' + member.stateStr )})
host: localhost:27000 stateStr: (not reachable/healthy)
host: localhost:27001 stateStr: (not reachable/healthy)
host: localhost:27002 stateStr: SECONDARY
Now only one node remains available, and it cannot become primary because the majorityVoteCount can no longer be met. This leaves the cluster unable to accept writes, effectively putting it into a read-only state.
This message should appear in the logs of the remaining node:
{"t":{"$date":"2024-07-19T16:46:33.029-03:00"},"s":"I", "c":"ELECTION", "id":4615655, "ctx":"ReplCoord-26","msg":"Not starting an election, since we are not electable","attr":{"reason":"Not standing for election because I cannot see a majority (mask 0x1)"}}
Since we cannot bring the Region A nodes back up, we must either remove them from the replica set configuration or set their priority and votes properties to 0 until Region A becomes available again.
Reconfiguring the replica set
Removing the nodes with rs.remove() is not possible, because that command must be run on a PRIMARY node. Instead, we must modify the replica set configuration object and force a reconfiguration, either removing the down nodes or setting their priority and votes to 0.
- Removing the nodes
1. First, save the output of rs.conf() to a variable:
(rs0:SECONDARY) admin > cfg = rs.conf()
2. Print the members field (output truncated for readability):
(rs0:SECONDARY) admin > cfg.members
[
  {
    _id: 0,
    host: 'localhost:27000',
    ...
  },
  {
    _id: 1,
    host: 'localhost:27001',
    ...
  },
  {
    _id: 2,
    host: 'localhost:27002',
    ...
  }
]
3. Remove the unreachable nodes from the configuration, keeping only the healthy node:
(rs0:SECONDARY) admin > cfg.members = [{
...   _id: 2,
...   host: 'localhost:27002',
...   ...
... }
... ]
4. Then reconfigure the replica set with the force option:
(rs0:SECONDARY) admin > rs.reconfig(cfg,{force: true})
- Changing the priority and votes settings. This method is less aggressive and makes it easier to bring the nodes back once the outage is over.
1. First, save the output of rs.conf() to a variable:
(rs0:SECONDARY) admin > cfg = rs.conf()
2. Modify the priority and votes values in the cfg variable for the unreachable nodes (members[0] and members[1], i.e., localhost:27000 and localhost:27001):
(rs0:SECONDARY) admin > cfg = rs.conf()
(rs0:SECONDARY) admin > cfg.members[0].priority = 0;
(rs0:SECONDARY) admin > cfg.members[1].priority = 0;
(rs0:SECONDARY) admin > cfg.members[0].votes = 0;
(rs0:SECONDARY) admin > cfg.members[1].votes = 0;
3. Then reconfigure the replica set with the force option:
(rs0:SECONDARY) admin > rs.reconfig(cfg,{force: true})
If we now check the status of the cluster, we will see that the number of votes needed to elect a PRIMARY is now one, and our node has changed its state to PRIMARY:
(rs0:PRIMARY) admin > rs.status()
{
  set: 'rs0',
  ...
  majorityVoteCount: 1,
  votingMembersCount: 1,
  ...
}
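At this point, the remaining node accepts writes again. A quick smoke test, using a hypothetical test collection (the output shown is illustrative):
(rs0:PRIMARY) admin > db.test.insertOne({ ha_check: new Date() })
{ acknowledged: true, insertedId: ObjectId('...') }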
Once Region A becomes available again, we must revert the changes to get our nodes back into the replica set. This can be done using the rs.add() helper or by manually editing the replica set configuration object, as we did before, and reconfiguring the replica set with rs.reconfig(). Keep in mind that, starting with MongoDB 4.4, a single non-forced reconfiguration can add at most one voting member, so when editing the configuration manually the nodes may need to be added back one at a time.
- Using rs.add() to add the nodes back
(rs0:PRIMARY) admin > rs.add('localhost:27000')
(rs0:PRIMARY) admin > rs.add('localhost:27001')
- Reconfiguring the replica set by setting the priority and votes
(rs0:PRIMARY) admin > cfg = rs.conf()
(rs0:PRIMARY) admin > cfg.members[0].priority = 1;
(rs0:PRIMARY) admin > cfg.members[1].priority = 1;
(rs0:PRIMARY) admin > cfg.members[0].votes = 1;
(rs0:PRIMARY) admin > cfg.members[1].votes = 1;
(rs0:PRIMARY) admin > rs.reconfig(cfg)
- Reconfiguring the replica set by adding the configuration for the missing nodes.
(rs0:PRIMARY) admin > cfg = rs.conf()
(rs0:PRIMARY) admin > cfg.members[1] = { _id: 0, host: 'localhost:27000', ... }
(rs0:PRIMARY) admin > cfg.members[2] = { _id: 1, host: 'localhost:27001', ... }
(rs0:PRIMARY) admin > rs.reconfig(cfg)
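Whichever method is used, it is worth confirming afterwards that the members have caught up and are reported as healthy again, reusing the same one-liner from earlier (which node holds PRIMARY may differ depending on priorities and elections):
(rs0:PRIMARY) admin > rs.status().members.forEach(member => {print('host: ' + member.name + ' stateStr: ' + member.stateStr )})
host: localhost:27000 stateStr: SECONDARY
host: localhost:27001 stateStr: SECONDARY
host: localhost:27002 stateStr: PRIMARY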
Three-region topology
So, how do we achieve high availability if a whole region goes down? We need at least three regions, with the nodes distributed across them. That way, if an entire region becomes unavailable, we only lose a minority of nodes, which does not impact the cluster’s availability.
In our new scenario, we have three regions: Region A, Region B, and Region C. Our nodes are distributed as follows (a deployment sketch follows the list):
Region A
localhost:27000
Region B
localhost:27001
Region C
localhost:27002
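For reference, a replica set with this layout could be initiated along the following lines. This is only a minimal sketch: the localhost ports stand in for one node per region, and in a real deployment each member would point to a host in a different region.
rs.initiate({
  _id: 'rs0',
  members: [
    { _id: 0, host: 'localhost:27000' },  // Region A
    { _id: 1, host: 'localhost:27001' },  // Region B
    { _id: 2, host: 'localhost:27002' }   // Region C
  ]
})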
We will simulate a full region outage for Region C:
(rs0:PRIMARY) admin > rs.status().members.forEach(member => {print('host: ' + member.name + ' stateStr: ' + member.stateStr )})
host: localhost:27000 stateStr: PRIMARY
host: localhost:27001 stateStr: SECONDARY
host: localhost:27002 stateStr: (not reachable/healthy)
(rs0:PRIMARY) admin > rs.status()
{
  set: 'rs0',
  ...
  majorityVoteCount: 2,
  votingMembersCount: 3,
  ...
}
Even with one node down, the two remaining nodes still meet the required majorityVoteCount of 2, so the replica set keeps an elected PRIMARY.
In this case, we don’t need to change the topology or the configuration; we can simply wait until Region C becomes available again.
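Because two of the three voting members remain reachable, majority writes keep being acknowledged during the outage. A minimal check, using a hypothetical test collection and an illustrative wtimeout (the output shown is illustrative):
(rs0:PRIMARY) admin > db.test.insertOne({ ping: 1 }, { writeConcern: { w: 'majority', wtimeout: 5000 } })
{ acknowledged: true, insertedId: ObjectId('...') }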
Conclusion
To achieve a MongoDB high-availability cluster that can withstand a full regional outage, we must distribute our replica set members across at least three regions. This ensures that enough members remain to maintain a quorum if an entire region goes down. Automated failover mechanisms, monitoring, regular testing, and regular backups further safeguard against data loss, minimize recovery time, and help ensure the cluster performs reliably under various failure scenarios.
Considerations
It is important to remember that the oplog and flowControl also play an important role in this scenario, although a detailed discussion is out of the scope of this article. A large oplog window helps ensure that the unavailable nodes do not fall off the oplog and require a full initial sync once they come back; as a baseline, an oplog window of at least 24 hours is a good starting point. Alongside this, flowControl should also be monitored to ensure there is no performance degradation on write operations as the replication lag keeps increasing.
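As a starting point, the following commands can be used to keep an eye on both. The oplog size below is only an illustrative value (replSetResizeOplog takes the size in megabytes); pick a size whose window covers at least roughly 24 hours of writes for your workload.
// Show the configured oplog size and the time span it currently covers
(rs0:PRIMARY) admin > rs.printReplicationInfo()
// Resize the oplog if the window is too small (51200 MB = 50 GB, an example value)
(rs0:PRIMARY) admin > db.adminCommand({ replSetResizeOplog: 1, size: 51200 })
// Check whether flow control is throttling writes because of replication lag
(rs0:PRIMARY) admin > db.serverStatus().flowControl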