MongoDB high availability is essential to ensure reliability, customer satisfaction, and business resilience in an increasingly interconnected and always-on digital environment. Ensuring high availability for database systems introduces complexity, as databases are stateful applications. Adding a new operational node to a cluster can take hours or even days, depending on the dataset size. This is why having the correct topology is crucial.
In this blog post, we review the minimal topology needed to get as close as possible to five-nines availability. We also show how to deploy a multi-region MongoDB replica set that stays highly available even during a total outage in one region, which means distributing the nodes so the cluster can always maintain a majority.
Two-region topology
Let us start with a 3-node replica set with the default configuration:
(rs0:PRIMARY) admin > rs.status().members.forEach(member => {print('host: ' + member.name + ' stateStr: ' + member.stateStr )})
host: localhost:27000 stateStr: PRIMARY
host: localhost:27001 stateStr: SECONDARY
host: localhost:27002 stateStr: SECONDARY
The nodes will be distributed as follows:
Region A
localhost:27000
localhost:27001
Region B
localhost:27002
The current status of the cluster should show the following settings:
(rs0:PRIMARY) admin > rs.status()
{
  set: 'rs0',
  ...
  majorityVoteCount: 2,
  votingMembersCount: 3,
  ...
}
Now, let’s simulate Region A becoming unavailable. This can be done either by killing the processes or by rejecting network traffic to and from the nodes; the outcome is the same.
- Killing the processes
shell> ps ax -o pid,cmd | grep mongod
shell> 22073 /usr/bin/mongod --replSet rs0 --dbpath /home/vagrant/rs0/rs1/db --logpath /home/vagrant/rs0/rs1/mongod.log --port 27000 --fork --keyFile /home/vagrant/keyfile --wiredTigerCacheSizeGB 1
shell> 22132 /usr/bin/mongod --replSet rs0 --dbpath /home/vagrant/rs0/rs2/db --logpath /home/vagrant/rs0/rs2/mongod.log --port 27001 --fork --keyFile /home/vagrant/keyfile --wiredTigerCacheSizeGB 1
shell> 22191 /usr/bin/mongod --replSet rs0 --dbpath /home/vagrant/rs0/rs3/db --logpath /home/vagrant/rs0/rs3/mongod.log --port 27002 --fork --keyFile /home/vagrant/keyfile --wiredTigerCacheSizeGB 1
shell> kill 22073 22132
- Rejecting network traffic
sudo iptables -A INPUT -p tcp --dport 27000 -j DROP
sudo iptables -A INPUT -p tcp --sport 27000 -j DROP
sudo iptables -A OUTPUT -p tcp --dport 27000 -j DROP
sudo iptables -A OUTPUT -p tcp --sport 27000 -j DROP
sudo iptables -A INPUT -p tcp --dport 27001 -j DROP
sudo iptables -A INPUT -p tcp --sport 27001 -j DROP
sudo iptables -A OUTPUT -p tcp --dport 27001 -j DROP
sudo iptables -A OUTPUT -p tcp --sport 27001 -j DROP
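Once the simulated outage is over (the point we later refer to as Region A becoming available again), these rules must be removed. A minimal cleanup sketch, assuming no other firewall rules matter on the test machine:
# Delete the DROP rules added above (repeat for port 27001 as well)
sudo iptables -D INPUT -p tcp --dport 27000 -j DROP
sudo iptables -D INPUT -p tcp --sport 27000 -j DROP
sudo iptables -D OUTPUT -p tcp --dport 27000 -j DROP
sudo iptables -D OUTPUT -p tcp --sport 27000 -j DROP
# Or, on a disposable test VM only, flush all rules at once
sudo iptables -F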
Now, our cluster has two unreachable nodes:
(rs0:SECONDARY) admin > rs.status().members.forEach(member => {print('host: ' + member.name + ' stateStr: ' + member.stateStr )})
host: localhost:27000 stateStr: (not reachable/healthy)
host: localhost:27001 stateStr: (not reachable/healthy)
host: localhost:27002 stateStr: SECONDARY
Now only one node remains available, and it cannot become primary because the majorityVoteCount can no longer be met. This leaves the cluster unable to accept writes, effectively putting it into a read-only state.
This message should appear in the logs of the remaining node:
{"t":{"$date":"2024-07-19T16:46:33.029-03:00"},"s":"I", "c":"ELECTION", "id":4615655, "ctx":"ReplCoord-26","msg":"Not starting an election, since we are not electable","attr":{"reason":"Not standing for election because I cannot see a majority (mask 0x1)"}}
Since we cannot bring the Region A nodes back up, we must either remove them from the replica set configuration or set their priority and votes properties to 0 until Region A becomes available again.
Reconfiguring the replica set
Removing the nodes with rs.remove() is not possible, because that command must be run on a PRIMARY node. Instead, we must modify the replica set configuration object and force a reconfiguration, either removing the down nodes or setting their priority and votes to 0.
- Removing the nodes
1. First, save the output of rs.conf() to a variable:
(rs0:SECONDARY) admin > cfg = rs.conf()
2. Print the members field (output truncated for readability):
(rs0:SECONDARY) admin > cfg.members
[
  {
    _id: 0,
    host: 'localhost:27000',
    ...
  },
  {
    _id: 1,
    host: 'localhost:27001',
    ...
  },
  {
    _id: 2,
    host: 'localhost:27002',
    ...
  }
]
3. Remove the unreachable nodes from the configuration, keeping only the healthy node:
(rs0:SECONDARY) admin > cfg.members = [{
...   _id: 2,
...   host: 'localhost:27002',
...   ...
... }
... ]
4. Then reconfigure the replica set with the force option:
(rs0:SECONDARY) admin > rs.reconfig(cfg,{force: true})
- Changing the priority and votes settings. This method is less aggressive and makes it easier to bring the nodes back once the outage is over.
1. First, save the output of rs.conf() to a variable:
(rs0:SECONDARY) admin > cfg = rs.conf()
2. Modify the priority and votes values in the cfg variable for the unreachable nodes (members[0] and members[1], i.e., localhost:27000 and localhost:27001):
(rs0:SECONDARY) admin > cfg = rs.conf()
(rs0:SECONDARY) admin > cfg.members[0].priority = 0;
(rs0:SECONDARY) admin > cfg.members[1].priority = 0;
(rs0:SECONDARY) admin > cfg.members[0].votes = 0;
(rs0:SECONDARY) admin > cfg.members[1].votes = 0;
3. Then reconfigure the replica set with the force option:
(rs0:SECONDARY) admin > rs.reconfig(cfg,{force: true})
If we now check the status of the cluster, we will see that the number of votes needed to elect a PRIMARY is now one, and our node has changed its state to PRIMARY:
(rs0:PRIMARY) admin > rs.status()
{
  set: 'rs0',
  ...
  majorityVoteCount: 1,
  votingMembersCount: 1,
  ...
}
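At this point, the remaining node accepts writes again. A quick smoke test, using a hypothetical test collection (the output shown is illustrative):
(rs0:PRIMARY) admin > db.test.insertOne({ ha_check: new Date() })
{ acknowledged: true, insertedId: ObjectId('...') }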
Once Region A becomes available again, we must revert the changes to get our nodes back into the replica set. This can be done using the rs.add() helper or by manually editing the replica set configuration object, as we did before, and reconfiguring the replica set with rs.reconfig(). Keep in mind that, starting with MongoDB 4.4, a single non-forced reconfiguration can add at most one voting member, so when editing the configuration manually the nodes may need to be added back one at a time.
- Using rs.add() to add the nodes back
(rs0:PRIMARY) admin > rs.add('localhost:27000')
(rs0:PRIMARY) admin > rs.add('localhost:27001')
- Reconfiguring the replica set by setting the priority and votes
(rs0:PRIMARY) admin > cfg = rs.conf()
(rs0:PRIMARY) admin > cfg.members[0].priority = 1;
(rs0:PRIMARY) admin > cfg.members[1].priority = 1;
(rs0:PRIMARY) admin > cfg.members[0].votes = 1;
(rs0:PRIMARY) admin > cfg.members[1].votes = 1;
(rs0:PRIMARY) admin > rs.reconfig(cfg)
- Reconfiguring the replica set by adding the configuration for the missing nodes.
(rs0:PRIMARY) admin > cfg = rs.conf()
(rs0:PRIMARY) admin > cfg.members[1] = { _id: 0, host: 'localhost:27000', ... }
(rs0:PRIMARY) admin > cfg.members[2] = { _id: 1, host: 'localhost:27001', ... }
(rs0:PRIMARY) admin > rs.reconfig(cfg)
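Whichever method is used, it is worth confirming afterwards that the members have caught up and are reported as healthy again, reusing the same one-liner from earlier (which node holds PRIMARY may differ depending on priorities and elections):
(rs0:PRIMARY) admin > rs.status().members.forEach(member => {print('host: ' + member.name + ' stateStr: ' + member.stateStr )})
host: localhost:27000 stateStr: SECONDARY
host: localhost:27001 stateStr: SECONDARY
host: localhost:27002 stateStr: PRIMARY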
Three-region topology
So, how do we achieve high availability if a whole region goes down? We need at least three regions, with the nodes distributed across them. That way, if an entire region becomes unavailable, we only lose a minority of nodes, which does not impact the cluster’s availability.
In our new scenario, we have three regions: Region A, Region B, and Region C. Our nodes are distributed as follows (a deployment sketch follows the list):
Region A
localhost:27000
Region B
localhost:27001
Region C
localhost:27002
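For reference, a replica set with this layout could be initiated along the following lines. This is only a minimal sketch: the localhost ports stand in for one node per region, and in a real deployment each member would point to a host in a different region.
rs.initiate({
  _id: 'rs0',
  members: [
    { _id: 0, host: 'localhost:27000' },  // Region A
    { _id: 1, host: 'localhost:27001' },  // Region B
    { _id: 2, host: 'localhost:27002' }   // Region C
  ]
})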
We will simulate a full region outage for Region C:
(rs0:PRIMARY) admin > rs.status().members.forEach(member => {print('host: ' + member.name + ' stateStr: ' + member.stateStr )})
host: localhost:27000 stateStr: PRIMARY
host: localhost:27001 stateStr: SECONDARY
host: localhost:27002 stateStr: (not reachable/healthy)
(rs0:PRIMARY) admin > rs.status()
{
  set: 'rs0',
  ...
  majorityVoteCount: 2,
  votingMembersCount: 3,
  ...
}
Even with one node down, the two remaining nodes still meet the required majorityVoteCount of 2, so the replica set keeps an elected PRIMARY.
In this case, we don’t need to change the topology or the configuration; we can simply wait until Region C becomes available again.
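Because two of the three voting members remain reachable, majority writes keep being acknowledged during the outage. A minimal check, using a hypothetical test collection and an illustrative wtimeout (the output shown is illustrative):
(rs0:PRIMARY) admin > db.test.insertOne({ ping: 1 }, { writeConcern: { w: 'majority', wtimeout: 5000 } })
{ acknowledged: true, insertedId: ObjectId('...') }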
Conclusion
To achieve a MongoDB high-availability cluster that can withstand a full regional outage, we must distribute our replica set members across at least three regions. This ensures that enough members remain to maintain a quorum if an entire region goes down. Automated failover mechanisms, monitoring, regular testing, and regular backups further safeguard against data loss, minimize recovery time, and help ensure the cluster performs reliably under various failure scenarios.
Considerations
It is important to remember that the oplog and flowControl also play an important role in this scenario, although a detailed discussion is out of the scope of this article. A large oplog window helps ensure that the unavailable nodes do not fall off the oplog and require a full initial sync once they come back; as a baseline, an oplog window of at least 24 hours is a good starting point. Alongside this, flowControl should also be monitored to ensure there is no performance degradation on write operations as the replication lag keeps increasing.
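As a starting point, the following commands can be used to keep an eye on both. The oplog size below is only an illustrative value (replSetResizeOplog takes the size in megabytes); pick a size whose window covers at least roughly 24 hours of writes for your workload.
// Show the configured oplog size and the time span it currently covers
(rs0:PRIMARY) admin > rs.printReplicationInfo()
// Resize the oplog if the window is too small (51200 MB = 50 GB, an example value)
(rs0:PRIMARY) admin > db.adminCommand({ replSetResizeOplog: 1, size: 51200 })
// Check whether flow control is throttling writes because of replication lag
(rs0:PRIMARY) admin > db.serverStatus().flowControl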