Where the open source community meets: Secure your spot for Percona Live Amsterdam! - Register

Downloads

Blog

MongoDB Rollback in replicaset

May 1, 2018

Author

Vinodh Krishnaswamy

Insight for DBAs

Insight for Developers

MongoDB

Share this Post:

In this blog post, we’ll look at how MongoDB rollback works during replicaset failovers.

In recent versions, MongoDB has provided lots of features related to replicaset and automatic failover. When it comes to failover, the next question that arises is “How does MongoDB ROLLBACK work during replicaset failover?”

If a PRIMARY member (say node A) stepped down with some data writes that were executed but not replicated to the SECONDARY members yet, then a ROLLBACK occurs on the former PRIMARY A when it rejoins the replicaset. I’ll explain below how the ROLLBACK works!

ROLLBACK Scenario:

ROLLBACK is rare in a replicaset as MongoDB tries to avoid it by replicating the operations from PRIMARY to SECONDARY without delay, under normal conditions. Most of the time ROLLBACK occurs in the event of network partitioning, or if SECONDARY members can’t keep up with the throughput of operations on the former PRIMARY.

ROLLBACK Process:

We will see the process with a test. I have used Docker for this test with the MongoDB 3.2 Jessie version to setup a replicaset with members mongo1 – A, mongo2 – B, mongo3 – C and set Priority 10 to A. Now A is PRIMARY as expected in the replicaset. We need to write some data into A and create a network partition scenario with B and C at the same time. For that, I inserted 25000 documents into A and made it out of network at the same time.

Terminal 1 (A’s mongo prompt):

my-mongo-set:PRIMARY> for (var i = 1; i <= 25000; i++) {
...    db.testData.insert( { x : i } )
... }
WriteResult({ "nInserted" : 1 })
my-mongo-set:PRIMARY> db.testD2018-03-30T17:34:51.455+0530 I NETWORK  [thread1] trying reconnect to 127.0.0.1:30001 (127.0.0.1) failed
2018-03-30T17:34:51.464+0530 I NETWORK  [thread1] reconnect 127.0.0.1:30001 (127.0.0.1) ok
                      db.testD
admin.testD
my-mongo-set:SECONDARY> rs.slaveOk()
my-mongo-set:SECONDARY> db.testData.count()
25000

my-mongo-set:PRIMARY> for (var i = 1; i <= 25000; i++) {

... db.testData.insert( { x : i } )

... }

WriteResult({ "nInserted" : 1 })

my-mongo-set:PRIMARY> db.testD2018-03-30T17:34:51.455+0530 I NETWORK [thread1] trying reconnect to 127.0.0.1:30001 (127.0.0.1) failed

2018-03-30T17:34:51.464+0530 I NETWORK [thread1] reconnect 127.0.0.1:30001 (127.0.0.1) ok

db.testD

admin.testD

my-mongo-set:SECONDARY> rs.slaveOk()

my-mongo-set:SECONDARY> db.testData.count()

25000

Terminal2:

Vinodhs-MBP:~ vinodhkrish$ docker ps
CONTAINER ID        IMAGE                       COMMAND                  CREATED             STATUS              PORTS                           NAMES
b27d82ac2439        mongo:3.2.19-jessie         "docker-entrypoint.s…"   2 days ago          Up 1 days           0.0.0.0:30003->27017/tcp        mongo3
2b39f9e41973        mongo:3.2.19-jessie         "docker-entrypoint.s…"   2 days ago          Up 1 days           0.0.0.0:30002->27017/tcp        mongo2
105b6df757d7        mongo:3.2.19-jessie         "docker-entrypoint.s…"   2 days ago          Up 1 days           0.0.0.0:30001->27017/tcp        mongo1
Vinodhs-MBP:~ vinodhkrish$ docker network disconnect my-mongo-cluster mongo1

Vinodhs-MBP:~ vinodhkrish$ docker ps

CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES

b27d82ac2439 mongo:3.2.19-jessie "docker-entrypoint.s…" 2 days ago Up 1 days 0.0.0.0:30003->27017/tcp mongo3

2b39f9e41973 mongo:3.2.19-jessie "docker-entrypoint.s…" 2 days ago Up 1 days 0.0.0.0:30002->27017/tcp mongo2

105b6df757d7 mongo:3.2.19-jessie "docker-entrypoint.s…" 2 days ago Up 1 days 0.0.0.0:30001->27017/tcp mongo1

Vinodhs-MBP:~ vinodhkrish$ docker network disconnect my-mongo-cluster mongo1

The member A has now become as SECONDARY, because it couldn’t reach other members in the replicaset. On the other side, B and C members see that A is not reachable and then B is elected as PRIMARY. We could see that some inserts from former A replicated to B before the network split happens.

(B node)

my-mongo-set:PRIMARY> db.testData.count()
15003

1 2	my-mongo-set:PRIMARY> db.testData.count() 15003

Now do some write operations in current PRIMARY – B and then let node A join the network back by joining the container back to the bridge network. You can observe below that the node A’s member states are changing in the mongo prompt. (I just connected to A and pressed ENTER/RETURN button many times to see the member states, or you can see them in the log file):

(A node)

Vinodhs-MacBook-Pro:mongodb-osx-x86_64-3.2.19 vinodhkrish$ ./bin/mongo 127.0.0.1:30001/admin
MongoDB shell version: 3.2.19
connecting to: 127.0.0.1:30001/admin
my-mongo-set:ROLLBACK> 
my-mongo-set:RECOVERING> 
my-mongo-set:SECONDARY> 
my-mongo-set:SECONDARY> 
my-mongo-set:PRIMARY>

Vinodhs-MacBook-Pro:mongodb-osx-x86_64-3.2.19 vinodhkrish$ ./bin/mongo 127.0.0.1:30001/admin

MongoDB shell version: 3.2.19

connecting to: 127.0.0.1:30001/admin

my-mongo-set:ROLLBACK>

my-mongo-set:RECOVERING>

my-mongo-set:SECONDARY>

my-mongo-set:PRIMARY>

ROLLBACK Internal

From MongoDB point of view, we will see the replicaset process to understand what happened above. Normally the SECONDARY member syncs the oplog entries from its syncSource (the member from where the data is replicated) by using oplogFetcher. The OplogFetcher first sends a find() command to the syncSource’s oplog, and then follows with a series of getMores on the cursor. When node A rejoins the replicaset, node A’s oplogFetcher first sends find() command to syncSource node B and check it has a greater than or equal predicate on the timestamp of the last oplog entry it has fetched. Usually the find() command should at least return one doc due to the greater than or equal predicate. If not, it means that the syncSource is behind and so it will not replicate from it and look for other syncSource.

In this case, A’s oplogFetcher sees that the first document returned from node B does not match the last entry in its oplog. That means node A’s oplog has diverged from node B’s and it should go into ROLLBACK.

Node A first finds the common point between its oplog and its syncSource B’s oplog. It then goes through all of the operations in its oplog back to the common point and figures out how to undo them. Here, 9997 inserts are missed from B and C nodes, and so these documents will be recovered from A’s oplog.

2018-03-30T12:08:37.160+0000 I REPL     [rsBackgroundSync] <strong>Starting rollback due to OplogStartMissing</strong>: our last op time fetched: (term: 4, timestamp: Mar 30 12:03:52:139). source's GTE: (term: 5, timestamp: Mar 30 12:05:37:1) hashes: (3789163619674410187/3226093795606474294)
2018-03-30T12:08:37.160+0000 I REPL     [rsBackgroundSync] rollback 0
2018-03-30T12:08:37.160+0000 I REPL     [ReplicationExecutor] transition to ROLLBACK
2018-03-30T12:08:37.163+0000 I REPL     [rsBackgroundSync] beginning rollback
2018-03-30T12:08:37.163+0000 I REPL     [rsBackgroundSync] rollback 1
2018-03-30T12:08:37.164+0000 I REPL     [rsBackgroundSync] <strong>rollback 2 FindCommonPoint</strong>
2018-03-30T12:08:37.166+0000 I REPL     [rsBackgroundSync] rollback our last optime:   Mar 30 12:03:52:139
2018-03-30T12:08:37.166+0000 I REPL     [rsBackgroundSync] rollback their last optime: Mar 30 12:08:17:1c5
2018-03-30T12:08:37.166+0000 I REPL     [rsBackgroundSync] rollback diff in end of log times: -265 seconds
2018-03-30T12:08:37.269+0000 I REPL     [rsBackgroundSync] <strong>rollback common point is (term: 4, timestamp: Mar 30 12:03:46:d2)</strong>
2018-03-30T12:08:37.269+0000 I REPL     [rsBackgroundSync] rollback 3 fixup
2018-03-30T12:08:38.240+0000 I REPL     [rsBackgroundSync] rollback 3.5
2018-03-30T12:08:38.240+0000 I REPL     [rsBackgroundSync] Setting minvalid to (term: 5, timestamp: Mar 30 12:08:17:1c5)
2018-03-30T12:08:38.241+0000 I REPL     [rsBackgroundSync] rollback 4 n:1
2018-03-30T12:08:38.241+0000 I REPL     [rsBackgroundSync] rollback 4.6
2018-03-30T12:08:38.241+0000 I REPL     [rsBackgroundSync] rollback 4.7
2018-03-30T12:08:38.391+0000 I REPL     [rsBackgroundSync] rollback 5 <strong>d:9997</strong> u:0
2018-03-30T12:08:38.391+0000 I REPL     [rsBackgroundSync] rollback 6
2018-03-30T12:08:38.394+0000 I REPL     [rsBackgroundSync] rollback done
2018-03-30T12:08:38.396+0000 I REPL     [rsBackgroundSync] rollback finished

2018-03-30T12:08:37.160+0000 I REPL [rsBackgroundSync] Starting rollback due to OplogStartMissing: our last op time fetched: (term: 4, timestamp: Mar 30 12:03:52:139). source's GTE: (term: 5, timestamp: Mar 30 12:05:37:1) hashes: (3789163619674410187/3226093795606474294)

2018-03-30T12:08:37.160+0000 I REPL [rsBackgroundSync] rollback 0

2018-03-30T12:08:37.160+0000 I REPL [ReplicationExecutor] transition to ROLLBACK

2018-03-30T12:08:37.163+0000 I REPL [rsBackgroundSync] beginning rollback

2018-03-30T12:08:37.163+0000 I REPL [rsBackgroundSync] rollback 1

2018-03-30T12:08:37.164+0000 I REPL [rsBackgroundSync] rollback 2 FindCommonPoint

2018-03-30T12:08:37.166+0000 I REPL [rsBackgroundSync] rollback our last optime: Mar 30 12:03:52:139

2018-03-30T12:08:37.166+0000 I REPL [rsBackgroundSync] rollback their last optime: Mar 30 12:08:17:1c5

2018-03-30T12:08:37.166+0000 I REPL [rsBackgroundSync] rollback diff in end of log times: -265 seconds

2018-03-30T12:08:37.269+0000 I REPL [rsBackgroundSync] rollback common point is (term: 4, timestamp: Mar 30 12:03:46:d2)

2018-03-30T12:08:37.269+0000 I REPL [rsBackgroundSync] rollback 3 fixup

2018-03-30T12:08:38.240+0000 I REPL [rsBackgroundSync] rollback 3.5

2018-03-30T12:08:38.240+0000 I REPL [rsBackgroundSync] Setting minvalid to (term: 5, timestamp: Mar 30 12:08:17:1c5)

2018-03-30T12:08:38.241+0000 I REPL [rsBackgroundSync] rollback 4 n:1

2018-03-30T12:08:38.241+0000 I REPL [rsBackgroundSync] rollback 4.6

2018-03-30T12:08:38.241+0000 I REPL [rsBackgroundSync] rollback 4.7

2018-03-30T12:08:38.391+0000 I REPL [rsBackgroundSync] rollback 5 d:9997 u:0

2018-03-30T12:08:38.391+0000 I REPL [rsBackgroundSync] rollback 6

2018-03-30T12:08:38.394+0000 I REPL [rsBackgroundSync] rollback done

2018-03-30T12:08:38.396+0000 I REPL [rsBackgroundSync] rollback finished

ROLLBACK data

Where would these 9997 recovered documents go? MongoDB writes these ROLLBACK documents under the rollback directory in the dbpath. These recovered collections are named with namespace as the prefix and the date time as the suffix in their names. These are in BSON format, and we need to convert into JSON to analyze them so the plan for the next course of action can be done. In our case, the testData collection’s rollback data are as follows:

root@105b6df757d7:/# cd /data/db
root@105b6df757d7:/data/db# <strong>ls -l rollback/</strong>
total 324K
-rw-r--r-- 1 mongodb mongodb 323K Mar 30 12:08 admin.testData.2018-03-30T12-08-38.0.bson

root@105b6df757d7:/# cd /data/db

root@105b6df757d7:/data/db# ls -l rollback/

total 324K

-rw-r--r-- 1 mongodb mongodb 323K Mar 30 12:08 admin.testData.2018-03-30T12-08-38.0.bson

root@105b6df757d7:/data/db/rollback# <strong>bsondump</strong> admin.testData.2018-03-30T12-08-38.0.bson > rollback.json
2018-03-30T12:13:00.033+0000 9997 objects found
root@105b6df757d7:/data/db/rollback# head rollback.json 
{"_id":{"$oid":"5abe279f97044083811b5975"},"x":15004.0}
{"_id":{"$oid":"5abe279f97044083811b5976"},"x":15005.0}
{"_id":{"$oid":"5abe279f97044083811b5977"},"x":15006.0}
{"_id":{"$oid":"5abe279f97044083811b5978"},"x":15007.0}
{"_id":{"$oid":"5abe279f97044083811b5979"},"x":15008.0}
{"_id":{"$oid":"5abe279f97044083811b5980"},"x":15009.0}
{"_id":{"$oid":"5abe279f97044083811b5981"},"x":15010.0}

root@105b6df757d7:/data/db/rollback# bsondump admin.testData.2018-03-30T12-08-38.0.bson > rollback.json

2018-03-30T12:13:00.033+0000 9997 objects found

root@105b6df757d7:/data/db/rollback# head rollback.json

{"_id":{"$oid":"5abe279f97044083811b5975"},"x":15004.0}

{"_id":{"$oid":"5abe279f97044083811b5976"},"x":15005.0}

{"_id":{"$oid":"5abe279f97044083811b5977"},"x":15006.0}

{"_id":{"$oid":"5abe279f97044083811b5978"},"x":15007.0}

{"_id":{"$oid":"5abe279f97044083811b5979"},"x":15008.0}

{"_id":{"$oid":"5abe279f97044083811b5980"},"x":15009.0}

{"_id":{"$oid":"5abe279f97044083811b5981"},"x":15010.0}

That’s it? Now check the counts of the testData collection in node A:

my-mongo-set:PRIMARY> db.testData.count()
<strong>15003</strong>

1 2	my-mongo-set:PRIMARY> db.testData.count() <strong>15003</strong>

So the records 9997 which were rollbacked into the rollback directory would also be dropped from the collection. This ensures the data consistency throughout the replicaset.

How to avoid ROLLBACK – writeConcern

The default writeConcern in the replicaSet is w:1., i.e., When a client writes into a replicaSet, then it receives an acknowledgment from the PRIMARY alone and won’t wait for SECONDARY members’ acknowledgment. If you want to avoid the ROLLBACK scenario in your environment, then you have to use the {w:majority} or {w:n}, where 1 > n <= (no. of members in your replica set). This ensures that the writes are propagated to so many members of the replica set before sending the acknowledgment to the client. This solves the problem of ROLLBACK.

But please be careful that you are not giving higher value to writeConcern, because it also affects the write performance. The acknowledgment needs to be received from the number of members mentioned in the value. The value {w:majority} provides the acknowledgement that write operations have propagated to the majority of voting nodes, including the primary and is suitable for most of the environments.

ROLLBACK – Limitation

The main thing to note here is that mongod will not rollback more than 300MB data. In such cases, we need to manually check the instance to recover the data. You can see the below message in mongod.log in such cases:

[replica set sync] replSet syncThread: 13410 replSet too much data to roll back

				1

						[replica set sync] replSet syncThread: 13410 replSet too much data to roll back

Understanding this simple ROLLBACK background helps us to decide what needs to be done with the rollbacked data. It also helps us avoid such scenarios, because data is data and is very important!