Mistakes can happen. If only we could go back in time to the very second before that mistake was made.
Act 1: The Disaster
Plain text version for those who cannot run the asciicast above:
akira@perc01:/data$ #OK, let's get this party started!
akira@perc01:/data$ # The frontend has been shut down for 20 mins so they can
akira@perc01:/data$ # update that part, and I can update the schema in the
akira@perc01:/data$ # backend simultaneously.
akira@perc01:/data$ #Easy-peasy ...
akira@perc01:/data$ date
Tue Jul  2 13:34:09 JST 2019
akira@perc01:/data$ #Just set my auth details. (NO PEEKING!)
akira@perc01:/data$ conn_args="--host localhost:27017 --username akira --password secret --authenticationDatabase admin"
akira@perc01:/data$ mongo ${conn_args} --quiet
testrs:PRIMARY> use payments
switched to db payments
testrs:PRIMARY> show collections
TheImportantCollection
testrs:PRIMARY> //Ah, there it is. Time to work!
testrs:PRIMARY> db.TheImportantCollection.count()
174662
testrs:PRIMARY> db.TheImportantCollection.findOne()
{
    "_id" : 0,
    "customer" : {
        "fn" : "Smith",
        "gn" : "Ken",
        "city" : "Georgevill",
        "street1" : "1 Wishful St.",
        "postcode" : "45031"
    },
    "order_ids" : [ ]
}
testrs:PRIMARY> //Ah, there it is. The "customer" object that has the
testrs:PRIMARY> //address fields in it. We're going to move those out.
testrs:PRIMARY> //Copy the whole collection, adding the new "addresses" array
testrs:PRIMARY> var counter = 0;
testrs:PRIMARY> db.TheImportantCollection.find().forEach(function(d) {
...   d["adresses"] = [ ];
...   db.TheImportantCollectionV2.insert(d);
...   counter += 1;
...   if (counter % 25000 == 0) { print(counter + " updates done"); }
... });
25000 updates done
50000 updates done
75000 updates done
100000 updates done
125000 updates done
150000 updates done
testrs:PRIMARY> //Cool. Let's look at the temp table
testrs:PRIMARY> db.TheImportantCollectionV2.findOne()
{
    "_id" : 0,
    "customer" : {
        "fn" : "Smith",
        "gn" : "Ken",
        "city" : "Georgevill",
        "street1" : "1 Wishful St.",
        "postcode" : "45031"
    },
    "order_ids" : [ ],
    "adresses" : [ ]
}
testrs:PRIMARY> //?AH!!
testrs:PRIMARY> //typo. I misspelled "addresses".
testrs:PRIMARY> //I'll just drop this and go again
testrs:PRIMARY> db.TheImportantCollectionV2.remove({})
WriteResult({ "nRemoved" : 174662 })
testrs:PRIMARY> //ooops. Why did I bother deleting the docs?
testrs:PRIMARY> //I need to *drop* the collection
testrs:PRIMARY> db.TheImportantCollection.drop()
true
testrs:PRIMARY> //!!!!
testrs:PRIMARY> //Wait!
testrs:PRIMARY> show collections
TheImportantCollectionV2
testrs:PRIMARY> //...
testrs:PRIMARY> //I've done a bad thing ....
testrs:PRIMARY> //Let me see
testrs:PRIMARY> //in the oplog
testrs:PRIMARY> use local
switched to db local
testrs:PRIMARY> db.oplog.rs.findOne({"o.drop": "TheImportantCollection"})
{
    "ts" : Timestamp(1562042272, 1),
    "t" : NumberLong(6),
    "h" : NumberLong("6726633412398410781"),
    "v" : 2,
    "op" : "c",
    "ns" : "payments.$cmd",
    "ui" : UUID("abc9c1f9-71c0-45ea-aeba-ea239b975a95"),
    "wall" : ISODate("2019-07-02T04:37:52.171Z"),
    "o" : {
        "drop" : "TheImportantCollection"
    }
}
testrs:PRIMARY> //AH. 1562042272, you are the worst unix epoch second of my
testrs:PRIMARY> // life.
testrs:PRIMARY>
Act 2: Time travel with a Snapshot restore + Oplog replay
Plain text version for those who cannot run the asciicast above:
akira@perc01:/data$ #OK, OK, this is bad. I dropped TheImportantCollection
akira@perc01:/data$ #Breathe. Breathe Akira.
akira@perc01:/data$ #Right! Backups!
akira@perc01:/data$ #I have backups!
akira@perc01:/data$ ls /backups/
20190624_2300  20190626_2300  20190628_2300
20190625_2300  20190627_2300  20190629_2300
akira@perc01:/data$ #OK, I have one from 23:00 JST ... which is a while ago.
akira@perc01:/data$ #I can use the latest backup, then roll forward from
akira@perc01:/data$ # there using this neat thing you can do with
akira@perc01:/data$ #  mongorestore (the standard mongo utils command)
akira@perc01:/data$ #You can replay a dumped oplog bson file
akira@perc01:/data$ # on a primary like it was receiving it as a secondary
akira@perc01:/data$ #Just as a secondary can catch up from a primary so
akira@perc01:/data$ # far as the oplog window of time goes, a primary can
akira@perc01:/data$ # be given an oplog history to replay, using this 'trick'
akira@perc01:/data$ #(Not really a trick, but let's call it that)
akira@perc01:/data$
akira@perc01:/data$ #
akira@perc01:/data$ #But, before doing ANYTHING with the backups,
akira@perc01:/data$ # get a full dump of the oplog of the *live* replicaset
akira@perc01:/data$ # first
akira@perc01:/data$ conn_args="--host localhost:27017 --username akira --password secret --authenticationDatabase admin"
akira@perc01:/data$ mongodump ${conn_args} -d local -c oplog.rs --out /data/oplog_dump_full
2019-07-02T13:50:02.713+0900    writing local.oplog.rs to
2019-07-02T13:50:03.635+0900    done dumping local.oplog.rs (825815 documents)
akira@perc01:/data$ #Oh wait.
akira@perc01:/data$ #We *do* need a trick
akira@perc01:/data$ #v3.6 and v4.0 added some system collections that cause
akira@perc01:/data$ # mongorestore to fail, no matter what we do.
akira@perc01:/data$ # This is just a 3.6 and 4.0 issue hopefully, but 4.2's
akira@perc01:/data$ #  behaviour is not known at this date.
akira@perc01:/data$ #I'll do the dump again, removing these two collections
akira@perc01:/data$ mongodump ${conn_args} -d local -c oplog.rs \
> --query '{"ns": {"$nin": ["config.system.sessions", "config.cache.collections"]}}' --out /data/oplog_dump_full
2019-07-02T13:52:08.841+0900    writing local.oplog.rs to
2019-07-02T13:52:10.010+0900    done dumping local.oplog.rs (825781 documents)
akira@perc01:/data$ #So that was Trick #1. Removing those 2 specific
akira@perc01:/data$ # config.* collections.
akira@perc01:/data$ #Now for Trick #2
akira@perc01:/data$ #mongodump puts the dumped oplog.rs.bson file in subdirectory "local"
akira@perc01:/data$ # like that is a whole DB to restore. But you don't do a restore of
akira@perc01:/data$ # local like any other DB, it doesn't work like that.
akira@perc01:/data$ #So we MUST get rid of the subdirectory structure and just keep the
akira@perc01:/data$ # single *.bson file
akira@perc01:/data$ ls -lR /data/oplog_dump_full/
/data/oplog_dump_full/:
total 146032
drwxr-xr-x 2 akira akira        57 Jul  2 13:50 local
-rw-r--r-- 1 akira akira 149534510 Jul  2 10:26 oplog.rs.bson

/data/oplog_dump_full/local:
total 233008
-rw-r--r-- 1 akira akira 238596091 Jul  2 13:52 oplog.rs.bson
-rw-r--r-- 1 akira akira       120 Jul  2 13:52 oplog.rs.metadata.json
akira@perc01:/data$ mv /data/oplog_dump_full/local/oplog.rs.bson /data/oplog_dump_full/
akira@perc01:/data$ rm -rf /data/oplog_dump_full/local
akira@perc01:/data$ ls -lR /data/oplog_dump_full/
/data/oplog_dump_full/:
total 233004
-rw-r--r-- 1 akira akira 238596091 Jul  2 13:52 oplog.rs.bson
akira@perc01:/data$ #OK.
akira@perc01:/data$ #Now let's look at this oplog. Does it go back as far as
akira@perc01:/data$ # the latest backup snapshot or more?
akira@perc01:/data$ ls /backups/ | tail -n 1
20190629_2300
akira@perc01:/data$ #By the way that is my JST timezone, not UTC
akira@perc01:/data$ #let's see ... check the bson file's first timestamp
akira@perc01:/data$ bsondump /data/oplog_dump_full/oplog.rs.bson 2>/dev/null | head -n 1
{"ts":{"$timestamp":{"t":1561727517,"i":1}},"h":{"$numberLong":"212971303912007811"},"v":2,"op":"n","ns":"","wall":{"$date":"2019-06-28T13:11:57.633Z"},"o":{"msg":"initiating set"}}
akira@perc01:/data$ #I see the epoch timestamp there: 1561727517
akira@perc01:/data$ date -d @1561727517
Fri Jun 28 22:11:57 JST 2019
akira@perc01:/data$ #Ah, good, that's before 20190629_2300
akira@perc01:/data$ #We can do an oplog replay
akira@perc01:/data$ #Just for sanity's sake let's look for that "drop"
akira@perc01:/data$ #  command that is the disaster we want to avoid replaying
akira@perc01:/data$ bsondump /data/oplog_dump_full/oplog.rs.bson 2>/dev/null | grep drop | grep '\bTheImportantCollection\b' | tail -n 1
{"ts":{"$timestamp":{"t":1562042272,"i":1}},"t":{"$numberLong":"6"},"h":{"$numberLong":"6726633412398410781"},"v":2,"op":"c","ns":"payments.$cmd","ui":{"$binary":"q8nB+XHARequuuojm5dalQ==","$type":"04"},"wall":{"$date":"2019-07-02T04:37:52.171Z"},"o":{"drop":"TheImportantCollection"}}
akira@perc01:/data$ #Let's see, it was 1562042272, the worst epoch second of my
akira@perc01:/data$ # life. Let's not go there again!
akira@perc01:/data$ #Time to shut the live replicaset down, restore a snapshot
akira@perc01:/data$ # backup from 20190629_2300
akira@perc01:/data$ ps -C mongod -o pid,args
  PID COMMAND
18119 mongod -f /data/n1/mongod.conf
18195 mongod -f /data/n2/mongod.conf
18225 mongod -f /data/n3/mongod.conf
akira@perc01:/data$ kill 18119 18195 18225
akira@perc01:/data$ ps -C mongod -o pid,args
  PID COMMAND
18119 mongod -f /data/n1/mongod.conf
akira@perc01:/data$ ps -C mongod -o pid,args
  PID COMMAND
18119 mongod -f /data/n1/mongod.conf
akira@perc01:/data$ ps -C mongod -o pid,args
  PID COMMAND
18119 mongod -f /data/n1/mongod.conf
akira@perc01:/data$ ps -C mongod -o pid,args
  PID COMMAND
akira@perc01:/data$ #OK, shutdown
akira@perc01:/data$ /data/dba_scripts/our_restore_script.sh
usage: /data/dba_scripts/our_restore_script.sh XXXXXX
Choose one of these subdirectory names from /backups/:
  20190624_2300
  20190625_2300
  20190626_2300
  20190627_2300
  20190628_2300
  20190629_2300
akira@perc01:/data$ /data/dba_scripts/our_restore_script.sh 20190629_2300
Stopping mongod nodes
Restoring backup 20190629_2300 to one node dbpath
Restarting
about to fork child process, waiting until server is ready for connections.
forked process: 21776
child process started successfully, parent exiting
akira@perc01:/data$ ps -C mongod -o pid,args
  PID COMMAND
21776 mongod -f /data/n1/mongod.conf
akira@perc01:/data$ #I'll start the secondaries too
akira@perc01:/data$ rm -rf /data/n2/data/*
akira@perc01:/data$ mongod -f /data/n2/mongod.conf
about to fork child process, waiting until server is ready for connections.
forked process: 21859
child process started successfully, parent exiting
akira@perc01:/data$ rm -rf /data/n3/data/*
akira@perc01:/data$ mongod -f /data/n3/mongod.conf
about to fork child process, waiting until server is ready for connections.
forked process: 21896
child process started successfully, parent exiting
akira@perc01:/data$ ps -C mongod -o pid,args
  PID COMMAND
21776 mongod -f /data/n1/mongod.conf
21859 mongod -f /data/n2/mongod.conf
21896 mongod -f /data/n3/mongod.conf
akira@perc01:/data$ #I'm going to check my important collection is there again
akira@perc01:/data$ mongo ${conn_args}
MongoDB shell version v4.0.10
connecting to: mongodb://localhost:27017/?authSource=admin&gssapiServiceName=mongodb
Implicit session: session { "id" : UUID("e5aa9b27-f26b-4c73-bdc1-bdaf494cf7ab") }
MongoDB server version: 4.0.10
testrs:PRIMARY> use payments
switched to db payments
testrs:PRIMARY> show collections
TheImportantCollection
testrs:PRIMARY> //YES
testrs:PRIMARY> db.TheImportantCollection.count()
174662
testrs:PRIMARY> db.TheImportantCollection.findOne()
{
    "_id" : 0,
    "customer" : {
        "fn" : "Smith",
        "gn" : "Ken",
        "city" : "Georgevill",
        "street1" : "1 Wishful St.",
        "postcode" : "45031"
    },
    "order_ids" : [ ]
}
testrs:PRIMARY> //Yes yes yes ... I live
testrs:PRIMARY>
bye
akira@perc01:/data$ #So the data is back ... but only some time way in the
akira@perc01:/data$ # past. I want to replay up until ...
akira@perc01:/data$ bad_drop_epoch_sec=1562042272
akira@perc01:/data$ #Trick 3: mongorestore always expects a directory name
akira@perc01:/data$ #We don't need any directories, but it's just hard-coded
akira@perc01:/data$ # to expect one. So let's make one. Can be anywhere.
akira@perc01:/data$ # Just not a subdirectory under the oplog dump location
akira@perc01:/data$ #  please, that might confuse it
akira@perc01:/data$ mkdir /tmp/fake_empty_dir
mkdir: cannot create directory ‘/tmp/fake_empty_dir’: File exists
akira@perc01:/data$ #Ah, I got it already.
akira@perc01:/data$ ls /tmp/fake_empty_dir
akira@perc01:/data$ mongorestore ${conn_args} \
>   --oplogReplay \
>   --oplogFile /data/oplog_dump_full/oplog.rs.bson \
>   --oplogLimit ${bad_drop_epoch_sec}:0 \
>   --stopOnError /tmp/fake_empty_dir
2019-07-02T14:04:35.742+0900    preparing collections to restore from
2019-07-02T14:04:35.742+0900    replaying oplog
2019-07-02T14:04:38.715+0900    oplog  5.47MB
2019-07-02T14:04:41.715+0900    oplog  11.0MB
2019-07-02T14:04:44.715+0900    oplog  16.6MB
2019-07-02T14:04:47.715+0900    oplog  22.2MB
2019-07-02T14:04:50.715+0900    oplog  27.6MB
2019-07-02T14:04:53.715+0900    oplog  32.8MB
2019-07-02T14:04:56.715+0900    oplog  37.9MB
2019-07-02T14:04:59.715+0900    oplog  43.0MB
2019-07-02T14:05:02.715+0900    oplog  48.3MB
2019-07-02T14:05:05.715+0900    oplog  53.9MB
2019-07-02T14:05:08.715+0900    oplog  59.5MB
2019-07-02T14:05:11.715+0900    oplog  65.1MB
2019-07-02T14:05:14.715+0900    oplog  70.2MB
2019-07-02T14:05:17.715+0900    oplog  75.0MB
2019-07-02T14:05:20.715+0900    oplog  79.6MB
2019-07-02T14:05:23.715+0900    oplog  84.1MB
2019-07-02T14:05:26.715+0900    oplog  88.5MB
2019-07-02T14:05:29.715+0900    oplog  93.0MB
2019-07-02T14:05:32.715+0900    oplog  97.6MB
2019-07-02T14:05:35.715+0900    oplog  101MB
2019-07-02T14:05:38.715+0900    oplog  104MB
2019-07-02T14:05:41.715+0900    oplog  107MB
2019-07-02T14:05:44.715+0900    oplog  110MB
2019-07-02T14:05:47.715+0900    oplog  113MB
2019-07-02T14:05:50.715+0900    oplog  115MB
2019-07-02T14:05:53.715+0900    oplog  118MB
2019-07-02T14:05:56.715+0900    oplog  123MB
2019-07-02T14:05:59.715+0900    oplog  128MB
2019-07-02T14:06:02.715+0900    oplog  133MB
2019-07-02T14:06:05.715+0900    oplog  138MB
2019-07-02T14:06:08.715+0900    oplog  142MB
2019-07-02T14:06:11.715+0900    oplog  146MB
2019-07-02T14:06:14.715+0900    oplog  151MB
2019-07-02T14:06:17.715+0900    oplog  156MB
2019-07-02T14:06:20.715+0900    oplog  161MB
2019-07-02T14:06:23.715+0900    oplog  166MB
2019-07-02T14:06:26.715+0900    oplog  171MB
2019-07-02T14:06:29.715+0900    oplog  176MB
2019-07-02T14:06:32.715+0900    oplog  181MB
2019-07-02T14:06:35.715+0900    oplog  186MB
2019-07-02T14:06:38.715+0900    oplog  192MB
2019-07-02T14:06:41.715+0900    oplog  197MB
2019-07-02T14:06:44.715+0900    oplog  201MB
2019-07-02T14:06:47.715+0900    oplog  204MB
2019-07-02T14:06:50.715+0900    oplog  206MB
2019-07-02T14:06:53.715+0900    oplog  209MB
2019-07-02T14:06:56.715+0900    oplog  211MB
2019-07-02T14:06:59.715+0900    oplog  213MB
2019-07-02T14:07:02.715+0900    oplog  216MB
2019-07-02T14:07:05.715+0900    oplog  218MB
2019-07-02T14:07:08.715+0900    oplog  220MB
2019-07-02T14:07:11.715+0900    oplog  223MB
2019-07-02T14:07:14.715+0900    oplog  225MB
2019-07-02T14:07:17.715+0900    oplog  227MB
2019-07-02T14:07:17.753+0900    oplog  227MB
2019-07-02T14:07:17.753+0900    done
akira@perc01:/data$ #Yay! I hope! Let's check
akira@perc01:/data$ mongo ${conn_args}
MongoDB shell version v4.0.10
connecting to: mongodb://localhost:27017/?authSource=admin&gssapiServiceName=mongodb
Implicit session: session { "id" : UUID("302f2c26-7416-4e18-bd02-1bd67626d062") }
MongoDB server version: 4.0.10
testrs:PRIMARY> use payments
switched to db payments
testrs:PRIMARY> show collections
TheImportantCollection
TheImportantCollectionV2
testrs:PRIMARY> //Yes! both there!
testrs:PRIMARY> db.TheImportantCollection.count()
174662
testrs:PRIMARY> //plus the 'V2' table I was working on when I made my
testrs:PRIMARY> // 'fat thumb' mistake
testrs:PRIMARY> //There we go, a point-in-time restore from a snapshot
testrs:PRIMARY> // backup + a mongorestore --oplogReplay --oplogFile
testrs:PRIMARY> // operation.
testrs:PRIMARY> //Hold on for one last trick (which I didn't have to use today)
testrs:PRIMARY> // Trick #4: ultimate permissions are sometimes needed.
testrs:PRIMARY> // The config.system.sessions and config.transactions(?)
testrs:PRIMARY> //  system collections are currently unreplayable (3.6, 4.0;
testrs:PRIMARY> //  4.2 TBD).
testrs:PRIMARY> // They are not the only system collections you can get stuck
testrs:PRIMARY> //  on, because system collections are mostly not covered by
testrs:PRIMARY> //  the "backup" and "restore" built-in roles.
testrs:PRIMARY> // E.g. if you are replaying updates to the admin.system.users
testrs:PRIMARY> //  collection, that will fail.
testrs:PRIMARY> // But if you make a *custom* role that grants "anyAction" on
testrs:PRIMARY> //  "anyResource" (see the docs), and grant that to your backup
testrs:PRIMARY> //  and restore user, those will succeed too.
testrs:PRIMARY> //good night
testrs:PRIMARY>
The ‘TLDR’
The oplog of the damaged replica set is your valuable, idempotent history, provided you have a backup recent enough to apply it to.
- Identify your disaster operation's timestamp value in the oplog.
- Before shutting the damaged replica set down, dump its oplog: mongodump connection-args --db local --collection oplog.rs
- (Necessary workaround #1) Add a --query '{"ns": {"$nin": ["config.system.sessions", "config.transactions", "config.transaction_coordinators"]}}' argument to exclude the transaction-related system collections from v3.6 and v4.0 (and maybe 4.2+ too) that can't be restored.
- (Necessary workaround #2) Get rid of the subdirectory structure mongodump makes and keep just the oplog.rs.bson file.
- (Necessary workaround #3) Make a fake, empty directory somewhere else too, to trick mongorestore later.
- Use bsondump oplog.rs.bson | head -n 1 to check that this oplog starts before the time of your last backup.
- Shut the damaged DB down.
- Restore the latest backup from before the disaster.
- (Possibly-required workaround #4) If the oplog updates other system collections, create a user-defined role that grants anyAction on anyResource and grant it to your user as well. (See the special section on system collections below.)
- Replay up to, but not including, the disaster second: mongorestore connection-args --oplogReplay --oplogFile oplog.rs.bson --oplogLimit disaster_epoch_sec:0 /tmp/fake_empty_directory
See the ‘Act 2’ video for the details.
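Condensed into a single shell sketch, the whole procedure looks roughly like the following. This is only an illustrative outline built from the example values used in the demo above (the connection arguments, paths, and the 1562042272 drop timestamp); substitute your own, and treat the snapshot-restore step as whatever your backup tooling provides.

# Example values from the demo above -- adjust for your environment
conn_args="--host localhost:27017 --username akira --password secret --authenticationDatabase admin"
bad_drop_epoch_sec=1562042272

# 1. Dump the live oplog first, excluding the unreplayable transaction-related system collections
mongodump ${conn_args} -d local -c oplog.rs \
  --query '{"ns": {"$nin": ["config.system.sessions", "config.transactions", "config.transaction_coordinators"]}}' \
  --out /data/oplog_dump_full

# 2. Keep only the bson file; drop the "local" subdirectory mongodump created
mv /data/oplog_dump_full/local/oplog.rs.bson /data/oplog_dump_full/
rm -rf /data/oplog_dump_full/local

# 3. Confirm the oplog reaches back before the last snapshot backup
bsondump /data/oplog_dump_full/oplog.rs.bson 2>/dev/null | head -n 1

# 4. Shut the replica set down and restore the latest pre-disaster snapshot
#    (site-specific; the demo uses /data/dba_scripts/our_restore_script.sh 20190629_2300)

# 5. Replay the oplog up to, but not including, the disaster second
mkdir -p /tmp/fake_empty_dir
mongorestore ${conn_args} --oplogReplay \
  --oplogFile /data/oplog_dump_full/oplog.rs.bson \
  --oplogLimit ${bad_drop_epoch_sec}:0 \
  --stopOnError /tmp/fake_empty_dir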
So how did that work?
If you're having the kind of disaster presented in this article, I assume you are already familiar with the mongodump and mongorestore tools and with MongoDB oplog idempotency. Taking that for granted, let's go to the next level of detail.
The applyOps command – Kinda secret; Actually public
In theory you could iterate over the oplog documents and write an application that runs an insert command for each "i" op, an update for each "u" op, various different commands for the "c" ops, and so on. The simpler way is to submit them as they are (well, almost exactly as they are) using the applyOps command, and this is what the mongorestore tool does.
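To make that concrete, here is a minimal mongo shell sketch of applying one oplog-format insert by hand. The namespace matches the demo above, but the document itself is invented for illustration; mongorestore effectively submits batches of the real oplog entries in this same shape.

use admin
// One "i" (insert) op in oplog document format, applied directly on the primary
db.adminCommand({
  "applyOps": [
    {
      "op": "i",
      "ns": "payments.TheImportantCollection",
      "o": { "_id": 999, "customer": { "fn": "Doe", "gn": "Jane" }, "order_ids": [ ] }
    }
  ]
})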
The permission to run applyOps is granted to the “restore” role for all non-system collections, and there is no ‘reject if a primary’ rule. So you can make a primary apply oplog docs like a secondary does.
N.b. for some system collections, the “restore” role is not enough. See the bottom section for more details.
It might seem a bit strange that users can have this privilege, but without it there would be no convenient way for dump-and-restore tools to guarantee consistency. "Consistency" here means that the restored data will be exactly as it was at a single point in time (the end of the dump) and will not contain earlier versions of documents from some midpoint of the dumping process.
Achieving that data consistency is why the --oplog option for mongodump was created, and why mongorestore has the matching --oplogReplay option. (Those two options should be on by default, in my opinion, but they are not.) The short oplog span captured during a normal dump is saved at <dump_directory>/oplog.bson, but the --oplogFile argument lets you choose any arbitrary path.
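For a routine consistent dump-and-restore (no disaster involved), that pairing looks roughly like this; the output path is just an example, and conn_args is the same connection string variable used earlier in the article.

# Dump everything, plus the short oplog span written while the dump was running
mongodump ${conn_args} --oplog --out /backups/mydump
# Restore the data, then replay that oplog span so the result matches the end of the dump
mongorestore ${conn_args} --oplogReplay /backups/mydump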
--oplogLimit
We could have limited the oplog docs during mongodump to only include those before the disaster time with a --query parameter such as the following:
mongodump ... --query '{"ts": {"$lt": new Timestamp(1560915610, 0)}}' ...
But --oplogLimit makes it easier. You can dump everything, but then use --oplogLimit <epoch_sec_value>[:<counter>] when you run mongorestore with the --oplogReplay argument.
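Applied to this article's scenario, with 1562042272 being the epoch second of the accidental drop found in Act 1, the replay command from Act 2 becomes:

mongorestore ${conn_args} --oplogReplay \
  --oplogFile /data/oplog_dump_full/oplog.rs.bson \
  --oplogLimit 1562042272:0 \
  --stopOnError /tmp/fake_empty_dir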
If you're getting confused about whether it's UTC or your server's timezone: it's UTC. All timestamps inside MongoDB are UTC when they represent 'wall clock' times, and for 'logical clocks' a timezone is not an applicable concept.
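A quick way to sanity-check this, using the drop timestamp from Act 1:

date -u -d @1562042272   # Tue Jul  2 04:37:52 UTC 2019 -- matches the "wall" field in the oplog entry
date -d @1562042272      # Tue Jul  2 13:37:52 JST 2019 -- the same instant in the server's local timezone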
When the oplog includes system collection updates
In the built-in roles documentation, tucked in after the usual and mostly fair warnings about why you should not grant users the most powerful internal roles, comes this extra note that tells you what you actually need to do to allow oplog-replay updates on all system collections too:
If you need access to all actions on all resources, for example to run applyOps commands … create a user-defined role that grants anyAction on anyResource and ensure that only the users who need access to these operations have this access.
Translation: if your oplog replay fails because it hit a system collection update that the "restore" role doesn't cover, upgrade your user so it can run with all the privileges that a secondary uses for oplog replication.
use admin
db.createRole({
  "role": "CustomAllPowersRole",
  "privileges": [
    { "resource": { "anyResource": true }, "actions": [ "anyAction" ] }
  ],
  "roles": [ ]
});
db.grantRolesToUser("<bk_and_restore_username>", [ "CustomAllPowersRole" ])

//For afterwards:
//use admin
//db.revokeRolesFromUser("<bk_and_restore_username>", [ "CustomAllPowersRole" ])
//db.dropRole("CustomAllPowersRole")
As an alternative to granting the role shown above, you could restart the mongod nodes with security disabled; in that mode, all operations work without access control restrictions.
It's not quite as simple as that, though, because the transaction-related system collections currently (v3.6, v4.0) throw a spanner in the works. I've found that explicitly excluding config.system.sessions and config.transactions during mongodump is the best way to avoid those updates. They are logically unnecessary in a restore anyway, because the sessions/transactions finished when the replica set was completely shut down.
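If you want to confirm that a filtered oplog dump really contains none of those namespaces before replaying it, a quick sanity check with bsondump (the same tool used in Act 2; the path is the example one from the demo) could be:

bsondump /data/oplog_dump_full/oplog.rs.bson 2>/dev/null \
  | grep -c -e '"config.system.sessions"' -e '"config.transactions"'
# A count of 0 means no entries for those namespaces remain in the dump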