Companies stick to specific database versions because they're proven performers or because it's hard to keep up with frequent releases. But lagging behind has major drawbacks. When it's finally time to upgrade, is it better to update the binaries through each major revision or to skip versions?
TL;DR: Upgrading a MongoDB cluster using backups and skipping versions is not recommended, but this article demonstrates how to upgrade from v3.6 to v7.0 and the issues you might encounter doing it.
Introduction
Keeping our database environments up to date with the vendor is one of the main tasks of database administrators, since it is generally recommended to upgrade regularly to take advantage of the fixes and improvements that come with each major release. It is also common for companies to follow a policy of holding on to a specific version, either because of its ease of use or because later versions show some worsening in performance: as new features are added, the database can become more complex and slower.
It is sometimes hard to stay up to date when major versions are released frequently. We can find ourselves multiple versions behind or, worse, running a version in production that has reached its end of life, resulting in exposure to bugs, absence of new features, reduced support, and possible issues with certifications and audits.
Considering that the standard and recommended upgrade path for MongoDB is to update the binaries through each major version, if you find yourself on MongoDB v3.6 and want to reach something like v6.0 or v7.0, you can already picture going through 3.6 -> 4.0 -> 4.2 -> 4.4 -> 5.0 -> 6.0, with multiple failovers involved in the case of replica sets. If the environment can't have any downtime, multiple driver upgrades are needed, too. It is common to see administrators postpone this task indefinitely, especially when there are multiple servers and environments to upgrade.
Using backups and restores, you could skip the multiple steps of the standard upgrade path, at the risk of running into various issues with your data during and after the restore and having no official support or documentation for the process. Data size also becomes a huge factor, as logically backing up and restoring something like 5TB can be a lengthy and painful task.
Environment
MongoDB version
For this experiment, I deployed:
- One replica set with 3 PSMDB nodes in v3.6.13
- One replica set with 3 PSMDB nodes in v7.0.11
- One PBM agent in v2.5.0 for each node, backing up data to a shared storage location (the ancient PSMDB version is not certified with any PBM version, so we proceed at our own risk here, again). I used another Percona blog post, Configuring Percona Backup for MongoDB in a Multi-Instances Environment, as a reference for this task.
Data
I inserted data divided into two collections containing Binary data subtypes 0, 2, 3, and 4. The second collection had a specific collation (es@collation=search). I also created four indexes and two views per collection.
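The post does not reproduce the exact creation commands; the following is only a minimal sketch of what they can look like in the shell, reusing the collection names, the collation locale mentioned above, and the sample binary values from the document shown in step 2 (everything else is assumed for illustration):

// hypothetical re-creation of the two collections described above
db.getSiblingDB('test').createCollection('collection01');
db.getSiblingDB('test').createCollection('collection02', {
    collation: { locale: 'es@collation=search' }   // collation of the second collection
});

// each document carries Binary subtypes 0, 2, 3 and a UUID (subtype 4)
db.getSiblingDB('test').getCollection('collection02').insertOne({
    name: 'test',
    birthdate: ISODate('1990-01-01T00:00:00Z'),
    bindata0: BinData(0, 'DYi0s/RKVo6ZAxe2MxejgfYc'),
    bindata2: BinData(2, 'eFl1yDNt'),
    bindata3: BinData(3, 'yIBb2tGOVmktaabDPoxX+s25'),
    uuid4: UUID('79007933be384ed8a731f563c764db54')
});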
Process
Preparing for the upgrade
- Backup your MongoDB data to prevent data loss during the upgrade process.
- Check the current MongoDB version using the command $ mongod --version (a quick in-shell check is sketched after this list).
- Ensure you have the latest version of the MongoDB drivers installed.
- Plan the upgrade during a predefined maintenance window to minimize downtime.
- Consider the compatibility of your applications with the new MongoDB version.
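Besides the binary version, it is worth recording the featureCompatibilityVersion of the source cluster before any upgrade work. A quick check from the shell (connection details are assumed):

// run against the source replica set primary
db.version();                                                        // e.g. 3.6.13
db.adminCommand({ getParameter: 1, featureCompatibilityVersion: 1 }); // e.g. { version: "3.6" }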
1. Start the v3.6 nodes and set up PBM
I used mlaunch for this, which helped me spin up three mongod processes quickly.
$ mlaunch init --replicaset --nodes 3 --name rsBlogv36 --binarypath /opt/percona_mongodb/3.6.13/bin --dir /bigdisk/pablo.claudino/blog_upgrade/datav36 --auth --username testuser --password testpwd --port 27047

# start services
$ sudo systemctl start pbm-agent@27047
$ sudo systemctl start pbm-agent@27048
$ sudo systemctl start pbm-agent@27049
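Before moving on, it is worth confirming that the replica set came up healthy. A quick check from the shell, assuming the same ports and credentials as above:

// connect first, e.g.: mongo --port 27047 -u testuser -p testpwd --authenticationDatabase admin
rs.status().members.forEach(function (m) {
    print(m.name + '  ' + m.stateStr);   // expect one PRIMARY and two SECONDARY
});
print(db.version());                     // 3.6.13 here, 7.0.11 on the new set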
2. Insert data, views, and indexes
Here, I used a custom script that generates and inserts 500k large documents into each of the two collections. Example of a document:
rsBlogv70 [primary] test> db.getSiblingDB('test').getCollection('collection01').findOne()
{
  _id: ObjectId('6697c03c2582286038a26a25'),
  name: '±bÂs',
  lastname: '9jÇi',
  birthdate: ISODate('1983-05-22T08:59:56.000Z'),
  age: 91,
  company: 'EOKgh',
  receive_sms: false,
  uuid4: UUID('79007933-be38-4ed8-a731-f563c764db54'),
  bindata0: Binary.createFromBase64('DYi0s/RKVo6ZAxe2MxejgfYc', 0),
  bindata2: Binary.createFromBase64('eFl1yDNt', 2),
  bindata3: Binary.createFromBase64('yIBb2tGOVmktaabDPoxX+s25', 3),
  self_bio: 'PSÕoÃáKN**d671VÕUR!º0/xhRÃEuCH±3înQé52îE-vh±/óBEj?kz&Oç8PTgÚTwú&Khó8%zmѪ9ãúc2IF0Tx2óJôdSy§ÚÂsG2féYC0â5Ñ40ñ7ÚkaÃiábqSETDÊ6Jâwk(ã?êûhQ+!MÔÎ5Úéw0aɪáãoÉPWOQ§=y§íSvÇ2/2/DowlK0áíÃjãYêdQ0Ç%fÁYô&bK3yqaXgêÁ8XAdêªãîGçªHh(1SeoÉtSÕwETK8hVÔÔÁãel9Q+ÑâYEÎ?vºÚvTê!AhºêûMI09!ôXCWç#ÚÁe%uEGKUla*vwo4êû(JbÎSn4±ñ±íã*cE#í=ôédÑCHrxqq9±PÇ7jvbÍYÍóãÍ*iQAnm7UÊ2ééÊ6l-/aJpãÁr9K3éAweÛÚYsgÕqÂÇãDmxéÉmFt&EpÃha7TCz0ÎkãZlzÍLj±l7zîpiû/îOîyvzxÃg1OPtêmsQ?fQu8lCwwÔÔñewZºTõº74s%ÉL!ûAãrXmÃÎ_ÚNgGwt_íÎIîbu9SóXMftZÍZIá2e_ÔcM_ê1mÉÊSmS§CûÃS§3ÉÂÇñ5TºîÚÑY=IÑuôú%bGj0m(CÉKiM9(É#Dk!ñkDVvx7on/RTOÊÊ&ooRrBzªdñô7MǪ%/ªDya§Hoã)êúÃ1ñjõÍʪÇÉñTÂx=1âÃH!Zñ/cÛ!c=xPoõ=cñªJâ*Õjuu(é+Ç7n)59T&DbK!g#_M!GfpHHºwwRxîîíRÍ8knâMÎ(foeEârãêâumîDÑbçusRásãYra-báît-Âõg±3_óÊÚeãhaÕ1U0G§û&Q&cñSmyUÎôhmA&1/TlrÛÉ8góãçO-úh=Gd!r&ãS6ÁITQÛ±ôOX6KçHÓ9§º/EÛoñ?nûóLFZugdhf?hÇ/ÇtDpNÉ-ÚzêVÔAûÑOÊâ/lmÓáR=l!PÛlF/PbõtLD6±3FÊ/jGE*Zqhlãôx/ÔÊí0ÉÎlSLÂBcpM9)6óáL&Rg&&t&Ea9CFzsÃv6RÇg5LÎÊmúõêéô/t&&#Y-dxhnJjKáôó&3x!ãçQÓbs9Pªr9qo§GR&M#kÃî6ZÃÓ4L*)!/MKlxÁGrL3ó/â2á4gR)xÉZ2FHlh*±ÉVKpóôûI_EN)vb?yóÎÁ%*!x7ÃdJêiã/çóa§%D%AóQÁ'
}
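The generator itself is a custom script that is not reproduced in the post. For reference, a simplified, hypothetical sketch of the idea: random field values and binary payloads, inserted in batches:

// hypothetical simplification of the custom generator script
function randomBase64(len) {                       // len must be a multiple of 4
    var chars = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';
    var s = '';
    for (var i = 0; i < len; i++) { s += chars[Math.floor(Math.random() * chars.length)]; }
    return s;
}

var coll = db.getSiblingDB('test').getCollection('collection01');
var batch = [];
for (var i = 0; i < 500000; i++) {
    batch.push({
        name: randomBase64(4),
        lastname: randomBase64(4),
        birthdate: new Date(Date.now() - Math.floor(Math.random() * 2e12)),
        age: Math.floor(Math.random() * 100),
        receive_sms: Math.random() < 0.5,
        bindata0: BinData(0, randomBase64(16)),
        bindata2: BinData(2, randomBase64(8)),
        bindata3: BinData(3, randomBase64(16))
    });
    if (batch.length === 1000) { coll.insertMany(batch); batch = []; }   // insert in batches of 1000
}
if (batch.length > 0) { coll.insertMany(batch); }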
Indexes and views:
# create indexes
db.getSiblingDB('test').getCollection('collection01').createIndex({name: 1, lastname: 1})
db.getSiblingDB('test').getCollection('collection01').createIndex({birthdate: 1, age: 1})
db.getSiblingDB('test').getCollection('collection01').createIndex({bindata0: 1, bindata2: 1})
db.getSiblingDB('test').getCollection('collection01').createIndex({bindata3: 1, uuid4: 1})
db.getSiblingDB('test').getCollection('collection02').createIndex({name: 1, lastname: 1})
db.getSiblingDB('test').getCollection('collection02').createIndex({birthdate: 1, age: 1})
db.getSiblingDB('test').getCollection('collection02').createIndex({bindata0: 1, bindata2: 1})
db.getSiblingDB('test').getCollection('collection02').createIndex({bindata3: 1, uuid4: 1})

# create views
db.getSiblingDB('test').createView('my_2024_data', 'collection01', [{ $match: { birthdate: { $gte: ISODate('2024-01-01 00:00:00Z')}}}])
db.getSiblingDB('test').createView('my_bin_data', 'collection01', [{ $match: { bindata2: BinData(2, 'AAAAAA==')}}])
db.getSiblingDB('test').createView('my_2024_data_02', 'collection02', [{ $match: { birthdate: { $gte: ISODate('2024-01-01 00:00:00Z')}}}])
db.getSiblingDB('test').createView('my_bin_data_02', 'collection02', [{ $match: { bindata2: BinData(2, 'AAAAAA==')}}])
3. Take a logical backup
A warning from PBM reminds us that the version is not supported, but the backup executes anyway.
$ pbm backup
WARNING: Unsupported MongoDB version. PBM works with v4.4, v5.0, v6.0, v7.0
Starting backup '2024-07-15T16:34:01Z'....Backup '2024-07-15T16:34:01Z' to remote store '/bigdisk/pablo.claudino/blog_upgrade/backup' has started
4. Keep inserting data to create PITR chunks
As PITR is enabled in PBM, I inserted 500k more documents after the backup to have something in the oplog to replay. The relevant PBM configuration:
pitr:
  enabled: true
  oplogSpanMin: 20
5. Start up the v7.0 nodes and set up PBM
$ mlaunch init --replicaset --nodes 3 --name rsBlogv70 --binarypath /opt/percona_mongodb/7.0.11/bin --dir /bigdisk/pablo.claudino/blog_upgrade/datav70 --auth --username testuser --password testpwd --port 27057

# start services
$ sudo systemctl start pbm-agent@27057
$ sudo systemctl start pbm-agent@27058
$ sudo systemctl start pbm-agent@27059
6. Restore the logical backup and the PITR
PBM has a command called pbm status that returns, among other information, the snapshots and point-in-time intervals available for restoration:
$ pbm status
Snapshots:
  2024-07-17T18:57:41Z 1.05GB <logical> [restore_to_time: 2024-07-17T18:57:58Z]
  2024-07-17T18:48:31Z 1.05GB <logical> [restore_to_time: 2024-07-17T18:48:50Z]
PITR chunks [1.33GB]:
  2024-07-17T18:48:51Z - 2024-07-18T12:38:20Z
So I restored to the latest point in time available to make sure the oplog would be replayed too:
$ pbm restore --time="2024-07-18T12:38:20" --replset-remapping="rsBlogv70=rsBlogv36"
Starting restore 2024-07-18T12:49:26.909837891Z to point-in-time 2024-07-18T12:38:20 from '2024-07-17T18:57:41Z'...
7. Check data on both sides
After the restore was completed, I first checked whether the counts on both sides matched using stats():
# 3.6
Collections: 3, Views: 4, Indexes: 11, Documents: 1500000
Collection: collection01, Count: 1000000, Indexes: 5, Storage Size: 1610.3984MB
Collection: collection02, Count: 500000, Indexes: 5, Storage Size: 805.7109MB

# 7.0
Collections: 3, Views: 4, Indexes: 11, Documents: 1500000
Collection: collection01, Count: 1000000, Indexes: 5, Storage Size: 1612.6641MB
Collection: collection02, Count: 500000, Indexes: 5, Storage Size: 807.1328MB
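The summary above was produced by another small helper script that is not shown in the post. A rough sketch of how such numbers can be gathered per replica set (output format assumed from the listing above):

// run once against each replica set
var testDb = db.getSiblingDB('test');
var totalDocs = 0, totalIndexes = 0;
testDb.getCollectionInfos({ type: 'collection' }).forEach(function (ci) {
    var s = testDb.getCollection(ci.name).stats();
    totalDocs += s.count;
    totalIndexes += s.nindexes;
    print('Collection: ' + ci.name + ', Count: ' + s.count +
          ', Indexes: ' + s.nindexes +
          ', Storage Size: ' + (s.storageSize / 1024 / 1024).toFixed(4) + 'MB');
});
var views = testDb.getCollectionInfos({ type: 'view' }).length;
print('Views: ' + views + ', Indexes: ' + totalIndexes + ', Documents: ' + totalDocs);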
After the counts matched, I used another script to validate whether the data in both collections was identical on both sides:
ls_fieldlist.forEach(function(field_name){
    if (result1[field_name].toString() != result2[field_name].toString()){
        print('_id: ' + result1['_id'] + ', ' + field_name + ' does not match')
    }
})
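The snippet above is only the core loop of that script. A self-contained sketch of the overall idea, assuming the other replica set is reachable from the same shell session (the connection string, field list, and variable names besides db2 are illustrative):

// illustrative: compare documents field by field between the two clusters
var conn2 = new Mongo('mongodb://testuser:testpwd@localhost:27047/?authSource=admin&replicaSet=rsBlogv36');
var db2 = conn2.getDB('test');   // db2 points at the other replica set (here, hypothetically, the v3.6 source)
var ls_fieldlist = ['name', 'lastname', 'birthdate', 'age', 'bindata0', 'bindata2', 'bindata3', 'uuid4'];

db.getSiblingDB('test').getCollection('collection01').find().forEach(function (result1) {
    var result2 = db2.getCollection('collection01').findOne({ _id: result1._id });
    if (result2 === null) {
        print('_id: ' + result1['_id'] + ' missing on the other side');
        return;
    }
    ls_fieldlist.forEach(function (field_name) {
        if (result1[field_name].toString() != result2[field_name].toString()) {
            print('_id: ' + result1['_id'] + ', ' + field_name + ' does not match');
        }
    });
});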
Surprisingly or not, some of the data did not match:
$ cat /bigdisk/pablo.claudino/blog_upgrade/scripts/data_compare.out | grep -B1 -a "does not match" | head -n 2
Datetime: "2024-07-19T13:07:29.889Z". Progress (1000): 499000. Total: 1000000
Collection: collection01, _id: 669815aa978f359f3fa26a62, bindata2 does not match: ,
When I visually compared the entries on both sides, it was clear that the binary field with subtype 2 had issues during the restore:
rsBlogv70 [primary] test> db.getSiblingDB('test').getCollection('collection01').find({_id: ObjectId('669815aa978f359f3fa26a62')},{bindata2:1})
[
  {
    _id: ObjectId('669815aa978f359f3fa26a62'),
    bindata2: Binary.createFromBase64('AAAAAA==', 2)
  }
]

rsBlogv70 [primary] test> db2.getSiblingDB('test').getCollection('collection01').find({_id: ObjectId('669815aa978f359f3fa26a62')},{bindata2:1})
[
  {
    _id: ObjectId('669815aa978f359f3fa26a62'),
    bindata2: Binary.createFromBase64('', 2)
  }
]
One interesting discovery here is that, since the mismatches only started after the 500k mark, the problem is related to the oplog replay process, not to the dump and restore itself: the first 500k documents were inserted before the backup and the last 500k after it.
Although investigating the issue with the oplog replay is outside the scope of this post, I ran a couple of backups and restores (with and without PBM), and the issue does not happen when oplog replay is not involved.
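As an illustration of where such an investigation could start, the source oplog entry for one of the mismatching documents can be inspected directly. This is a hypothetical query, not something done in the post, and it requires access to the local database on the source (v3.6) replica set:

// look up the original insert entry for a mismatching _id in the source oplog
db.getSiblingDB('local').getCollection('oplog.rs').find(
    { ns: 'test.collection01', 'o._id': ObjectId('669815aa978f359f3fa26a62') }
).forEach(function (entry) { printjson(entry.o.bindata2); });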
I also redid the process using mongodump + mongorestore with oplog replay instead of PBM. The process finished successfully:
/opt/mongodb/mongodb-database-tools-rhel80-x86_64-100.9.5/bin/mongodump --host localhost --port 27047 --username testuser -p "XXXX" --authenticationDatabase=admin --oplog --out '/bigdisk/pablo.claudino/blog_upgrade/dump'

/opt/mongodb/mongodb-database-tools-rhel80-x86_64-100.9.5/bin/mongorestore --host localhost --port 27057 --username testuser -p "XXXX" --oplogReplay --authenticationDatabase=admin '/bigdisk/pablo.claudino/blog_upgrade/dump'
(...)
2024-07-19T14:17:34.472+0000 780452 document(s) restored successfully. 0 document(s) failed to restore.
But the data inconsistency problem also happened:
Datetime: "2024-07-19T15:02:33.086Z". Progress (1000): 515000. Total: 522289
Collection: collection01, _id: 669a733aa64d8146b5a28928, bindata2 does not match: ,
Collection: collection01, _id: 669a733aa64d8146b5a28933, bindata2 does not match: ,
Post-upgrade tasks and best practices
- Update your applications to use the latest MongoDB drivers.
- Take a backup of your data after the upgrade.
- Monitor the MongoDB instance for any issues or errors.
- Consider implementing a regular backup and maintenance schedule.
- Keep your MongoDB instance up-to-date with the latest security patches and updates.
Conclusion
Even though the restore into a different engine version appears to succeed, silent errors can occur, and data can be changed in the process. This may go unnoticed until incorrect data has already been used in our applications, reports, forms, and such.
The tested, approved, supported, and recommended way is to follow the instructions in the documentation, and if you need help, Percona experts are available. If you want to proceed down this path anyway, you should validate your data and test your applications against it thoroughly in a non-production environment before moving on to production.
Comments
How can I check consistency after transferring MongoDB?
For example, collection counts.
And for MongoDB in general, to what level should consistency be checked after a transfer?
In my case, with MySQL, I monitor the number of objects / row count per table.
Hi Lee,
In the case reported in the blog, a row count is not enough to ensure data consistency: all the counts match, yet the data on the source server differs from the data on the target due to the issue in the oplog replay.
To really ensure data consistency, you would need to compare the contents of the source's documents with those on the target after the restore.