In this blog post, we will discuss how to migrate data from MongoDB Atlas to self-hosted MongoDB. Several third-party tools can migrate data from Atlas to Percona Server for MongoDB (PSMDB), such as MongoPush, Hummingbird, and MongoShake. Today, we will use MongoShake to migrate and sync data from Atlas to PSMDB.

NOTE: These tools are not officially supported by Percona.

MongoShake is a powerful tool that facilitates the migration of data from one MongoDB cluster to another. These are step-by-step instructions on how to install and utilize MongoShake for data migration from Atlas to PSMDB. So, let’s get started!

Prerequisites:

A MongoDB Atlas account. I created a test account (replica set) and loaded sample data with one click in Atlas:

  1. Create an account in Atlas.
  2. Create a cluster.
  3. Once a cluster is created, go to browse collections.
  4. Atlas will prompt you to load sample data. Once you click on it, you will see the sample data like below.

An EC2 instance with PSMDB installed. I installed PSMDB on the EC2 machine:

Make sure Atlas and PSMDB are running the same MongoDB version (I have also used this tool on MongoDB 4.2, which is already EOL).

PSMDB version:

MongoDB Atlas version:

To install MongoShake, follow these steps:

Step 1: Install Go
Ensure that Go is installed on your system. If not, download it from the official website and follow the installation instructions. I used Amazon Linux 2, so I used the command below to install Go:

Step 2: Install MongoShake
Open the terminal and run the following command to install MongoShake:

  1. Untar the file; it will create a folder named MongoShake.
  2. cd MongoShake.
  3. Run the ./build.sh script.

Once you have installed MongoShake, you need to configure it for the migration process. Here’s how:

  1. The configuration file (collector.conf) is in the conf directory under the MongoShake directory.
  2. In the config file, you can set the URIs for both replica sets and sharded clusters, as well as the tunnel method (how the data is migrated). If you are migrating directly, set the value to direct. You can also edit the log file path and log file name. Below are some important parameters:

    sync_mode options: all/full/incr.
  • all means full synchronization + incremental synchronization (copy the data, then apply the oplog once the copy completes).
  • full means full synchronization only (only copy the data).
  • incr means incremental synchronization only (only apply the oplog).
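Before starting a long migration, it can help to sanity-check the configuration file. Below is a minimal Python sketch that parses "key = value" lines and rejects a sync_mode outside all/full/incr. The helper names (parse_conf, check_sync_mode) and the sample excerpt with its placeholder URI are hypothetical illustrations, not part of MongoShake, and the sketch assumes the file uses simple "key = value" lines with "#" comments.

```python
VALID_SYNC_MODES = {"all", "full", "incr"}


def parse_conf(text):
    """Parse 'key = value' lines, skipping blank lines and '#' comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so URIs containing '=' stay intact.
        key, sep, value = line.partition("=")
        if not sep:
            continue
        settings[key.strip()] = value.strip()
    return settings


def check_sync_mode(settings):
    """Return the configured sync_mode, or raise if it is missing or invalid."""
    mode = settings.get("sync_mode", "")
    if mode not in VALID_SYNC_MODES:
        raise ValueError(f"sync_mode must be one of {sorted(VALID_SYNC_MODES)}, got {mode!r}")
    return mode


# Hypothetical excerpt for illustration; the URI is a placeholder.
SAMPLE = """
# source cluster
mongo_urls = mongodb://user:pass@source-host:27017/?replicaSet=rs0
sync_mode = all
tunnel = direct
"""

if __name__ == "__main__":
    print(check_sync_mode(parse_conf(SAMPLE)))
```

To check a real file, read it with open(path).read() and pass the text to parse_conf.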

There are other parameters in the configuration file that you can tune as per your needs. For example, if you want to read data from a secondary node so that the reads do not overwhelm the primary, you can set the mongo_connect_mode parameter (for example, to secondaryPreferred):

Step 3: Once you are done with the configuration, run MongoShake in a screen session like the one below:

Step 4: Monitor the log file in the log directory to check the progress of migration.

Below is the sample log when you start MongoShake:

You will see the log below once the full sync is complete and incr starts (incr means MongoShake starts syncing live data via the oplog):

You will see the logs like this when both nodes are in sync (when lag is 0, i.e., tps=0):

Once the full data replication is complete and both clusters are in sync, you can stop pointing the application to Atlas. Check the MongoShake logs, and when the lag is 0, as in the logs above, stop MongoShake to end the replication/sync from Atlas. Then verify that the data has been successfully migrated to PSMDB. You can use the MongoDB shell or any other client to connect to the PSMDB instance for this.
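If you prefer to script the verification, here is a minimal Python sketch that compares per-collection document counts between the two clusters. The function names (diff_counts, collect_counts) are hypothetical. collect_counts assumes you pass it an already connected pymongo MongoClient, and skipping the admin, local, and config databases is an assumption about what you want to compare.

```python
def diff_counts(source, target):
    """Return {namespace: (source_count, target_count)} for every mismatch.

    A count of None means the namespace is absent on that side.
    """
    mismatches = {}
    for ns in sorted(set(source) | set(target)):
        s, t = source.get(ns), target.get(ns)
        if s != t:
            mismatches[ns] = (s, t)
    return mismatches


def collect_counts(client, skip=("admin", "local", "config")):
    """Build {"db.collection": document_count} from a connected MongoClient."""
    counts = {}
    for db_name in client.list_database_names():
        if db_name in skip:
            continue
        db = client[db_name]
        for coll_name in db.list_collection_names():
            counts[f"{db_name}.{coll_name}"] = db[coll_name].count_documents({})
    return counts


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    atlas = {"shop.orders": 120, "shop.users": 45}
    psmdb = {"shop.orders": 120, "shop.users": 44}
    for ns, (s, t) in diff_counts(atlas, psmdb).items():
        print(f"{ns}: source={s} target={t}")
```

In practice you would call collect_counts once per cluster and pass both results to diff_counts; an empty result means the counts match. Matching counts are a quick check, not proof that every document is identical.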

MongoDB Atlas databases and their collection count:


PSMDB databases and their collection count:

Above, you can see we have verified data in PSMDB. Now, update the connection string of the application to point to PSMDB.

NOTE: Sometimes, during the migration process, some indexes may not be replicated. So, during data verification, please check the indexes as well, and if an index is missing, create it before the cutover.
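The index check can be scripted too. Below is a minimal Python sketch that reports source indexes whose key pattern is absent on the target. The function name missing_indexes is hypothetical, and the sketch assumes the input dictionaries have the shape returned by pymongo's Collection.index_information() (index name mapped to a document with a "key" list of field/direction pairs). It compares key patterns only, so options such as unique or TTL settings still need a manual look.

```python
def _key_pattern(info):
    """Normalize an index's key spec into a hashable tuple of (field, direction)."""
    return tuple(tuple(pair) for pair in info["key"])


def missing_indexes(source_info, target_info):
    """Return sorted names of source indexes whose key pattern is not on the target."""
    target_patterns = {_key_pattern(info) for info in target_info.values()}
    return sorted(
        name
        for name, info in source_info.items()
        if _key_pattern(info) not in target_patterns
    )


if __name__ == "__main__":
    # Hypothetical index metadata for illustration only.
    atlas_indexes = {
        "_id_": {"key": [("_id", 1)]},
        "email_1": {"key": [("email", 1)], "unique": True},
    }
    psmdb_indexes = {"_id_": {"key": [("_id", 1)]}}
    print(missing_indexes(atlas_indexes, psmdb_indexes))
```

With pymongo, you would gather the inputs by calling index_information() on the same collection in each cluster, then recreate anything reported as missing with create_index on the PSMDB side.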

Conclusion

MongoShake simplifies the process of migrating MongoDB data from Atlas to self-hosted MongoDB. Percona experts can assist you with migration as well. By following the steps outlined in this blog, you can seamlessly install, configure, and utilize MongoShake for migrating your data from MongoDB Atlas.

To learn more about the enterprise-grade features available in the license-free Percona Server for MongoDB, we recommend going through our blog MongoDB: Why Pay for Enterprise When Open Source Has You Covered? 

Percona Distribution for MongoDB is a freely available MongoDB database alternative, giving you a single solution that combines the best and most important enterprise components from the open source community, designed and tested to work together.

 

Download Percona Distribution for MongoDB Today!


2 Comments
lee

Hello, I am contacting you with a few questions after reviewing your post.

  1. In the MongoShake collector.conf file, mongo_urls is for the source (Atlas) and tunnel.address is for the target (on-prem). Should these be connected to the mongos of each respective cluster?
  • Example:
  • mongo_urls: Atlas mongos
  • tunnel.address: Target on-prem mongos
  2. If sync_mode is set to full, will there be a significant increase in service resource usage on the source side?
  • Of course, if we proceed, we plan to use the mongo_connect_mode = secondaryPreferred option to pull data from a secondary.
  3. Regarding the target sharded cluster, I am curious if the config DB metadata was restored from a dump of the source's information, or if only the sharding configuration was matched to be identical.
Gautam

Hi Lee, yes, you need to use the source and target mongos of each respective cluster.
As I said in the blog post, sync_mode full means only copying the data. Yes, it could increase resource usage on the source side; you need to check how many parallel threads you can run given your peak load on the primary node. I can't predict this, as I am not aware of your environment, workload, and setup. Yes, config DB data will be copied via MongoShake too. This blog is 2 years old, so I would recommend checking the MongoShake release notes for recent versions, as there have been more improvements in the tool.