Note: I will cover this subject in my webinar How To Scale with MongoDB on Wednesday, October 18, 2017, at 11:00 am PDT / 2:00 pm EDT (UTC-7).
Sharding is the most complex architecture you can deploy using MongoDB, and there are two main approaches as to when to shard or not. The first is to configure the cluster as soon as possible – when you predict high throughput and fast data growth.
The second says you should use a cluster as the best alternative when the application demands more resources than the replica set can offer (such as low memory, an overloaded disk or high processor load). This approach is more corrective than preventative, but we will discuss that in the future.
1) Disaster recovery plan
Disaster recovery (DR) is a very delicate topic: how long would you tolerate an outage? If necessary, how long would it take you to restore the entire database? Depending on the database size and on disk speed, a backup/restore process might take hours or even days!
There is no hard number in Gigabytes to justify a cluster. But in general, you should engage when the database is more than 200GB the backup and restore processes might take a while to finish.
Let’s consider the case where we have a replica set with a 300GB database. The full restore process might last around four hours, whereas if the database has two shards, it will take about two hours – and depending on the number of shards we can improve that time. Simple math: if there are two shards, the restore process takes half of the time to restore when compared to a single replica set.
2) Hardware limitations
Disk and memory are inexpensive nowadays. However, this is not true when companies need to scale out to high numbers (such as TB of RAM). Suppose your cloud provider can only offer you up to 5,000 IOPS in the disk subsystem, but the application needs more than that to work correctly. To work around this performance limitation, it is better to start a cluster and divide the writes among instances. That said, if there are two shards the application will have 10000 IOPS available to use for writes and reads in the disk subsystem.
3) Storage engine limitations
There are a few storage engine limitations that can be a bottleneck in your use case. MMAPv2 does have a lock per collection, while WiredTiger has tickets that will limit the number of writes and reads happening concurrently. Although we can tweak the number of tickets available in WiredTiger, there is a virtual limit – which means that changing the available tickets might generate processor overload instead of increasing performance. If one of these situations becomes a bottleneck in your system, you start a cluster. Once you shard the collection, you distribute the load/lock among the different instances.
4) Hot data vs. cold data
Several databases only work with a small percentage of the data being stored. This is called hot data or working set. Cold data or historical data is rarely read, and demands considerable system resources when it is. So why spend money on expensive machines that only store cold data or low-value data? With a cluster deployment we can choose where the cold data is stored, and use cheap devices and disks to do so. The same is true for hot data – we can use better machines to have better performance. This methodology also speeds up writes and reads on the hot data, as the indexes are smaller and add less overhead to the system.
5) Geo-distributed data
It doesn’t matter whether this need comes from application design or legal compliance. If the data must stay within continent or country borders, a cluster helps make that happen. It is possible to limit data localization so that it is stored solely in a specific “part of the world.” The number of shards and their geographic positions is not essential for the application, as it only views the database. This is commonly used in worldwide companies for better performance, or simply to comply with the local law.
6) Infrastructure limitations
Infrastructure and hardware limitations are very similar. When thinking about infrastructure, however, we focus on specific cases when the instances should be small. An example is running MongoDB on Mesos. Some providers only offer a few cores and a limited amount of RAM. Even if you are willing to pay more for that, it is not possible to purchase more than they offer as their products. A cluster provides the option to split a small amount of data among a lot of shards, reaching the same performance a big and expensive machine provides.
7) Failure isolation
Consider that a replica set or a single instance holds all the data. If for any reason this instance/replica set goes down, the whole application goes down. In a cluster, if we lose one of the five shards, 80% of the data is still available. Running a few shards helps to isolate failures. Obviously, running a bunch of instances makes the cluster prone to have a failed instance, but as each shard must have at least three instances the probability of the entire shard being down is minimal. For providers that offer different zones, it is good practice to have different members of the shard in different availability zones (or even different regions).
8) Speed up queries
Queries can take too long, depending on the number of reads they perform. In a clustered deployment, queries can run in parallel and speed up the query response time. If a query runs in ten seconds in a replica set, it is very likely that the same query will run in five to six seconds if the cluster has two shards, and so on.
I hope this helps with MongoDB sharding. Having a cluster solves several other problems as well, and we have listed only a few of them. Don’t miss our webinar regarding scaling out MongoDB next Wednesday, October 18, 2017!