MongoDB 3.4: Sharding ImprovementsTim Vaillancourt
Let’s go over what MongoDB Sharding “is” at a simplified, high level.
The concept of “sharding” exists to allow MongoDB to scale to very large data sets that may exceed the available resources of a single node or replica set. When a MongoDB collection is sharding-enabled, it’s data is broken into ranges called “chunks.” These are intended to be evenly distributed across many nodes or replica sets (called “shards”). MongoDB computes the ranges of a given chunk based on a mandatory document-key called a “shard key.” The shard key is used in all read and write queries to route a database request to the right shard.
The MongoDB ecosystem introduced additional architectural components so that this could happen:
- A shard. A single MongoDB node or replica set used for storing the cluster data. There are usually many shards in a cluster and more shards can be added/removed to scale.
- A “mongos” router. A sharding-aware router for handling client traffic. There can be one or more mongos instances in a cluster.
- The “config servers”. Used for storing the cluster metadata. Config servers are essentially regular MongoDB servers dedicated to storing the cluster metadata within the “config” database. Database traffic does not access these servers, only the mongos.
Under sharding, all client database traffic is directed to one or more of the mongos router process(es), which use the cluster metadata, such as the chunk ranges, the members of the cluster, etc., to service requests while keeping the details of sharding transparent to the client driver. Without the cluster metadata, the routing process(es) do not know where the data is, making the config servers a critical component of sharding. Due to this, at least three config servers are required for full fault tolerance.
Sharding: Chunk Balancer
To ensure chunks are always balanced among the cluster shards, a process named the “chunk balancer” (or simply “balancer”) runs periodically, moving chunks from shard to shard to ensure data is evenly distributed. When a chunk is balanced, the balancer doesn’t actually move any data, it merely coordinates the transfer between the source and destination shard and updates the cluster metadata when chunks have moved.
Before MongoDB 3.4, the chunk balancer would run on whichever mongos process could acquire a cluster-wide balancer lock first. From my perspective this was a poor architectural decision for these reasons:
- Predictability. Due to the first-to-lock nature, the mongos process running the balancer is essentially chosen at random. This can complicate troubleshooting as you try to chase down which mongos process is the active balancer to see what it is doing, it’s logs, etc. As a further example: it is common in some deployments for the mongos process to run locally on application servers and in large organizations it is common for a DBA to not have access to application hosts – something I’ve ran into many times myself.
- Efficiency. mongos was supposed to be a stateless router, not a critical administrative process! As all client traffic passes in-line through the mongos process, it is important for it to be as simple, reliable and efficient as possible.
- Reliability. in order to operate, the mongos process must read and write cluster metadata hosted within the config servers. As mongos is almost always running on a physically separate host from the config servers, any disruption (network, hardware, etc) in between the balancer and config server nodes will break balancing!
Luckily, MongoDB 3.4 has come to check this problem (and many others) off of my holiday wish list!
MongoDB 3.4: Chunk Balancer Moved to Config Servers
In MongoDB 3.4, the chunk balancer was moved to the Primary config server, bringing these solutions to my concerns about the chunk balancer:
- Predictability. The balancer is always running in a single, predictable place: the primary config server.
- Efficiency. Removing the balancer from “mongos” allows it to worry about routing only. Also, as config servers are generally dedicated nodes that are never directly hit by client database traffic, in my opinion this is a more efficient place for the balancer to run.
- Reliability. Perhaps the biggest win I see with this change is the balancer can no longer lose connectivity with the cluster metadata that is stored on separate hosts. The balancer now runs inside the same node as the metadata!
- Centralized. As a freebie, now all the background/administrative components of Sharding are in one place!
Note: although we expect the overhead of the balancer to be negligible, keep in mind that a minor overhead is added to the config server Primary-node due to this change.
See more about this change here: https://docs.mongodb.com/manual/release-notes/3.4/#balancer-on-config-server-primary
MongoDB 3.4: Required Config Server Replica Set
In MongoDB releases before 3.2, the set of cluster config servers received updates using a mode called Sync Cluster Connection Config (SCCC) to ensure all nodes received the same change. This essentially meant that any updates to cluster metadata would be sent N x times from the mongos to the config servers in a fan-out pattern. This is another legacy design choice that always confused me, considering MongoDB already has facilities for reliably replicating data: MongoDB Replication. Plus without transactions in MongoDB, there are some areas where SCCC can fail.
Luckily MongoDB 3.2 introduced Replica-Set based config servers as an optional feature. This moved us away from the SCCC fan-out mode to traditional replication and write concerns for consistency. This brought many benefits: rebuilding a config server node became simpler, backups became more straightforward and flexible and the move towards a consistent method of achieving consistent updates simplified the architecture.
MongoDB 3.4 requires Replica-Set based config servers, and removed the SCCC mode entirely. This might require some changes for some, but I think the benefits outweigh the cost. For more details on how to upgrade from SCCC to Replica-Set based config servers, see this article.
Note: the balancer in MongoDB 3.4 always runs on the config server that is the ‘PRIMARY’ of the replica set.
MongoDB 3.4: Parallel Balancing
As intended, MongoDB Sharded Clusters can get very big, with 10s, 100s or even 1000s of shards. Historically MongoDB’s balancer worked in serial, meaning it could only coordinate 1 x chunk balancing round at any given time within the cluster. On very large clusters, this limitation poses a huge throughput limitation on balancing: all chunk moves have to wait in a serial queue.
In MongoDB 3.4, the chunk balancer can now perform several chunk moves in parallel given they’re between a unique source and destination shard. Given shards: A, B, C and D, this means that a migration from A -> B can now happen at the same time as a migration from C -> D as they’re mutually exclusive source and destination shards. Of course, you need four or more shards to really see the benefit of this change.
Of course, on large clusters this change could introduce a significant change in network bandwidth usage. This is due to the ability for several balancing operations to occur at once. Be sure to test your network capacity with this change.
See more about this change here: https://docs.mongodb.com/manual/release-notes/3.4/#faster-balancing
Of course, there were many other improvements to sharding and other areas in 3.4. We hope to cover more in the future. These are just some of my personal highlights.
For more information about what has changed in the new GA release, see: MongoDB 3.4 Release Notes. Also, please check out our beta release of Percona Server for MongoDB 3.4. This includes all the improvements in MongoDB 3.4 plus additional storage engines and features.