In my last post, I described a new feature in TokuMX 1.5—partitioned collections—that’s aimed at making it easier and faster to work with time series data. Feedback from that post made me realize that some users may not immediately understand the differences between partitioning a collection and sharding a collection. In this post, I hope to clear that up.
On the surface, partitioning a collection and sharding a collection seem similar. Both actions take a collection and break it into smaller pieces for some performance benefit. Also, the terms are sometimes used interchangeably when discussing other technologies. But for TokuMX, the two features are very different in purpose and implementation. In describing each feature’s purpose and implementation, I hope to clarify the differences between the two features.
Let’s address sharding first.
The purpose of sharding is to to distribute a collection across several machines (i.e. “scale-out”) so that writes and queries on the collection will be distributed. The main idea is that for big data, a single machine can only do so much. No matter how powerful your one machine is, that machine will still be limited by some resource, be it IOPS, CPU, or disk space. So, to get better performance for a collection, one can use sharding to distribute the collection across several machines, and thereby improve performance by increasing the amount of hardware.
To perform these tasks, a sharded collection ought to have a relatively even distribution across shards. Therefore, it should have the following properties:
Because of these properties, each shard contains a random subset of the collection’s data.
Now let’s address partitioning.
The purpose of partitioning is to break the collection into smaller collections so that large chunks of data may be removed very efficiently. A typical example is keeping a rolling period of 6 months of log data for a website. Another example is keeping the last 14 days of oplog data, as we do via partitioning in TokuMX 1.4. In such examples, typically only one partition (the latest one) is getting new data. Periodically, but infrequently, we drop the oldest partition to reclaim space. For the log data example, once a month we may drop a month’s worth of data. For the oplog, once a day we drop a day’s worth of data.
To perform these tasks, we are not concerned with load distribution, as nearly all writes are typically going to the last partition. We are not spreading partitions across machines. With partitioning, each partition holds a continuous range of the data (e.g. all data from the month of February), whereas with sharding, each shard holds small random chunks of data from across the key space.
With all this being said, there are still similarities when thinking of schema design with a partitioned collection and a sharded collection. As I touched on in my last post, designing a partition key has similarities to designing a shard key as far as queries are concerned. Queries on a sharded collection perform better if they target single shards. Similarly, queries on a partitioned collection perform better if they target a single partition. Queries that don’t can be thought of as “scatter/gather” for both sharded and partitioned collections.
Hopefully this illuminates the difference between a partitioned collection and a sharded collection.