Tokutek is pleased to announce today’s release of TokuMX v1.5. Also worth noting: TokuMX turns exactly one year old tomorrow. But enough about birthdays; on to the features!
This release brings with it the ability to partition a collection in unsharded TokuMX deployments. Zardosht Kasheff, one of Tokutek’s engineers, did an excellent job covering everything you want to know about partitioned collections in these four blog posts.
What I’ll cover in the remainder of this blog is a benchmark showing how powerful partitioned collections can be for a very common use case: time-series data. Oftentimes the data is interesting for a certain amount of time, likely a number of days, weeks, or months. When that time limit is reached, the data is no longer valuable and can be removed from the database. Consider the following schema for tracking weather information:
```
{
  timestamp : 2014-06-18 15:00:00 UTC,
  sensorid : 1,
  temperature : 55,
  humidity : 34.4
}
```
If our application required us to maintain temperature information for a year we might be tempted to use a TTL Index or write a cron job to delete temperature readings from the database that are more than a year old. Both of these solutions work just fine when you are ingesting data slowly, but what do you do if the new data is arriving at high velocity (say greater than 10,000 inserts per second, or even much higher)? TokuMX supports inserting data at rates much higher than that, especially on modern hardware.
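To make the cron-job option concrete, here is a minimal sketch of what such a deleter might do. This is plain Python operating on an in-memory list standing in for the collection; the function name `prune_old_docs` and the batching details are my own illustration, not a TokuMX API. A real job would issue equivalent batched remove commands against the server.

```python
from datetime import datetime, timedelta

def prune_old_docs(docs, cutoff, batch_size=10_000):
    """Remove readings older than `cutoff`, batch_size at a time.

    `docs` stands in for the collection: a list of dicts with a
    'timestamp' field. Returns the number of documents removed.
    """
    removed = 0
    while True:
        batch = [d for d in docs if d["timestamp"] < cutoff][:batch_size]
        if not batch:
            return removed
        batch_ids = {id(d) for d in batch}
        docs[:] = [d for d in docs if id(d) not in batch_ids]
        removed += len(batch)

# One year of retention, matching the example schema above.
now = datetime(2014, 6, 18, 15, 0, 0)
docs = [{"timestamp": now - timedelta(days=n), "temperature": 55}
        for n in range(0, 800, 100)]            # readings 0..700 days old
removed = prune_old_docs(docs, cutoff=now - timedelta(days=365))
```

With the sample data above, the four readings older than 365 days are removed and the rest survive.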
The question hinges on one important fact: deleting data from collections with secondary indexes is hard. It’s hard because deleting a document in TokuMX requires a look-up of the document itself in order to get the existing values of the fields that make up the secondary indexes, so the corresponding entries in those indexes can be removed as well. This delete behavior isn’t unique to TokuMX; deletes induce the same read behavior in MongoDB, TokuDB, and InnoDB.
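The extra read is easy to see in a toy model. The sketch below uses my own simplified in-memory structures (not TokuMX internals): a primary store keyed by `_id` plus one secondary index on `sensorid`. Note that `delete_by_id` must fetch the document first, because without it there is no way to know which secondary-index entry points back at that `_id`.

```python
class ToyCollection:
    """Primary store keyed by _id, plus one secondary index on a field."""

    def __init__(self, indexed_field):
        self.indexed_field = indexed_field
        self.primary = {}       # _id -> document
        self.secondary = {}     # indexed field value -> set of _ids

    def insert(self, _id, doc):
        self.primary[_id] = doc
        self.secondary.setdefault(doc[self.indexed_field], set()).add(_id)

    def delete_by_id(self, _id):
        # The delete must READ the document first: the secondary-index
        # entry to remove depends on the field values stored in it.
        doc = self.primary.pop(_id)
        ids = self.secondary[doc[self.indexed_field]]
        ids.discard(_id)
        if not ids:
            del self.secondary[doc[self.indexed_field]]

coll = ToyCollection("sensorid")
coll.insert(1, {"sensorid": 7, "temperature": 55})
coll.insert(2, {"sensorid": 7, "temperature": 56})
coll.delete_by_id(1)
```

That mandatory read is cheap while the working set fits in cache, but once it doesn’t, every delete can turn into a disk seek.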
So the real issue with inserting 50,000 documents per second is that at some point you’ll need to be deleting at that same rate of 50,000 documents per second without disrupting your insert performance. If your “deleter” can’t keep up with your “inserter”, then it will fall further and further behind. And as you’ll see in a moment, it can’t keep up. Unlike my usual Tokutek product vs. Other-Vendor product benchmarks, this benchmark is only intended to show the improvement partitioning makes for TokuMX users. Plus, if your insertion performance is IO bound then your “deleter” can likely keep up with your “inserter”.
My benchmark environment:
- Dell R710, (2) Xeon 5540, 48GB RAM, PERC 6/i (256MB, write-back), 8x10K SAS/RAID 10
- Ubuntu 13.10 (64-bit), XFS file system
- TokuMX v1.5.0, 1GB cache, directIO
I created a customized version of iiBench, with a single insert thread performing batched inserts of 1,000 documents each. This insert thread runs at full speed unless the collection grows to 10% over its target of 20 million documents, at which point it waits for documents to be removed before resuming. Two different data-removal methods are tested:
- “Deleting” : When the collection reaches a size of 20 million documents, a delete thread begins deleting data by _id in batches of 10,000, until the collection is down to 20 million documents or less.
- “Partitioning” : A partition thread watches the collection and creates a new partition every 1 million documents. When the collection reaches a size of 20 million documents, the oldest partition is dropped.
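The partitioning approach can be sketched as a rotation over fixed-size slices. The class below is my own illustration, not the TokuMX implementation, and the numbers are scaled down (1,000-document partitions with a 20,000-document cap, rather than 1 million and 20 million) so the example runs quickly. The key property is the last line of `insert`: dropping the oldest partition is a single O(1) operation, with no per-document deletes or secondary-index look-ups.

```python
from collections import deque

PARTITION_SIZE = 1_000   # scaled-down stand-in for 1 million
MAX_DOCS = 20_000        # scaled-down stand-in for 20 million

class PartitionedCollection:
    def __init__(self):
        self.partitions = deque([[]])   # oldest partition on the left

    def count(self):
        return sum(len(p) for p in self.partitions)

    def insert(self, doc):
        if len(self.partitions[-1]) >= PARTITION_SIZE:
            self.partitions.append([])  # new partition every PARTITION_SIZE docs
        self.partitions[-1].append(doc)
        if self.count() > MAX_DOCS:
            self.partitions.popleft()   # drop oldest slice in one O(1) step

coll = PartitionedCollection()
for i in range(25_000):
    coll.insert({"seq": i})
```

After 25,000 inserts, five partitions (documents 0 through 4,999) have been rotated out and the collection holds exactly its 20,000-document cap.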
At 20 million cumulative inserts the “Deleting” approach suffers a significant performance impact, and eventually the inserts are gated by the deletion performance of around 4,000 deletes per second. The “Partitioning” approach leads to faster overall insertion performance (it’s managing smaller slices of the collection during insertion) and the data removal process has no measurable impact on the running workload.
We believe that adding support for partitioned collections in TokuMX v1.5 is a serious game-changer for many MongoDB use cases, and we encourage you to give it a try. Or if you’d like to keep your data forever, we’re really good at that too. Not to mention that our compression can usually reduce your storage needs by 90%.