Storing time-series data with MongoDB and TokuMX

May 26, 2015

Author

Stephane Combaudon

MongoDB

Storing time-series data is a frequent pattern for databases, be it for logs or for any kind of monitoring. Such data has the following properties: records are inserted but never updated, the insertion rate can be high, and records are likely to expire after some time. MongoDB and TokuMX are both good fits because of their flexible schema. But how can we handle data expiration efficiently? Several options are available: capped collections, TTL collections and partitioning (TokuMX only), each with different features and performance profiles.

Summary

- Capped collections: very good insert performance, but not eligible for sharding and hard to predict when documents will expire.
- TTL collections: expiration date is easy to enforce and compatible with sharding, but the purge of records is inefficient.
- Partitioning: very good performance for inserts and purge, but not compatible with sharding and only available with TokuMX.

Capped collections

To create an app_stats capped collection with a size of 1GB, use the following command:

> db.createCollection( "app_stats", { capped: true, size: 1024*1024*1024 } )

MongoDB will write to this collection in a circular fashion: once the allocated files are full, data at the beginning of the first file is overwritten. This guarantees that your collection will never exceed the size you set. However, predicting the size you will need to store 6 months of data can be tricky.
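As a rough starting point for that prediction, the size can be estimated as insert rate × average document size × retention period. The sketch below is plain JavaScript; the function name and the sample figures (100 inserts per second, 200-byte documents, 180 days) are assumptions for illustration only, and it ignores storage overhead and variations in insert rate, so treat the result as a lower bound.

```javascript
// Rough estimate of the bytes needed to retain `retentionDays` of data.
// All three inputs are assumptions you would measure on your own workload.
function estimateCappedSizeBytes(insertsPerSecond, avgDocBytes, retentionDays) {
  const secondsPerDay = 24 * 60 * 60;
  return Math.ceil(insertsPerSecond * avgDocBytes * retentionDays * secondsPerDay);
}

// Hypothetical workload: 100 inserts/s, ~200 bytes per document, ~6 months.
const sizeBytes = estimateCappedSizeBytes(100, 200, 180);
console.log(sizeBytes); // 311040000000 (about 311 GB)

// The result could then be passed as the `size` option, e.g. in the shell:
//   db.createCollection("app_stats", { capped: true, size: sizeBytes })
```

If the real insert rate turns out higher than the estimate, documents are simply overwritten sooner than expected, which is why the expiration time of a capped collection is hard to guarantee.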

Another benefit of capped collections is that they keep data in insertion order. So you don’t need an extra index to sort data by insertion date, which helps write performance.

Can you update records in a capped collection? Yes, but only as long as the update doesn’t increase the original size of the document. Even then, it is recommended to be light on updates, as you might experience strange errors with secondaries in some rare cases.

Another limitation is that sharding is not supported.

TTL collections

TTL collections take a different approach: they are regular collections with an index that has a special option. For instance:

> db.logs.createIndex({insertDate:1}, {expireAfterSeconds: 86400})

This index makes sure that records expire automatically after one day (86400 seconds). Pretty neat!

And as a TTL collection is a regular collection, you can shard it if you need to.

However the main limitation is the purge process: every minute, a background thread will look for documents that are expired and if it finds some, it will remove them. This is not really different from the application running a cron job every minute to remove old documents: it adds a constant purge workload that can be detrimental to insertion performance.
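To make the comparison with a cron job concrete, here is a minimal sketch of the cutoff computation such a job would perform. The helper names ttlCutoff and isExpired are hypothetical, and the code only illustrates the rule "indexed date plus expireAfterSeconds is in the past"; it is not how the server's background thread is implemented.

```javascript
// Oldest insertDate that is still kept: anything strictly older is expired.
function ttlCutoff(now, expireAfterSeconds) {
  return new Date(now.getTime() - expireAfterSeconds * 1000);
}

// True when the document's insertDate + expireAfterSeconds is in the past.
function isExpired(doc, now, expireAfterSeconds) {
  return doc.insertDate.getTime() < ttlCutoff(now, expireAfterSeconds).getTime();
}

const now = new Date("2015-05-26T12:00:00Z");
const cutoff = ttlCutoff(now, 86400);
console.log(cutoff.toISOString()); // 2015-05-25T12:00:00.000Z

// A cron job doing the same work by hand would run something like this
// in the shell every minute:
//   db.logs.remove({ insertDate: { $lt: cutoff } })
```

Either way, the deletes are ordinary document removals, so they compete with inserts for the same resources.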

Note that TTL indexes do not work with capped collections.

Partitioning (TokuMX)

With TokuMX you can partition your data like you would with MySQL. For instance, if you want to create a collection partitioned on an insertDate field, you would use these commands:

> db.createCollection('stats_part',{primaryKey:{insertDate:1,_id:1}, partitioned: true})
> db.stats_part.addPartition({insertDate:ISODate("2015-05-31T23:59:59")})
> db.stats_part.addPartition({insertDate:ISODate("2015-06-30T23:59:59")})
[...]

The main benefit of partitioning is that expiring data is extremely simple and fast: remove the corresponding partition(s) with the dropPartition() function. Another nice property is that you can be very flexible regarding how large your partitions can grow. This could deserve a whole blog post, but let me give a quick example.
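For monthly partitions like the ones created above, the boundary dates can be generated rather than typed by hand. This is a sketch in plain JavaScript: monthEndBoundaries is a hypothetical helper, and it assumes boundaries are expressed in UTC, at the last second of each month, as in the addPartition() calls shown earlier.

```javascript
// Returns `count` Date objects, each the last second (UTC) of consecutive
// months, starting with the given month (1 = January ... 12 = December).
function monthEndBoundaries(year, month, count) {
  const boundaries = [];
  for (let i = 0; i < count; i++) {
    // Day 0 of the following month is the last day of the wanted month;
    // Date.UTC also rolls month overflow into the next year.
    boundaries.push(new Date(Date.UTC(year, month + i, 0, 23, 59, 59)));
  }
  return boundaries;
}

const boundaries = monthEndBoundaries(2015, 5, 3);
boundaries.forEach(function (b) { console.log(b.toISOString()); });
// 2015-05-31T23:59:59.000Z
// 2015-06-30T23:59:59.000Z
// 2015-07-31T23:59:59.000Z

// In the shell, each boundary could then be passed to addPartition():
//   boundaries.forEach(function (b) { db.stats_part.addPartition({ insertDate: b }) })
```

The same helper works for weekly or daily partitions with a different date step; only the boundary arithmetic changes.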

Suppose you want to expire data after 6 months. Regular MongoDB indexes are B-Trees, so insertion is only fast as long as the indexes fit in memory. If regular MongoDB offered partitioned collections, insertions would only be fast if the partition being written to (the last one) fit in memory, so you would probably have to create a partition for every day. TokuMX has no such limitation: insertion performance stays as good with larger partitions, such as one partition per week or one per month.

Any drawbacks to partitioning? It involves some application overhead, as you need to write the logic that periodically drops old partitions and creates new ones. Sharding support is also limited.
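The "drop old partitions" half of that logic mostly comes down to deciding which partitions are entirely older than the retention window. Below is a minimal sketch of that decision in plain JavaScript. The partition descriptors ({ id, max }) are a hypothetical shape chosen for the example, not the format TokuMX returns, and partitionsToDrop is an illustrative name.

```javascript
// Returns the ids of partitions whose upper boundary is older than the
// retention window, i.e. partitions that contain only expired documents.
// `partitions` must be sorted from oldest to newest.
function partitionsToDrop(partitions, now, retentionDays) {
  const cutoff = now.getTime() - retentionDays * 24 * 60 * 60 * 1000;
  return partitions
    .slice(0, -1) // never drop the last partition: it receives the inserts
    .filter(function (p) { return p.max.getTime() < cutoff; })
    .map(function (p) { return p.id; });
}

// Hypothetical monthly partitions, checked on 2015-05-26 with 180 days kept.
const partitions = [
  { id: 0, max: new Date("2014-10-31T23:59:59Z") },
  { id: 1, max: new Date("2014-11-30T23:59:59Z") },
  { id: 2, max: new Date("2014-12-31T23:59:59Z") },
  { id: 3, max: new Date("2015-06-30T23:59:59Z") }
];
const expired = partitionsToDrop(partitions, new Date("2015-05-26T00:00:00Z"), 180);
console.log(expired); // [ 0 ]

// Each returned id would then be handed to dropPartition() from a
// scheduled job.
```

Note that a partition is only dropped once its newest possible document has expired, so with monthly partitions some data lives up to a month longer than the nominal retention period.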

Conclusion

There is no one-size-fits-all solution when it comes to storing time-series data in MongoDB and TokuMX, but several options with their own benefits and trade-offs. Note that if your collection needs to be sharded, a TTL index is probably the only option. And if you are using TokuMX, partitioning is something to look at if you were planning a capped collection.