Don’t worry about embedding large arrays in your MongoDB documents

In this post, I’d like to discuss some performance problems recently mentioned about MongoDB’s embedded arrays, and how TokuMX avoids these problems and delivers more consistent performance for MongoDB applications.

In “Why shouldn’t I embed large arrays in my documents?”, Asya Kamsky of MongoDB explains why you shouldn’t embed large arrays in your MongoDB documents. It’s a great article and an in-depth study of some of the decisions made in MongoDB, and how those decisions affect its behavior and rules for effective usage.

Asya gives three main reasons why large embedded arrays are harmful:

  1. If the array grows frequently, the containing document grows too, eventually forcing MongoDB to move it to a new location on disk rather than rewrite it in place. It’s common knowledge that MongoDB “document moves” are slow, because every index entry pointing at the document must be updated.

  2. If the array field is indexed, one document in the collection is responsible for a separate entry in that index for each and every element in its array. So inserting or deleting a document with a 100-element array, if that array is indexed, is like inserting or deleting 100 documents in terms of the amount of indexing work required.

  3. Asya doesn’t say this explicitly, but alludes to it: the BSON data format manipulates documents with a linear memory scan, so finding elements all the way at the end of a large array takes a long time, and most operations dealing with such a document would be slow.
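To put a number on point 2, here’s a small Python sketch (illustrative only, with made-up field names; this is not MongoDB’s actual index code) that models a multikey index as one (value, _id) entry per array element:

```python
# Sketch: a multikey index keeps one entry per array element.
# Inserting a document whose indexed field is a 100-element array
# therefore adds 100 index entries, and deleting it removes 100.

def index_entries(doc, field):
    """Return the (value, _id) index entries one document contributes."""
    value = doc[field]
    values = value if isinstance(value, list) else [value]
    return [(v, doc["_id"]) for v in values]

doc = {"_id": 1, "tags": [f"tag{i}" for i in range(100)]}
entries = index_entries(doc, "tags")
print(len(entries))  # 100 index entries for a single document
```

In other words, one insert of this document costs the index about as much as 100 single-document inserts would.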

TokuMX mitigates the first two of these problems with large arrays:

  1. TokuMX uses logical identifiers for documents in secondary indexes. This means that no matter how you update a document, you only incur index maintenance on the fields that are indexed. Documents in TokuMX don’t incur a “move penalty” if they grow, they simply grow in the primary key index and that’s that.

  2. TokuMX has unmatched indexed insertion speed, so it can easily handle indexing even very large arrays. Asya also mentions that all of this indexing work must be done atomically, alluding to the fact that this would block any other operations while the index maintenance is happening, but TokuMX supports concurrent reads and writes. Indexing a large array is a strange thing to do and most of the time should probably be modeled differently, but it’s nice to know that TokuMX can handle it if you need it.
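A rough sketch of the difference in point 1, using invented structures (not actual TokuMX or MongoDB internals): a secondary index whose entries store a physical record location must rewrite every matching entry when a document moves, while one whose entries store the primary key (a logical identifier) is untouched:

```python
# Hypothetical illustration: physical references break when a document
# moves on disk; logical (primary-key) references do not.

physical_index = {("color", "red"): "extent0:offset128"}  # value -> disk location
logical_index = {("color", "red"): 42}                    # value -> primary key

def move_document(old_loc, new_loc, index):
    """Count how many index entries must be rewritten after a move."""
    rewrites = 0
    for key, ref in index.items():
        if ref == old_loc:
            index[key] = new_loc
            rewrites += 1
    return rewrites

print(move_document("extent0:offset128", "extent3:offset0", physical_index))  # 1
print(move_document("extent0:offset128", "extent3:offset0", logical_index))   # 0
```

With many secondary indexes (or a large indexed array), the physical-reference scheme multiplies the cost of every move; the logical scheme pays nothing, which is why TokuMX documents can grow without a move penalty.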

TokuMX doesn’t change the BSON layout of documents in memory, so manipulations of large arrays can still be slow, but certainly no slower than the same manipulations in MongoDB. If you need to do these kinds of slow calculations, TokuMX eliminates the problematic db-level write lock, so you get better concurrency than in MongoDB, and these slow operations won’t affect other clients on the system.

We think embedded arrays are a great feature of the data model that contributes to the wonderful productivity developers can achieve with MongoDB. With TokuMX, you can use them fearlessly, stop worrying about database performance, and just make great products.



Comments (4)

  • Mark Callaghan

    How large is too large? It is kind of a letdown that a far-from-stellar database engine in MongoDB makes it harder to use document data models.

    February 17, 2014 at 5:10 pm
    • Leif Walsh

      The short answer is that it’s pretty workload-dependent. These are all just multiplicative factors on top of existing pain points; there isn’t a cliff you fall off if you have one too many elements in your array.

      If you’re talking about document moves causing extra index maintenance on unindexed updates, it only really matters if you are growing the array frequently with $push or $addToSet. And there, the rate of document growth matters more than the actual number of elements in the average document at a given time. There’s a tradeoff to be made here too: you can increase the padding factor, which reduces the frequency of document moves but increases fragmentation of data on disk (and in memory, since everything is mmapped).
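      That padding tradeoff can be sketched numerically (made-up sizes and a simplified allocator, not mmapv1’s actual behavior): a larger padding factor means far fewer moves as the array grows, at the cost of more allocated-but-unused space:

```python
def count_moves(pushes, element_size, padding_factor, initial_size=64):
    """Simulate $push growth: a document 'moves' whenever it outgrows
    its allocated slot; each move allocates size * padding_factor."""
    size, allocated, moves = initial_size, initial_size, 0
    for _ in range(pushes):
        size += element_size
        if size > allocated:
            allocated = int(size * padding_factor)
            moves += 1
    return moves

# 1000 pushes of 16-byte elements: no padding vs. 2x padding.
print(count_moves(1000, 16, 1.0))  # moves on every push
print(count_moves(1000, 16, 2.0))  # only a handful of moves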

      If you’re talking about indexing an array field, it depends on your application’s throughput and latency requirements. A good way to model it is by equating a single insert containing an n-element array with a single insert of a document into a collection with n indexes. For TokuMX the array is a little bit easier since all the inserts will be in the same index file instead of single inserts into n files (which would be worse for the next checkpoint), but for MongoDB it’s basically equivalent. Most guidelines I’ve found say that 10 elements in a MongoDB array might be ok, 100 is probably too much. TokuMX can probably handle 100-element indexed arrays fine, 1000 might be alright, 10000 is probably pushing it, but again, it’s just a multiplicative factor on top of your application’s throughput so YMMV.

      If you’re talking about the problem of finding things in the BSON serialized format (in particular with the $push and $addToSet operators), again it’s a question of how much work you want to get done, but 100 is probably ok, 100000 is definitely too much, and there’s a continuum in the middle. The perf tool can usually find problems with searching through long BSON arrays.
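      A hedged sketch of why $addToSet-style operations degrade with array length (plain Python, not the server’s BSON code): each insert has to scan the existing elements for a duplicate, so building an n-element set costs on the order of n²/2 comparisons:

```python
def add_to_set(array, value):
    """Append value only if absent, counting element comparisons
    (a stand-in for the linear scan over a serialized BSON array)."""
    comparisons = 0
    for element in array:
        comparisons += 1
        if element == value:
            return comparisons  # already present; nothing appended
    array.append(value)
    return comparisons

arr, total = [], 0
for i in range(1000):
    total += add_to_set(arr, i)
print(total)  # n*(n-1)/2 = 499500 comparisons for 1000 unique elements
```

At 100 elements the scan is cheap; at 100000 every single push is rescanning a huge buffer, which is the continuum described above.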

      February 17, 2014 at 6:38 pm
  • Robin Wieruch

    I recently wrote a blog post about growing arrays in MongoDB. It shows some benchmarking results. Maybe it is interesting for your readers:

    October 28, 2014 at 4:52 am
    • Tim Callaghan

      Thanks for the heads up. Our array “story” is about indexed arrays. Since TokuMX can easily handle secondary index maintenance, indexing a lot of array values isn’t a problem.

      October 30, 2014 at 9:08 am
