William Zola over at MongoDB gave a great talk called “The (Only) Three Reasons for Slow MongoDB Performance”. It reminded me of an interesting characteristic of updates in MongoDB. Because MongoDB’s main data store is a flat file and secondary indexes store offsets into the flat file (as I explain here), if the location of a document changes, corresponding entries in secondary indexes must also change. So, an update to an unindexed field that causes the document to move also causes modifications to every secondary index, which, as William points out, can be expensive. If a document has an indexed array field, this problem is exacerbated, because each array element has its own entry in that index.
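To make the cost concrete, here is a toy model (not MongoDB's actual on-disk format, and the field names are made up) of a heap file whose secondary index stores byte offsets. An update to an unindexed field that grows the document forces a move, which in turn forces the secondary index entry to be rewritten:

```python
class HeapStore:
    def __init__(self):
        self.heap = {}          # offset -> document (stands in for the flat file)
        self.next_offset = 0
        self.secondary = {}     # indexed field value -> offset into the heap

    def insert(self, doc):
        off = self.next_offset
        self.heap[off] = doc
        self.next_offset += len(str(doc))   # pretend documents occupy bytes
        self.secondary[doc["name"]] = off   # secondary index on "name"
        return off

    def grow(self, off, field, value):
        """Update an unindexed field so the document no longer fits in place."""
        doc = self.heap.pop(off)
        doc[field] = value
        new_off = self.next_offset          # relocate to the end of the heap
        self.heap[new_off] = doc
        self.next_offset += len(str(doc))
        # The document moved, so the secondary index entry must be rewritten
        # even though the indexed field "name" itself never changed.
        self.secondary[doc["name"]] = new_off
        return new_off

store = HeapStore()
off = store.insert({"name": "alice", "bio": "hi"})
new_off = store.grow(off, "bio", "a much longer biography " * 10)
assert new_off != off
assert store.secondary["alice"] == new_off   # the index entry had to change
```

With a multikey (array) index, that last write in `grow` would be one write per array element rather than one per index.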
What interests me about this problem is the underlying cause: the decision to make the main data store a heap and use offsets into this heap as the row identifier in secondary indexes. Other databases such as TokuMX, InnoDB, and TokuDB do not have this problem because of the row identifier those systems use. Their main data store is a clustering index (the primary key in TokuDB and InnoDB, the _id index in TokuMX). So, if data gets shuffled within the clustering index, the row identifiers stored in secondary indexes do not change.
What is basically happening here is that the index is providing a layer of abstraction for accessing the data and preventing this problem.
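The same toy exercise under a clustered layout shows the abstraction at work: secondary indexes store the _id, not a physical location, so the same document-growing update leaves them untouched. (Again, this is a sketch, not a real storage engine.)

```python
class ClusteredStore:
    def __init__(self):
        self.primary = {}       # _id -> document (stands in for the clustering index)
        self.secondary = {}     # indexed field value -> _id (the row identifier)

    def insert(self, doc):
        self.primary[doc["_id"]] = doc
        self.secondary[doc["name"]] = doc["_id"]

    def grow(self, _id, field, value):
        # However the clustering index rebalances or relocates the row
        # internally, the row identifier (_id) is unchanged, so no
        # secondary index entry needs to be modified.
        self.primary[_id][field] = value

store = ClusteredStore()
store.insert({"_id": 1, "name": "alice", "bio": "hi"})
before = dict(store.secondary)
store.grow(1, "bio", "a much longer biography " * 10)
assert store.secondary == before   # secondary index untouched
```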
So this got me thinking, what are the benefits and drawbacks to these respective approaches? In doing so, I kept the following important fact in mind: MongoDB is schemaless, and therefore, we ought to expect documents of varying sizes.
The benefits I see to using a clustering index are:
Queries on large data require less I/O. If your data is large enough that the _id index in MongoDB does not fit in memory, then a lookup by _id may require two I/Os: one to find the row identifier and another to look up the row itself. With a clustering index, only one I/O is required.
Unindexed updates that grow documents never cause secondary indexes to be modified. As mentioned above, if the clustering index needs to shuffle documents or rows around, the secondary indexes do not need to change because the row identifier does not change. For MongoDB, document movement can become painful. As a result, users try their best to avoid document movement by pre-padding documents. But this workaround increases disk usage, and disk space may be limited, particularly on SSDs. Given that one of MongoDB’s biggest strengths is the flexibility one gets with a document store, one may expect documents to grow in unpredictable ways, so pre-padding may be unsuitable.
Range queries on the clustering index are fast. Because the entire document is stored with the key, the disk may be accessed sequentially, and the clustering index acts as well as a covering index. With MongoDB (and coincidentally MyISAM), point queries into the main data store may still be required to retrieve the documents.
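A small sketch of a range query under each layout (toy in-memory structures, not real storage engines) makes the extra point lookups visible: the clustered scan yields full documents directly, while the heap layout pays one additional fetch per matching document.

```python
# Clustered layout: key -> full document, scanned in key order.
clustered = {k: {"k": k, "v": k * 10} for k in range(100)}

# Heap layout: documents in a flat list, plus a secondary index of offsets.
heap = [{"k": k, "v": k * 10} for k in range(100)]
index = {doc["k"]: pos for pos, doc in enumerate(heap)}

def clustered_range(lo, hi):
    # One ordered pass over the clustering index returns the documents.
    return [clustered[k] for k in sorted(clustered) if lo <= k < hi]

def heap_range(lo, hi):
    out, extra_fetches = [], 0
    for k in sorted(index):
        if lo <= k < hi:
            out.append(heap[index[k]])   # point lookup into the heap
            extra_fetches += 1
    return out, extra_fetches

docs, fetches = heap_range(10, 20)
assert docs == clustered_range(10, 20)
assert fetches == 10   # one heap fetch per matching document
```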
The benefits I see to not using a clustering index are the following:
In-memory non-covering range queries may be quite fast. If the query needs to touch disk, then I/O (be it sequential or random) can be the most expensive portion of your query. If the query is covering, then the heap is not touched. But if the query is in-memory and non-covering, then the performance bottleneck can be retrieving the full document from the main data store. With TokuMX, InnoDB, and TokuDB, this requires doing a search within a B-Tree or fractal tree, and that requires computation. But all MongoDB needs to do is find an offset into a file and it is done. This can be quite fast.
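The contrast above can be sketched in a few lines: once the secondary index has produced a row identifier, a heap offset is a direct jump, while a clustering index must be searched by key. A sorted array with binary search stands in here for the B-tree or fractal tree search.

```python
import bisect

# Heap layout: the row identifier is a position; the fetch is O(1).
heap = [{"_id": i, "v": i * i} for i in range(1_000)]

def heap_fetch(offset):
    return heap[offset]

# Clustered layout: the row identifier is the _id; a sorted-array
# stand-in for a B-tree / fractal tree needs an O(log n) search.
keys = [d["_id"] for d in heap]

def clustered_fetch(_id):
    i = bisect.bisect_left(keys, _id)
    return heap[i]

assert heap_fetch(42) == clustered_fetch(42)
```

Both return the same document; the heap version simply skips the search computation, which is exactly the advantage once everything is in memory.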
_id index stays in-memory longer, helping random insertions. The _id index is a unique index that requires lookups to verify uniqueness. If one is randomly inserting into the _id index, and the _id index fits in memory, this lookup is fast. But if the _id index is not in memory, then the lookup may require disk access and may be slow. Note that this only happens if the _id is populated by the user. Auto-generated _id fields (the default in MongoDB and TokuMX) have a right-most insertion pattern. By not clustering the _id index, one can keep the _id index in-memory longer, which may mean being able to store more data on a single box before being forced to shard. However, for InnoDB or TokuMX, there is a simple workaround. Let the _id be auto-generated, and create another unique, non-clustering secondary index on the field that would otherwise have been the _id. This maintains the same uniqueness invariant while keeping the same small working set as vanilla MongoDB for this workload.
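The workaround can be sketched as follows (names here are hypothetical, not a real driver API): the _id auto-increments, so inserts into the clustering index always land right-most, while a separate unique secondary index enforces uniqueness of the user-supplied key.

```python
import itertools

_auto_id = itertools.count(1)
primary = {}         # _id -> document (the clustering index)
user_key_index = {}  # user key -> _id (unique, non-clustering secondary index)

def insert(user_key, doc):
    # Uniqueness is checked against the small secondary index, not
    # against a clustering index holding full documents.
    if user_key in user_key_index:
        raise ValueError("duplicate key: %r" % user_key)
    _id = next(_auto_id)   # sequential, so inserts land right-most
    doc = dict(doc, _id=_id, user_key=user_key)
    primary[_id] = doc
    user_key_index[user_key] = _id
    return _id

insert("alice@example.com", {"n": 1})
try:
    insert("alice@example.com", {"n": 2})   # rejected: duplicate user key
except ValueError:
    pass
assert len(primary) == 1
```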
Scanning the entire collection is fast. A collection scan does not require the documents to be in any order, so MongoDB (and MyISAM in MySQL) can scan their flat file very efficiently. TokuMX needs to iterate over an index, which costs more computation.
So, as far as I can tell, the only real advantage to using a flat file instead of a clustering index as the main data store is the speed of in-memory non-covering secondary range queries and table scans. I see other scenarios as being “a wash”, such as non-covering out-of-memory secondary range queries, or writes, or covering range queries.
If there are other reasons why a flat file would be better than a clustering index, please leave a comment.