We just released version 1.4.0 of TokuMX, our high-performance distribution of MongoDB. There are a lot of improvements in this version (see the release notes), the most of any release yet. In this series of blog posts, we describe the most interesting changes and how they’ll affect users.

MongoDB doesn’t have a “primary key” the way SQL databases do. In vanilla MongoDB, all documents are stored in arbitrary order in a heap, and the “_id” index and all secondary indexes point into that heap. In TokuMX, the documents are clustered with the _id index, and all secondary indexes store a copy of the _id for each document so they can look up the full document in the _id index for non-covering queries.

In this way, TokuMX collections sort of have a primary key, but it’s always the _id index. It’s just a clustering index that non-clustering secondary indexes use to reference the full document.

In TokuMX 1.4.0, we are introducing a new feature that makes the primary key user-definable, so it doesn’t always have to be the index on {_id: 1}. By setting the primaryKey field of a collection create command, you can define the primary key to be any compound key.

To ensure uniqueness, we require that the end of the key is “_id: 1”, and we automatically create an additional unique, non-clustering index on {_id: 1}. This way, the primary key will be clustering and unique, but we’ll only do the unique checks on the non-clustering _id index, where it will be inexpensive if you allow TokuMX to auto-generate values for _id.
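The suffix rule is easy to check on the client before issuing a create command. The sketch below is illustrative only: `endsWithId` is a hypothetical helper, not a TokuMX API, and it assumes the key spec is written as a plain JavaScript object whose field order is the key order.

```javascript
// Hypothetical helper (not part of TokuMX): returns true when a primary key
// spec satisfies the rule above, i.e. its last field is `_id` ascending.
function endsWithId(primaryKey) {
  // Object.keys preserves insertion order for non-numeric field names.
  const fields = Object.keys(primaryKey);
  if (fields.length === 0) {
    return false;
  }
  const last = fields[fields.length - 1];
  return last === "_id" && primaryKey[last] === 1;
}

console.log(endsWithId({ a: 1, _id: 1 })); // true
console.log(endsWithId({ _id: 1, a: 1 })); // false
console.log(endsWithId({ a: 1 }));         // false
```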

The primary key defines the default sort order (“$natural” order) of the collection, and essentially lets you have a clustering key without always having a second clustering index on the _id, if you don’t need it and want to save on storage costs. Keep in mind, though, that the primary key for each document appears as a reference in every non-clustering secondary index, so if you insert documents with large fields in the primary key, that will make all your secondary indexes bigger. Also, you won’t be able to save any documents where the fields in the primary key are arrays or regexes.
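To make the ordering concrete, here is a toy model of how a primary key of {a: 1, _id: 1} would order documents, together with the restriction on array and regex values. It assumes key values are plain numbers or strings and is not how TokuMX compares BSON internally; `checkKeyFields` and `comparePrimaryKey` are made-up names for this sketch.

```javascript
// Toy model only: real BSON comparison rules are richer than this.
const primaryKey = { a: 1, _id: 1 };

// Reject documents whose primary key fields hold arrays or regexes.
function checkKeyFields(doc, key) {
  for (const field of Object.keys(key)) {
    const value = doc[field];
    if (Array.isArray(value) || value instanceof RegExp) {
      throw new Error(`field "${field}" in the primary key cannot be an array or regex`);
    }
  }
  return doc;
}

// Compare two documents field by field, in primary key order.
function comparePrimaryKey(x, y, key) {
  for (const [field, direction] of Object.entries(key)) {
    if (x[field] < y[field]) return -direction;
    if (x[field] > y[field]) return direction;
  }
  return 0;
}

const docs = [
  { _id: 3, a: 2 },
  { _id: 1, a: 5 },
  { _id: 2, a: 2 },
].map((doc) => checkKeyFields(doc, primaryKey));

// Sorting by the primary key gives the modeled "$natural" order.
docs.sort((x, y) => comparePrimaryKey(x, y, primaryKey));
console.log(docs.map((d) => d._id)); // [ 2, 3, 1 ]
```

Documents are ordered by `a` first, and `_id` only breaks ties, which is why the document with `_id: 1` comes last here.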

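To see it in action, create a collection with the primaryKey option set to a compound key that ends in “_id: 1”, then list the collection’s indexes. You should see the clustering primary key alongside the automatically created unique index on {_id: 1}.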
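A command along these lines might look like the following sketch. The collection name `foo` and the key field `a` are placeholders, `buildCreateCommand` is a hypothetical helper, and the shell calls in the trailing comments assume the `primaryKey` option is accepted by `db.createCollection` as well as by the raw create command. The runnable part only builds the command document.

```javascript
// Builds the document for a collection `create` command with a user-defined
// primary key. `foo` and `a` are placeholder names for illustration.
function buildCreateCommand(collectionName, primaryKey) {
  return { create: collectionName, primaryKey: primaryKey };
}

const command = buildCreateCommand("foo", { a: 1, _id: 1 });
console.log(JSON.stringify(command));
// {"create":"foo","primaryKey":{"a":1,"_id":1}}

// In a TokuMX shell, the equivalent calls would presumably be:
//   db.runCommand(command)
//   db.createCollection("foo", { primaryKey: { a: 1, _id: 1 } })
// followed by db.foo.getIndexes() to list the resulting indexes.
```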

Later this week, I’ll explain why we added this feature and what else it’s used for in TokuMX 1.4.

dorian

So the documents are stored sorted by primary-key and it will be very-fast-good for collections with only-primary-keys(no secondary indexes) ?

Since you use compression, and documents are compressed in blocks, do you store the index separate from the document_blocks, or do you know for each block the first/last document_id and then decompress the block on SELECT queries to see if the documents we want are inside?

dorian

Cassandra saw the same thing: you increase performance by compressing because you write/read less data.

But it looks like “_id” must be unique and you have +1 index here: “and we automatically create an additional unique, non-clustering index on {_id: 1}”?