In this article I’m going to talk about partial and sparse indexes in MongoDB® and Percona Server for MongoDB®. I’ll show you how to use them, and look at cases where they can be helpful. Prior to discussing these indexes in MongoDB in detail, though, let’s talk about an issue on a relational database like MySQL®.
Consider you have a very large table in MySQL with a boolean column. Typically you created a ENUM(‘T’,’F’) field to store the boolean information or a TINYINT column to store only 1s and 0s. This is good so far. But think now what you can do if you need to run a lot of queries on the table, with a condition on the boolean field, and no other relevant conditions on other indexed columns are used to filter the examined rows.
Why not create and index on the boolean field? Well, yes, you can, but in some cases this solution will be completely useless and will introduce an overhead for the index maintenance.
Think about if you have an even distribution of true and false values in the table, in more or less a 50:50 split. In this situation, the index on the boolean column cannot be used because MySQL will prefer to do a full scan of the large table instead of selecting half of rows using the BTREE entries. We can say that a boolean field like this one has a low cardinality, and it’s not highly selective.
Consider now the case in which you don’t have an even distribution of the values, let’s say 2% of the rows contain false and the remaining 98% contain true. In such a situation, a query to select the false values will most probably use the index. The queries to select the true values won’t use the index, for the same reason we have discussed previously. In this second case the index is very useful, but only for selecting the great minority of rows. The remaining 98% of the entries in the index are completely useless. This represents a great waste of disk space and resources, because the index must be maintained for each write.
It’s not just booleans that can have this problem in relation to index usage, but any field with a low cardinality.
Note: there are several workarounds to deal with this problem, I know. For example, you can create a multi-column index using a more selective field and the boolean. Or you could design your database differently. Here, I’m illustrating the nature of the problem in order to explain a MongoDB feature in a context.
How about MongoDB? Does MongoDB have the same problem? The answer is: yes, MongoDB has the same problem. If you have a lot of documents in a collection with a boolean field or a low cardinality field, and you create an index on it, then you will have a very large index that’s not really useful. But more importantly you will have writes degradation for the index maintenance.
The only difference is that MongoDB will tend to use the index anyway, instead of doing the entire collection scan, but the execution time will be of the same magnitude as doing the COLLSCAN. In the case of very large indexes, a COLLSCAN should be preferable.
Fortunately MongoDB has an option that you can specify during index creation to define a Partial Index. Let’s see.
A partial index is an index that contains only a subset of values based on a filter rule. So, in the case of the unevenly distributed boolean field, we can create an index on it specifying that we want to consider only the false values. This way we avoid recording the remaining 98% of useless true entries. The index will be smaller, we’ll save disk and memory space, and the most frequent writes – when entering the true values – won’t initiate the index management activity. As a result, we won’t have lots of penalties during writes but we’ll have a useful index when searching the false values.
Let’s say that, when you have an uneven distribution, the most relevant searches are the ones for the minority of the values. This is in general the scenario for real applications.
Let’s see now how to create a Partial Index.
First, let’s create a collection with one million random documents. Each document contains a boolean field generated by the javascript function randomBool(). The function generates a false value in 5% of the documents, in order to have an uneven distribution. Then, test the number of false values in the collection.
|
1 |
> function randomBool() { var bool = true; var random_boolean = Math.random() >= 0.95; if(random_boolean) { bool = false }; return bool; }<br>> for (var i = 1; i <= 1000000; i++) { db.test.insert( { _id: i, name: "name"+i, flag: randomBool() } ) }<br>WriteResult({ "nInserted" : 1 })<br>> db.test.find().count()<br>1000000<br>> db.test.find( { flag: false } ).count()<br>49949 |
Create the index on the flag field and look at the index size using db.test.stats().
|
1 |
> db.test.createIndex( { flag: 1 } ) <br>{ "createdCollectionAutomatically" : false, <br>"numIndexesBefore" : 1, <br>"numIndexesAfter" : 2, <br>"ok" : 1 } <br>> db.test.stats().indexSizes <br>{ "_id_" : 13103104, "flag_1" : 4575232 } |
The index we created is 4575232 bytes.
Test some simple queries to extract the documents based on the flag value and take a look at the index usage and the execution times. (For this purpose, we use an explainable object)
|
1 |
// create the explainable object<br>> var exp = db.test.explain( "executionStats" )<br><br>// explain the complete collection scan <br>> exp.find( { } )<br>{<br> "queryPlanner" : {<br> "plannerVersion" : 1,<br> "namespace" : "test.test",<br> "indexFilterSet" : false,<br> "parsedQuery" : {<br><br> },<br> "winningPlan" : {<br> "stage" : "COLLSCAN",<br> "direction" : "forward"<br> },<br> "rejectedPlans" : [ ]<br> },<br> "executionStats" : {<br> "executionSuccess" : true,<br> "nReturned" : 1000000,<br> "executionTimeMillis" : 250,<br> "totalKeysExamined" : 0,<br> "totalDocsExamined" : 1000000,<br> "executionStages" : {<br> "stage" : "COLLSCAN",<br> "nReturned" : 1000000,<br> "executionTimeMillisEstimate" : 200,<br> "works" : 1000002,<br> "advanced" : 1000000,<br> "needTime" : 1,<br> "needYield" : 0,<br> "saveState" : 7812,<br> "restoreState" : 7812,<br> "isEOF" : 1,<br> "invalidates" : 0,<br> "direction" : "forward",<br> "docsExamined" : 1000000<br> }<br> },<br> "serverInfo" : {<br> "host" : "ip-172-30-2-181",<br> "port" : 27017,<br> "version" : "4.0.4",<br> "gitVersion" : "f288a3bdf201007f3693c58e140056adf8b04839"<br> },<br> "ok" : 1<br>}<br><br>// find the documents flag=true<br>> exp.find( { flag: true } )<br>{<br> "queryPlanner" : {<br> "plannerVersion" : 1,<br> "namespace" : "test.test",<br> "indexFilterSet" : false,<br> "parsedQuery" : {<br> "flag" : {<br> "$eq" : true<br> }<br> },<br> "winningPlan" : {<br> "stage" : "FETCH",<br> "inputStage" : {<br> "stage" : "IXSCAN",<br> "keyPattern" : {<br> "flag" : 1<br> },<br> "indexName" : "flag_1",<br> "isMultiKey" : false,<br> "multiKeyPaths" : {<br> "flag" : [ ]<br> },<br> "isUnique" : false,<br> "isSparse" : false,<br> "isPartial" : false,<br> "indexVersion" : 2,<br> "direction" : "forward",<br> "indexBounds" : {<br> "flag" : [<br> "[true, true]"<br> ]<br> }<br> }<br> },<br> "rejectedPlans" : [ ]<br> },<br> "executionStats" : {<br> "executionSuccess" : true,<br> "nReturned" : 950051,<br> "executionTimeMillis" : 1028,<br> "totalKeysExamined" : 950051,<br> "totalDocsExamined" : 950051,<br> "executionStages" : {<br> "stage" : "FETCH",<br> "nReturned" : 950051,<br> "executionTimeMillisEstimate" : 990,<br> "works" : 950052,<br> "advanced" : 950051,<br> "needTime" : 0,<br> "needYield" : 0,<br> "saveState" : 7422,<br> "restoreState" : 7422,<br> "isEOF" : 1,<br> "invalidates" : 0,<br> "docsExamined" : 950051,<br> "alreadyHasObj" : 0,<br> "inputStage" : {<br> "stage" : "IXSCAN",<br> "nReturned" : 950051,<br> "executionTimeMillisEstimate" : 350,<br> "works" : 950052,<br> "advanced" : 950051,<br> "needTime" : 0,<br> "needYield" : 0,<br> "saveState" : 7422,<br> "restoreState" : 7422,<br> "isEOF" : 1,<br> "invalidates" : 0,<br> "keyPattern" : {<br> "flag" : 1<br> },<br> "indexName" : "flag_1",<br> "isMultiKey" : false,<br> "multiKeyPaths" : {<br> "flag" : [ ]<br> },<br> "isUnique" : false,<br> "isSparse" : false,<br> "isPartial" : false,<br> "indexVersion" : 2,<br> "direction" : "forward",<br> "indexBounds" : {<br> "flag" : [<br> "[true, true]"<br> ]<br> },<br> "keysExamined" : 950051,<br> "seeks" : 1,<br> "dupsTested" : 0,<br> "dupsDropped" : 0,<br> "seenInvalidated" : 0<br> }<br> }<br> },<br> "serverInfo" : {<br> "host" : "ip-172-30-2-181",<br> "port" : 27017,<br> "version" : "4.0.4",<br> "gitVersion" : "f288a3bdf201007f3693c58e140056adf8b04839"<br> },<br> "ok" : 1<br>}<br><br>// find the documents with flag=false<br>> exp.find( { flag: false } )<br>{<br> "queryPlanner" : {<br> "plannerVersion" : 1,<br> "namespace" : "test.test",<br> "indexFilterSet" : false,<br> "parsedQuery" : {<br> "flag" : {<br> "$eq" : false<br> }<br> },<br> "winningPlan" : {<br> "stage" : "FETCH",<br> "inputStage" : {<br> "stage" : "IXSCAN",<br> "keyPattern" : {<br> "flag" : 1<br> },<br> "indexName" : "flag_1",<br> "isMultiKey" : false,<br> "multiKeyPaths" : {<br> "flag" : [ ]<br> },<br> "isUnique" : false,<br> "isSparse" : false,<br> "isPartial" : false,<br> "indexVersion" : 2,<br> "direction" : "forward",<br> "indexBounds" : {<br> "flag" : [<br> "[false, false]"<br> ]<br> }<br> }<br> },<br> "rejectedPlans" : [ ]<br> },<br> "executionStats" : {<br> "executionSuccess" : true,<br> "nReturned" : 49949,<br> "executionTimeMillis" : 83,<br> "totalKeysExamined" : 49949,<br> "totalDocsExamined" : 49949,<br> "executionStages" : {<br> "stage" : "FETCH",<br> "nReturned" : 49949,<br> "executionTimeMillisEstimate" : 70,<br> "works" : 49950,<br> "advanced" : 49949,<br> "needTime" : 0,<br> "needYield" : 0,<br> "saveState" : 390,<br> "restoreState" : 390,<br> "isEOF" : 1,<br> "invalidates" : 0,<br> "docsExamined" : 49949,<br> "alreadyHasObj" : 0,<br> "inputStage" : {<br> "stage" : "IXSCAN",<br> "nReturned" : 49949,<br> "executionTimeMillisEstimate" : 10,<br> "works" : 49950,<br> "advanced" : 49949,<br> "needTime" : 0,<br> "needYield" : 0,<br> "saveState" : 390,<br> "restoreState" : 390,<br> "isEOF" : 1,<br> "invalidates" : 0,<br> "keyPattern" : {<br> "flag" : 1<br> },<br> "indexName" : "flag_1",<br> "isMultiKey" : false,<br> "multiKeyPaths" : {<br> "flag" : [ ]<br> },<br> "isUnique" : false,<br> "isSparse" : false,<br> "isPartial" : false,<br> "indexVersion" : 2,<br> "direction" : "forward",<br> "indexBounds" : {<br> "flag" : [<br> "[false, false]"<br> ]<br> },<br> "keysExamined" : 49949,<br> "seeks" : 1,<br> "dupsTested" : 0,<br> "dupsDropped" : 0,<br> "seenInvalidated" : 0<br> }<br> }<br> },<br> "serverInfo" : {<br> "host" : "ip-172-30-2-181",<br> "port" : 27017,<br> "version" : "4.0.4",<br> "gitVersion" : "f288a3bdf201007f3693c58e140056adf8b04839"<br> },<br> "ok" : 1<br>} |
As expected, MongoDB does a COLLSCAN when looking for db.test.find( {} ). The important thing here is that it takes 250 milliseconds for the entire collection scan.
In both the other cases – find ({flag:true}) and find({flag:false}) – MongoDB uses the index. But let’s have a look at the execution times:
Now, create the partial index on the flag field. To do it we must use the PartialFilterExpression option on the createIndex command.
|
1 |
// drop the existing index<br>> db.test.dropIndex( { flag: 1} )<br>{ "nIndexesWas" : 2, "ok" : 1 }<br><br>// create the partial index only on the false values<br>> db.test.createIndex( { flag : 1 }, { partialFilterExpression : { flag: false } } )<br>{<br> "createdCollectionAutomatically" : false,<br> "numIndexesBefore" : 1,<br> "numIndexesAfter" : 2,<br> "ok" : 1<br>}<br><br>// get the index size <br>> db.test.stats().indexSizes<br>{ "_id_" : 13103104, "flag_1" : 278528 }<br><br>// create the explainalbe object<br>> var exp = db.test.explain( "executionStats" )<br><br>// test the query for flag=false<br>> exp.find({ flag: false })<br>{<br> "queryPlanner" : {<br> "plannerVersion" : 1,<br> "namespace" : "test.test",<br> "indexFilterSet" : false,<br> "parsedQuery" : {<br> "flag" : {<br> "$eq" : false<br> }<br> },<br> "winningPlan" : {<br> "stage" : "FETCH",<br> "inputStage" : {<br> "stage" : "IXSCAN",<br> "keyPattern" : {<br> "flag" : 1<br> },<br> "indexName" : "flag_1",<br> "isMultiKey" : false,<br> "multiKeyPaths" : {<br> "flag" : [ ]<br> },<br> "isUnique" : false,<br> "isSparse" : false,<br> "isPartial" : true,<br> "indexVersion" : 2,<br> "direction" : "forward",<br> "indexBounds" : {<br> "flag" : [<br> "[false, false]"<br> ]<br> }<br> }<br> },<br> "rejectedPlans" : [ ]<br> },<br> "executionStats" : {<br> "executionSuccess" : true,<br> "nReturned" : 49949,<br> "executionTimeMillis" : 80,<br> "totalKeysExamined" : 49949,<br> "totalDocsExamined" : 49949,<br> "executionStages" : {<br> "stage" : "FETCH",<br> "nReturned" : 49949,<br> "executionTimeMillisEstimate" : 80,<br> "works" : 49950,<br> "advanced" : 49949,<br> "needTime" : 0,<br> "needYield" : 0,<br> "saveState" : 390,<br> "restoreState" : 390,<br> "isEOF" : 1,<br> "invalidates" : 0,<br> "docsExamined" : 49949,<br> "alreadyHasObj" : 0,<br> "inputStage" : {<br> "stage" : "IXSCAN",<br> "nReturned" : 49949,<br> "executionTimeMillisEstimate" : 40,<br> "works" : 49950,<br> "advanced" : 49949,<br> "needTime" : 0,<br> "needYield" : 0,<br> "saveState" : 390,<br> "restoreState" : 390,<br> "isEOF" : 1,<br> "invalidates" : 0,<br> "keyPattern" : {<br> "flag" : 1<br> },<br> "indexName" : "flag_1",<br> "isMultiKey" : false,<br> "multiKeyPaths" : {<br> "flag" : [ ]<br> },<br> "isUnique" : false,<br> "isSparse" : false,<br> "isPartial" : true,<br> "indexVersion" : 2,<br> "direction" : "forward",<br> "indexBounds" : {<br> "flag" : [<br> "[false, false]"<br> ]<br> },<br> "keysExamined" : 49949,<br> "seeks" : 1,<br> "dupsTested" : 0,<br> "dupsDropped" : 0,<br> "seenInvalidated" : 0<br> }<br> }<br> },<br> "serverInfo" : {<br> "host" : "ip-172-30-2-181",<br> "port" : 27017,<br> "version" : "4.0.4",<br> "gitVersion" : "f288a3bdf201007f3693c58e140056adf8b04839"<br> },<br> "ok" : 1<br>}<br><br>// test the query for flag=true<br>> exp.find({ flag: true })<br>{<br> "queryPlanner" : {<br> "plannerVersion" : 1,<br> "namespace" : "test.test",<br> "indexFilterSet" : false,<br> "parsedQuery" : {<br> "flag" : {<br> "$eq" : true<br> }<br> },<br> "winningPlan" : {<br> "stage" : "COLLSCAN",<br> "filter" : {<br> "flag" : {<br> "$eq" : true<br> }<br> },<br> "direction" : "forward"<br> },<br> "rejectedPlans" : [ ]<br> },<br> "executionStats" : {<br> "executionSuccess" : true,<br> "nReturned" : 950051,<br> "executionTimeMillis" : 377,<br> "totalKeysExamined" : 0,<br> "totalDocsExamined" : 1000000,<br> "executionStages" : {<br> "stage" : "COLLSCAN",<br> "filter" : {<br> "flag" : {<br> "$eq" : true<br> }<br> },<br> "nReturned" : 950051,<br> "executionTimeMillisEstimate" : 210,<br> "works" : 1000002,<br> "advanced" : 950051,<br> "needTime" : 49950,<br> "needYield" : 0,<br> "saveState" : 7812,<br> "restoreState" : 7812,<br> "isEOF" : 1,<br> "invalidates" : 0,<br> "direction" : "forward",<br> "docsExamined" : 1000000<br> }<br> },<br> "serverInfo" : {<br> "host" : "ip-172-30-2-181",<br> "port" : 27017,<br> "version" : "4.0.4",<br> "gitVersion" : "f288a3bdf201007f3693c58e140056adf8b04839"<br> },<br> "ok" : 1<br>} |
We can notice the following:
You can use the partialFilterExpression option even in compound indexes or other index types. Let’s see an example of a compound index.
Insert some documents in the students collection
|
1 |
db.students.insert( [<br>{ _id:1, name: "John", class: "Math", grade: 10 }, <br>{ _id: 2, name: "Peter", class: "English", grade: 6 }, <br>{ _id: 3, name: "Maria" , class: "Geography", grade: 8 }, <br>{ _id: 4, name: "Alex" , class: "Geography", grade: 5},<br>{ _id: 5, name: "George" , class: "Math", grade: 7 },<br>{ _id: 6, name: "Tony" , class: "English", grade: 9 },<br>{ _id: 7, name: "Sam" , class: "Math", grade: 6 },<br>{ _id: 8, name: "Tom" , class: "English", grade: 5 }<br>]) |
Create a partial compound index on name and class fields for the grade greater or equal to 8.
|
1 |
> db.students.createIndex( { name: 1, class: 1 }, { partialFilterExpression: { grade: { $gte: 8} } } )<br>{<br> "createdCollectionAutomatically" : false,<br> "numIndexesBefore" : 1,<br> "numIndexesAfter" : 2,<br> "ok" : 1<br>} |
Notice that the grade field doesn’t necessarily need to be part of the index.
Using the students collection, we want now to show when a partial index can be used.
The important thing to remember is that a partial index is “partial”. It means that it doesn’t contain all the entries.
In order for MongoDB to use it the conditions in the query must include an expression on the filter field and the selected documents must be a subset of the index.
Let’s see some examples.
The following query can use the index because we are selecting a subset of the partial index.
|
1 |
> db.students.find({name:"Tony", grade:{$gt:8}})<br>{ "_id" : 6, "name" : "Tony", "class" : "English", "grade" : 9 }<br><br>// let's look at the explain<br>> db.students.find({name:"Tony", grade:{$gt:8}}).explain()<br>{<br> "queryPlanner" : {<br> "plannerVersion" : 1,<br> "namespace" : "test.students",<br> "indexFilterSet" : false,<br> "parsedQuery" : {<br> "$and" : [<br> {<br> "name" : {<br> "$eq" : "Tony"<br> }<br> },<br> {<br> "grade" : {<br> "$gt" : 8<br> }<br> }<br> ]<br> },<br> "winningPlan" : {<br> "stage" : "FETCH",<br> "filter" : {<br> "grade" : {<br> "$gt" : 8<br> }<br> },<br> "inputStage" : {<br> "stage" : "IXSCAN",<br> "keyPattern" : {<br> "name" : 1,<br> "class" : 1<br> },<br> "indexName" : "name_1_class_1",<br> "isMultiKey" : false,<br> "multiKeyPaths" : {<br> "name" : [ ],<br> "class" : [ ]<br> },<br> "isUnique" : false,<br> "isSparse" : false,<br> "isPartial" : true,<br> "indexVersion" : 2,<br> "direction" : "forward",<br> "indexBounds" : {<br> "name" : [<br> "["Tony", "Tony"]"<br> ],<br> "class" : [<br> "[MinKey, MaxKey]"<br> ]<br> }<br> }<br> },<br> "rejectedPlans" : [ ]<br> },<br> "serverInfo" : {<br> "host" : "ip-172-30-2-181",<br> "port" : 27017,<br> "version" : "4.0.4",<br> "gitVersion" : "f288a3bdf201007f3693c58e140056adf8b04839"<br> },<br> "ok" : 1<br>} |
The following query cannot use the index because the condition on grade > 5 is not selecting a subset of the partial index. So the COLLSCAN is needed.
|
1 |
> db.students.find({name:"Tony", grade:{$gt:5}}).explain()<br>{<br> "queryPlanner" : {<br> "plannerVersion" : 1,<br> "namespace" : "test.students",<br> "indexFilterSet" : false,<br> "parsedQuery" : {<br> "$and" : [<br> {<br> "name" : {<br> "$eq" : "Tony"<br> }<br> },<br> {<br> "grade" : {<br> "$gt" : 5<br> }<br> }<br> ]<br> },<br> "winningPlan" : {<br> "stage" : "COLLSCAN",<br> "filter" : {<br> "$and" : [<br> {<br> "name" : {<br> "$eq" : "Tony"<br> }<br> },<br> {<br> "grade" : {<br> "$gt" : 5<br> }<br> }<br> ]<br> },<br> "direction" : "forward"<br> },<br> "rejectedPlans" : [ ]<br> },<br> "serverInfo" : {<br> "host" : "ip-172-30-2-181",<br> "port" : 27017,<br> "version" : "4.0.4",<br> "gitVersion" : "f288a3bdf201007f3693c58e140056adf8b04839"<br> },<br> "ok" : 1<br>} |
Even the following query cannot use the index. As we said the grade field is not part of the index. The simple condition on grade is not sufficient.
|
1 |
> db.students.find({grade:{$gt:8}}).explain()<br>{<br> "queryPlanner" : {<br> "plannerVersion" : 1,<br> "namespace" : "test.students",<br> "indexFilterSet" : false,<br> "parsedQuery" : {<br> "grade" : {<br> "$gt" : 8<br> }<br> },<br> "winningPlan" : {<br> "stage" : "COLLSCAN",<br> "filter" : {<br> "grade" : {<br> "$gt" : 8<br> }<br> },<br> "direction" : "forward"<br> },<br> "rejectedPlans" : [ ]<br> },<br> "serverInfo" : {<br> "host" : "ip-172-30-2-181",<br> "port" : 27017,<br> "version" : "4.0.4",<br> "gitVersion" : "f288a3bdf201007f3693c58e140056adf8b04839"<br> },<br> "ok" : 1<br>} |
A sparse index is an index that contains entries only for the documents that have the indexed field.
Since MongoDB is a schemaless database, not all the documents in a collection will necessarily contain the same fields. So we have two options when creating an index:
We call it “sparse” because it doesn’t contain all the documents of the collection.
The main advantage of the sparse option is to reduce the index size.
Here’s how to create a sparse index:
|
1 |
db.people.createIndex( { city: 1 }, { sparse: true } ) |
Sparse indexes are a subset of partial indexes. In fact you can emulate a sparse index using the following definition of a partial.
|
1 |
db.people.createIndex(<br>{city: 1},<br>{ partialFilterExpression: {city: {$exists: true} } }<br>)<br> |
For this reason partial indexes are preferred over sparse indexes.
Partial indexing is a great feature in MongoDB. You should consider using it to achieve the following advantages:
You are strongly encouraged to consider partial indexes if you have one or more of these use cases:
Partial indexes: https://www.mongodb.com/docs/manual/core/index-partial/
Sparse indexes: https://www.mongodb.com/docs/manual/core/index-sparse/
Articles on query optimization and investigation:
—
Photo by Mike Greer from Pexels