Where the open source community meets: Secure your spot for Percona Live Amsterdam! - Register

Downloads

Blog

Contact Us

MongoDB 3.6 Aggregation and Array Improvements

December 27, 2017

Author

David Murphy

Insight for DBAs

Insight for Developers

MongoDB

Share this Post:

In this blog, we’ll look at MongoDB 3.6 aggregation and array improvements.

With MongoDB 3.6 out, and us having talked about security, transactions, and sessions, we can now start to talk about something much more fun. This blog is centered on the improvements around arrays, dates and aggregations. My goal here is two parts: first to say what these are, and then also provide a situation where it might be useful. As with most things you should consider setting db.adminCommand( { setFeatureCompatibilityVersion: "3.6" } ) before using these features. First, a warning: if you do this and use some features, downgrading back to 3.4 is a pain. As such only use these features, sessions or security once you know your upgrade was stable.

So the options in no particular order are:

- Diagnostic
  - - $currentOp
  - - $listSessions
  - - $listLocalSessions

- Arrays
  - - $arrayToObject
  - - $objectToArray
  - - $mergeObjects

- Date
  - - $dateFromString
  - - $dateFromParts
  - - $dateToParts

- General
  - - db.aggregate helper compared to db.collection.aggregate
  - - $hint
  - - $comment
  - - $$REMOVE inside conditional projection

- Non-Aggregation Arrays
  - - arrayFilters
  - - multi-element Array Updates
  - - push’s $position can use negative values

There’s a lot there, but don’t worry! The big ones are the Non-Aggregation Arrays, $hint, Date, and Array. Percona is here to help, and I will go deeper into each area of these changes by giving you some examples.

$hint

This might seem minor, but as MongoDB moves more and more into aggregations as the default query system for most rich actions (but not point queries), it becomes critical. In previous versions, you couldn’t even explain an aggregation. Modern versions changed this.

The next logical step is allowing $hint, which is the act of informing the optimizer to use a specific index pattern to solve the query. You might ask, “Shouldn’t it just make the right decision?” Well yes, but all databases need this type of functionality because, depending on the query pattern that gets cached, it might have selected one query even if a query of the same pattern would benefit more from another one. Please note $graphLookup and $lookup are not able to use $hint.

Imagine you have a query from a given user looking for orders between two dates for a given category like “Computers.” Logically, the index might be on category and user, then from those results, you filter out based on the date. What if I told you some users were resellers, and Walmart was one of those “users”? In that case, category and date followed lastly by the user would be far more efficient.

An index’s job is to filter out as much data as possible to reduce the computation for any filter, not in the index. In this example, Walmart would have far more products in other orders than it would in computers. Even inside of that data, it would help reduce the millions of possible records.

Date

There are three main new date functions inside of MongoDB 3.6: $dateFromString, $dateFromParts and $dateToParts. But what do these mean? Let’s explore what is “parts” vs. “string.” A part is a grouping of typical time fields. In fact, it included many of the typical fields:

- year

- month

- day

- hour

- minute

- second

- milliseconds

- timezone

If you think about it, these are possible parts a string could have when we are processing from a string instead of parts. So these functions are designed to help build an ISODate object from either a complex document of the parts, or parsing of string date formats you might use in other code places. For example, datetime.strptime in Python or strtotime in PHP. This covers the $dateFromXXXX options. However, in some codebases, you don’t really want this Date object in MongoDB to need to be marshaled by your application to be able to construct an internal date to your code.

Please note that many drivers will do this for you automatically, but maybe it could be an expensive operation. You want to be more precise. In that case, you can use $dateToParts, where it would take a date and break it into parts to be directly used by your application. Another example of this is the ability to have conditionals, where something must be true to execute. In this case, having $dateToParts $dateToParts is more efficient than a multitude of calls to all the part functions ( $year, $minute, $week, $year, and so on).

Array

Some of the new array functions include $arrayToObject, $objectToArray and $ObjectMerge. I know the last one is a bit of an outlier, but this was logically the best place to keep it, as objects are not completely dissimilar to arrays on the outside.

$arrayToObject is my least favorite, as it has some very limited behavior that will likely require you to plan your schema, or at least have another projection pipeline to prepare your array data for what you need. Some basic rules apply to this function. It expects you to pass an array of either two-element arrays or documents, where the value of the first item is the “key” in the output object, and the second value is the “value” in that output document.

If that prep was not bad enough, we must also consider if you have a “key” item twice. The output value of the item field in the resulting object will be the value for the first array entry to have matched it, not the last. I am not saying this is good or bad, as someone had to make a call on what was FIFO or FILO in the overall behavior when duplicates are found.

If you are confused, MongoDB has a great example in the manual here. The idea is that you have an inventory system with a stock field as a subdocument, where you have {warehouseName:currentStockValue} as fields and values. Assuming this, they then use $concatArrays (think array merge), so they can project a “total” entry and value into this array, which results in memory to an array like ["warehouse1"=>13,"warehouse2"=>20,"total"=>33]. This, when applied to $arrayToObject, becomes {warehouse1:13,warehouse2:20,total:33}. This is likely easier to work within your code, as it would be a native object or structure depending on the language involved.

The next one is $objectToArray, which as you might guess is the exact opposite of above. You might have something like:

{
	name: "foo"
	category:"bar"
	details:{
		size: 25
		color: red
		qty: 1
	}
}

1

2

3

4

5

6

7

8

9

{

name: "foo"

category:"bar"

details:{

size: 25

color: red

qty: 1

}

This becomes:

[
	{ "k": "name", "v": "foo"},
	{ "k": "category", "v": "bar"}
	{ "k": "details", "v": {"size":25,"color":"red","qty":1}}
]

1

2

3

4

5

[

{ "k": "name", "v": "foo"},

{ "k": "category", "v": "bar"}

{ "k": "details", "v": {"size":25,"color":"red","qty":1}}

]

If I am honest, I have trouble seeing the value of this. But I am sure in some programming contexts, this is helpful. Maybe if you wanted to “ $unwind” an object for some reason, but a use case escapes me at the time of this writing.

The final item is $mergeObjects, which does what you expect: it merges objects. This would work as you would plan to join non-duplicate keys into a single master object output. Where things are a bit weird is at $ArrayToObject. You may recall me stating the FIRST value is what is output. In $mergeObjects, the LAST value wins. As such, please be very careful when using these that you remember the order of application. As a MongoDB user, I wish it was more consistent, but there are good reasons for the behavior differences between what you want to assume with arrays vs. objects.

Non-Aggregation Arrays

As I mentioned in the intro, this can be broken into three areas: arrayFilters, multi-element updates, and negative Index usage on $push. They’re all much more useful to me than some of the aggregation improvements on arrays I just talked about.

To start, let’s talk about the big one: arrayFilters. We are almost talking about a subquery really. In many update functions, the optional arrayFilters field takes a very expressive array into it. To keep things simple, let’s say you have an application getting temperatures with a thermostat. Let’s also assume if it errors, the temp is set to 212℉, which obviously would kill someone. As a result of this value, it completely throws off the dashboards for your users.

How you deal with this is a business decision, but in my case, I decided I wanted to set them to null as the graph library would simply skip that plot point. Later points would help average the line out, preventing weird spikes or drops.

Now, let’s also assume that I keep 24 points in an array per day to reduce the number of total documents in the system. If I simply dropped the data for one point, I would only have 23, and the ranges on my axis in the graph would change, so that is a no go. We need to mark the value as null to prevent odd other behaviors.

So what can we do? Before 3.6, I needed to pull the documents down to my script/app to fix the array of elements, and then update the document fixing the issue. Obviously, this is less than ideal for many reasons but needed. Now with 3.6, we have a much better option. What commands can be used to apply this new way?

- findAndModify
  - - findOneAndUpdate()
  - - findAndModify()

- update
  - - updateOne()
  - - updateMany()
  - - update()

- bulkWrite
  - - bulkWrite – updateOne or updateMany types
  - - bulk.find.arrayFilters() for Bulk supporting only updateOne()

In my example, I will just use an update command (what people normally would use for this type of thing). So coming back to our example, let’s recap:

- Need to change any value of 212 to be null

- Must not delete an entry as it causes issues for the app

- Must not need to pull all data to app to be sent back to database

You might be surprised at how simple this is now. I just need to run the following:

db.readings.update(
	{tmp:{$gte: 212}},
	{$set:{"tmp.$[elm]"}},
	{
		multi: true,
		arrayFilters: [ { "elm": {$gte: 212}} ]
	}
)

1

2

3

4

5

6

7

8

db.readings.update(

{tmp:{$gte: 212}},

{$set:{"tmp.$[elm]"}},

{

multi: true,

arrayFilters: [ { "elm": {$gte: 212}} ]

}

)

I have kept this example simple, but you could have a good deal more logic in the arrayFilters, where you might use $and allow conditions where on a sales system you want to apply a 25% discount if something has more than 1000 copies in stock and has not gone on sale in six months. Followed by another update, using $or to remove anything from being on sale that has been on sale in the last six months OR has less than 1000 copies in stock. At any rate, surely you can see how much better this makes a multitude of tasks when you’re using arrays heavily in your design.

The next item is multiple element updates. You already saw this with arrayFilters when I used “ tmp.$[elm]”. Everything inside $[elm] was allowing for this arrayFilter just limits what documents get the change. Putting $[] would have updated all the documents, which is not what I wanted in that case, as I did not want to update the other 23 hours in the day (just the one element that was broken).

The final point is around $position inside of a push command. It used to be if I wanted to push to the end of an array, I needed to know its length to push to the end, or I could push to the start of the array. There are many cases where you want to append versus prepend the new item in an array.

However, needing to do a read first was an extra request and also opened you up to odd timing issues. Now we can just use -1, as in most programming languages, to do the same thing. However, we should note -1 means the element before the LAST one. If you want the element at the very end you would need to use an artificially high number, as any number greater than the length will result in pushing an element to the end of the array.

Diagnostic

This might be the shortest section of all, but the goal of $currentOp, $listSessions and $listSessionsLocal is to provide more information via aggregations, in the slow march toward moving more and more into that system.

While these seem simple, there is some major power in them. In $currentOp, which must be the first pipeline you have some special options. The first is allUsers, which defaults to false and thus will only show your operations, not other peoples. The other is idleConnections, which are also off by default. Once it returns these values, you can use aggregation to $bucketAuto ages of queries. Or maybe group on getMore, replication and general. Or even limit the answer to specific shards or namespaces using a $match. As you can see, it is very powerful!

Like currentOp, the session functions can also use allUsers to list for all users. They also have a users option. In the users option, you can pass an array with the object of {user: XXXX,db:XXXX} to filter the results more. However, if you are not using authentication, it will always return all sessions. As with all command aggregations, it should also be in the first slot. The significant difference is $listLocalSessions means the session this node has in memory, and $listSessions means read the sessions from the system.sessions more directly.

General

As we have already discussed $hint, I won’t cover it again. But there are some other general improvements that are interesting (but not what I would call major). The first is you can now use db.aggregate() by passing it the namespace, and you don’t have to use db.collection.aggregate(). This is not a huge change, but it does allow you to use the DB level scope.

$comment is another general improvement. If you are from the RDBMS world, you will know systems like MySQL support comments. The real power is it allows you to know when you have a slow query what part of the subsystem did the request come from. This, in turn, makes support and operations teams much more efficient at troubleshooting and resolving an issue. Please use with caution, as comments mixed with big aggregations could, in theory, exceed 16MB on the request size.

This should be VERY rare, but could still impact truncation in logs and system.profiler. As such, be as short but as specific as possible with any comments. For example, {$comment:"API::UserReq::addPrd(add_prod.py:23)"} would tell you the code path used and the file/line where this was called.

The last item to cover is about $$REMOVE in $project conditionals. This is hugely powerful, as you can tell from some outputs if you want to include or hide some field from your output. The best explanation is found in the manual here, but I will give you a quick example

db.inventory.aggregate( [
   {
      $project: {
         name: 1,
         "seller.name": 1,
         "seller.city" : 1,
         "seller.markup": {
            $cond: {
               if: { $eq: [ "", "$seller.markup" ] },
               then: "$$REMOVE",
               else: "$seller.markup"
            }
         }
      }
   }
] )

1

2

3

4

5

6

7

8

9

10

11

12