Deeper into 3.2 – Aggregation

Although $lookup is the star of MongoDB 3.2's aggregation enhancements, there are lots of other additions that make the update a real boost for anyone sifting data with the aggregation pipeline. The additions range from sampling to slicing and from arithmetic to arrays, so let's start with sampling.

Sample this

When you are aggregating data there are times when you want only a sample of that data. In 3.2 the $sample operator makes it easy for you, but before you rush to use it, be aware of its limitations. It is invoked like this:

db.mytable.aggregate( [ { $sample: { size: 100 } } ] );  

That will return 100 randomly selected documents. The catch is that, according to the documentation, the behavior of $sample varies based on what percentage of the documents you are actually asking for. If you ask for five percent of the documents or more, then whatever storage engine is in use, the database will scan the entire collection, sort it randomly in memory and then pull the required number of documents from the top of the sorted set. This means you will be eating up memory and could hit the sort memory limitations that apply to aggregation.

For smaller samples, below five percent, a random cursor or random _id index selection is used to retrieve the needed number of documents. That's much less onerous on memory consumption but can return the same document twice if, while the cursor is being read, the underlying data changes.

Before applying the $sample operator, it may be worth getting an approximation of the total number of documents in the collection so you can ensure the sample size is below 5%.
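As a rough sketch, using the hypothetical mytable collection from above, a quick count() before sampling keeps things predictable:

// check the collection size first...
db.mytable.count()
// ...then keep the sample comfortably below five percent of that figure
db.mytable.aggregate( [ { $sample: { size: 1000 } } ] )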

Standard Deviation

When you have a lot of numeric data, one thing you often want to know is how spread out that data is. That's why standard deviation operators are so useful, but until version 3.2 MongoDB lacked a native version of them. Now there are not one but two standard deviation operators for the aggregation pipeline: $stdDevPop for the standard deviation of a population and $stdDevSamp for a sample of a population.

After recently extracting some historical weather data and using CSVkit to turn the data into a format mongoimport could import easily, we had records for each month reaching back half a century for max and min temperatures. What we wanted was the standard deviation for each month through the years:

mongos> db.heathrowdata.aggregate([  
                 { $group: { _id: "$month", stdDev: { $stdDevPop: "$tempmaxC"  } } } ,
                 { $sort: { "_id": 1 } } 
                 ] )
{ "_id" : 1, "stdDev" : 1.8499012177873002 } 
{ "_id" : 2, "stdDev" : 2.0668891381355117 }
{ "_id" : 3, "stdDev" : 1.8508736130508452 }
{ "_id" : 4, "stdDev" : 1.6215701093006014 }
{ "_id" : 5, "stdDev" : 1.5074067422249364 }
{ "_id" : 6, "stdDev" : 1.6717910696683473 }
{ "_id" : 7, "stdDev" : 2.0064315920469387 }
{ "_id" : 8, "stdDev" : 1.7571129543248085 }
{ "_id" : 9, "stdDev" : 1.40914201852535 }
{ "_id" : 10, "stdDev" : 1.4697372537823592 }
{ "_id" : 11, "stdDev" : 1.307233005813382 }
{ "_id" : 12, "stdDev" : 1.7551820211407254 }

Remember, of course, to apply the appropriate standard deviation operator, pop for a complete population or as much of a population as you are interested in, and sample for any smaller grouping. The latter works hand in hand with the $sample operator:

db.heathrowdata.aggregate( [  
                        { $sample: { size: 100 } },
                        { $group: { _id: "$month", stdDev: { $stdDevSamp: "$tempmaxC"  } } } , 
                        { $sort: { "_id": 1 } } 
                        ] )

$project yourself

One interesting change with the aggregation pipeline is that more operators which were previously only available in the $group stage are now available in the $project stage, but they behave differently there. In $group, these functions work on the set of values from the documents in each group, so if you were to run $avg on, say, the temperature field we used previously, you'd get the group averages. There are no groups like that when $project is in play. Instead, these functions expect their expression to resolve to an array and operate on that. So you can use $avg, $min, $max and $sum, along with $stdDevPop and $stdDevSamp, in $project to operate on arrays inside documents, or on arrays you assemble on the fly like this:

mongos> db.heathrowdata.aggregate( [  
        { $project: { avgTemp: { $avg:[ "$tempmaxC", "$tempminC" ] } } } 
        ] )
{ "_id" : ObjectId("5696328694cf37b40190fc8d"), "avgTemp" : 6.1 }
...

An aggregate tip

If you just want to play with the various aggregation operators, try this: create a collection with a single document, then do an aggregation on that collection with literal values. So, for example, for the previous code we could have done:

mongos> db.x.insert( { } )  
WriteResult({ "nInserted" : 1 })  
mongos> db.x.aggregate( [  
             { $project: { avgTemp: { $avg:[ 10, 20, 30, 40 ] } } }  
             ] )
{ "_id" : ObjectId("5697c12a8a0dbff7ab4477b7"), "avgTemp" : 25 }
mongos>  

Just pop your aggregate operator inside the $project and away you go. Now onwards.

Array up

We were just talking about arrays; there have been more enhancements in the pipeline to let you actually work with them. The first one to note is $isArray, which lets you determine whether a field (or any other expression) is an array. It's worth wrapping your array operations with it to make sure you really are processing arrays before you start making use of the other array operators.
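A minimal sketch of that defensive pattern, assuming a hypothetical scores array field, uses $cond to fall back to an empty array when the field isn't one:

db.mytable.aggregate( [
        { $project: {
            safeScores: {
                $cond: [ { $isArray: [ "$scores" ] }, "$scores", [ ] ]
            }
        } }
        ] )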

Slicing arrays up is especially useful if you need the start or end of an array and $slice is the new operator for this. It takes an array, an optional position and a number, n. The n is the number of elements we want from the array and it's positive to take from the start of the array and negative from the end.

If we have an array [ 1, 2, 3, 4, 5 ] and we do $slice: [ [ 1, 2, 3, 4, 5 ], 2 ] we get [ 1, 2 ], and if we do $slice: [ [ 1, 2, 3, 4, 5 ], -2 ] we get [ 4, 5 ]. The optional position gives control of where that slicing starts using a 0 based index; the first position of the array is position 0. So, $slice: [ [ 1, 2, 3, 4, 5 ], 2, 2 ] gives [ 3, 4 ] and $slice: [ [ 1, 2, 3, 4, 5 ], -2, 2 ] gives [ 4, 5 ]. If you use a negative position though, you can't use a negative count, so no backwards slicing of the middle of arrays.
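In a pipeline, that might look something like this sketch, which assumes a hypothetical readings array field:

db.mytable.aggregate( [
        { $project: {
            firstTwo: { $slice: [ "$readings", 2 ] },
            lastTwo: { $slice: [ "$readings", -2 ] }
        } }
        ] )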

The other way to cut down our arrays is to $filter them. It's easiest to look at an example:

{ 
    $filter: { 
        input:[ 1, 2, 3, 4, 5 ], 
        as:"val" , 
        cond: { $gte: [ "$$val", 3] } 
        } 
}

The filter has three properties: input, as and cond. The input defines an expression that has to resolve to an array. Here, we've made it a literal array. The as simply defines the name of a variable. You may wonder where that variable is used, and the answer is in the cond property, which contains a logical expression. Each value of the input array is assigned to the as variable and then tested in the cond expression.

If it passes, it gets added to the array that will be returned; if it doesn't, it's just dropped. Here our expression is { $gte: [ "$$val", 3 ] } – note the $$ to refer to an aggregation variable. If the array value passed as val is greater than or equal to 3, it'll pass the filter. It shouldn't be a surprise to see this returning [ 3, 4, 5 ] as its result.
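The single-document trick from earlier is a handy way to try it out, something along these lines:

mongos> db.x.aggregate( [
             { $project: { filtered: {
                 $filter: {
                     input: [ 1, 2, 3, 4, 5 ],
                     as: "val",
                     cond: { $gte: [ "$$val", 3 ] }
                 }
             } } }
             ] )
{ "_id" : ObjectId("..."), "filtered" : [ 3, 4, 5 ] }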

We've sliced and filtered, now for the ultimate array cut down: $arrayElemAt, which takes an array and a position and returns just one element. The position works like the $slice position, so { $arrayElemAt: [ [ "a", "b", "c", "d", "e" ], 2 ] } will return "c" and { $arrayElemAt: [ [ "a", "b", "c", "d", "e" ], -2 ] } will return "d".
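A common use is grabbing the first or last element of an array field; here's a sketch, again assuming a hypothetical readings array:

db.mytable.aggregate( [
        { $project: {
            first: { $arrayElemAt: [ "$readings", 0 ] },
            last: { $arrayElemAt: [ "$readings", -1 ] }
        } }
        ] )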

Enough of cutting down arrays, let's make them bigger. The $concatArrays operator can concatenate any number of arrays, taking all their contents in order to create a single larger array. That means in practice that:

 { $concatArrays: [ [ 1, 2, 3, 4, 5 ], [ [ "Hello", "World" ], [ "Nice", "Day" ] ], [ "a", "b", "c", "d", "e" ] ] }

would give:

[ 1, 2, 3, 4, 5, [ "Hello", "World" ], [ "Nice", "Day" ], "a", "b", "c", "d", "e" ]
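You can check that behaviour for yourself with the single-document collection, something like:

mongos> db.x.aggregate( [
             { $project: { joined: { $concatArrays: [ [ 1, 2, 3 ], [ "a", "b" ] ] } } }
             ] )
{ "_id" : ObjectId("..."), "joined" : [ 1, 2, 3, "a", "b" ] }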

Number crunching

Aggregation had been lacking in quite a few calculation operators; before 3.2 it was just $add, $divide, $mod, $multiply and $subtract. With 3.2, the list of arithmetic operators is extended with the addition of $sqrt, $abs, $log, $log10, $ln, $pow, $exp, $trunc, $ceil and $floor.
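The single-document collection is again a quick way to get a feel for them; a sketch:

mongos> db.x.aggregate( [
             { $project: {
                 root: { $sqrt: 2 },
                 cube: { $pow: [ 3, 3 ] },
                 rounded: { $trunc: 3.78 }
             } }
             ] )
{ "_id" : ObjectId("..."), "root" : 1.4142135623730951, "cube" : 27, "rounded" : 3 }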

Wrapping up

That's a lot of additional power put into aggregation on MongoDB and it should help, especially where there are arrays in documents loaded up with useful data. Do report back, or drop us a mail, if you've found any new aggregation feature particularly useful in production, as we'll be coming back to the subject of aggregation in the near future. Or, of course, you could Write Stuff about aggregation, and MongoDB 3.2, for our Write Stuff programme and earn some cash and database credit for yourself in the process. All in all, the user-facing features of MongoDB 3.2 indicate a lot of potential for exploiting data.