Deeper into 3.2 – Aggregation
Although $lookup is the star of MongoDB 3.2's aggregation enhancements, there are plenty of other additions that make the update a real boost for anyone building aggregation pipelines. The additions range from sampling to slicing and from arithmetic to arrays, so let's start with sampling.
Sample this
When you are aggregating data, there are times when you want only a sample of that data. In 3.2 the $sample operator makes that easy, but before you rush to use it, be aware of its limitations. It is invoked like this:
db.mytable.aggregate( [ { $sample: { size: 100 } } ] );
That will return 100 randomly selected documents. The catch is that, according to the documentation, the behavior of $sample varies with the percentage of the collection you ask for. If you request five percent of the documents or more, then whatever storage engine is in use, the database will scan the entire collection, sort it randomly in memory and pull the required number of documents from the top of the sorted set. That means you will be eating up memory and could hit the sort memory limit that applies to aggregation.
For smaller samples, below five percent, a random cursor (or random _id index selection) is used to retrieve the needed number of documents. That is much less onerous on memory, but it can return the same document twice if the underlying data changes while the cursor is being read.
Before applying the $sample operator, it may be worth getting an approximation of the total number of documents in the collection so you can ensure the sample size stays below five percent.
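To make that five percent rule concrete, here is a minimal plain-JavaScript sketch of the pre-flight check — a hypothetical helper, not part of MongoDB — where the document total would come from something like db.mytable.count() in the 3.2-era shell:

```javascript
// Hypothetical helper: decide whether a $sample of the requested size
// stays on the cheap, sub-5% random-cursor path, or would trigger the
// full collection scan and in-memory random sort.
function usesCheapSamplePath(totalDocs, sampleSize) {
  return sampleSize < totalDocs * 0.05;
}
```

With 10,000 documents a sample of 100 is well under the threshold, while the same sample against 1,000 documents is not.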
Standard Deviation
When you have a lot of numeric data, one thing you often want to know is how spread out that data is. That is why standard deviation operators are so useful, but until version 3.2 MongoDB lacked a native version of them. Now there are not one but two standard deviation operators for the aggregation pipeline: $stdDevPop for the standard deviation of a whole population and $stdDevSamp for a sample of a population.
Having recently extracted some historical weather data and used CSVkit to turn it into a format mongoimport could import easily, we had records of max and min temperatures for each month reaching back half a century. What we wanted was the standard deviation for each month through the years:
mongos> db.heathrowdata.aggregate([
{ $group: { _id: "$month", stdDev: { $stdDevPop: "$tempmaxC" } } } ,
{ $sort: { "_id": 1 } }
] )
{ "_id" : 1, "stdDev" : 1.8499012177873002 }
{ "_id" : 2, "stdDev" : 2.0668891381355117 }
{ "_id" : 3, "stdDev" : 1.8508736130508452 }
{ "_id" : 4, "stdDev" : 1.6215701093006014 }
{ "_id" : 5, "stdDev" : 1.5074067422249364 }
{ "_id" : 6, "stdDev" : 1.6717910696683473 }
{ "_id" : 7, "stdDev" : 2.0064315920469387 }
{ "_id" : 8, "stdDev" : 1.7571129543248085 }
{ "_id" : 9, "stdDev" : 1.40914201852535 }
{ "_id" : 10, "stdDev" : 1.4697372537823592 }
{ "_id" : 11, "stdDev" : 1.307233005813382 }
{ "_id" : 12, "stdDev" : 1.7551820211407254 }
Remember, of course, to apply the appropriate standard deviation operator: $stdDevPop for a complete population (or as much of a population as you are interested in) and $stdDevSamp for any smaller grouping. The latter works hand in hand with the $sample operator:
db.heathrowdata.aggregate( [
{ $sample: { size: 100 } },
{ $group: { _id: "$month", stdDev: { $stdDevSamp: "$tempmaxC" } } } ,
{ $sort: { "_id": 1 } }
] )
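The difference between the two operators comes down to the divisor. Here is a rough plain-JavaScript sketch of the arithmetic — not MongoDB's implementation — showing that the population version divides the summed squared deviations by n, while the sample version divides by n - 1 (Bessel's correction):

```javascript
// Sketch of population vs sample standard deviation.
// sample = false -> divide by n (like $stdDevPop)
// sample = true  -> divide by n - 1 (like $stdDevSamp)
function stdDev(values, sample) {
  var n = values.length;
  var mean = values.reduce(function (a, b) { return a + b; }, 0) / n;
  var sumSq = values.reduce(function (a, v) {
    return a + (v - mean) * (v - mean);
  }, 0);
  return Math.sqrt(sumSq / (sample ? n - 1 : n));
}
```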
$project yourself
One interesting change in the aggregation pipeline is that a number of operators which were previously only available in the $group stage are now available in the $project stage, where they behave differently. In $group these functions work on the set of values from the documents in each group, so if you were to apply $avg to, say, the temperature field we used previously, you would get the group averages. But there are no groups like that when $project is in play. Instead, these functions expect their expression to resolve to an array and operate on that. So you can use $avg, $min, $max and $sum, along with $stdDevPop and $stdDevSamp, in $project to operate on arrays inside documents, like this:
mongos> db.heathrowdata.aggregate( [
{ $project: { avgTemp: { $avg:[ "$tempmaxC", "$tempminC" ] } } }
] )
{ "_id" : ObjectId("5696328694cf37b40190fc8d"), "avgTemp" : 6.1 }
...
An aggregate tip
If you just want to play with the various aggregation operators, try this: create a collection with a single document, then run an aggregation on that collection with literal values. For example, for the previous code we could have done:
mongos> db.x.insert( { } )
WriteResult({ "nInserted" : 1 })
mongos> db.x.aggregate( [
{ $project: { avgTemp: { $avg:[ 10, 20, 30, 40 ] } } }
] )
{ "_id" : ObjectId("5697c12a8a0dbff7ab4477b7"), "avgTemp" : 25 }
mongos>
Just pop your aggregation operator inside the $project and away you go. Now, onwards.
Array up
We were just talking about arrays; there have been more enhancements in the pipeline to let you actually work with them. The first one to note is $isArray, which lets you determine whether a field (or any other expression) is an array. It is worth wrapping your array operations with it to make sure you really are processing arrays before you start using the other array operators.
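The same guard pattern can be sketched in plain JavaScript — this is an illustration of the idea, not MongoDB code, with a made-up helper name:

```javascript
// Sketch of the $isArray guard pattern: only apply an array operation
// when the value really is an array; otherwise skip rather than error.
function sumIfArray(value) {
  return Array.isArray(value)
    ? value.reduce(function (a, b) { return a + b; }, 0)
    : null; // not an array: pass on a harmless result
}
```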
Slicing arrays is especially useful if you need the start or end of an array, and $slice is the new operator for this. It takes an array, an optional position and a number, n. The n is the number of elements we want from the array: positive to take from the start of the array, negative to take from the end.
If we have an array [ 1, 2, 3, 4, 5 ] and we do { $slice: [ [ 1, 2, 3, 4, 5 ], 2 ] } we get [ 1, 2 ], and if we do { $slice: [ [ 1, 2, 3, 4, 5 ], -2 ] } we get [ 4, 5 ]. The optional position controls where the slicing starts, using a zero-based index; the first element of the array is position 0. So { $slice: [ [ 1, 2, 3, 4, 5 ], 2, 2 ] } gives [ 3, 4 ] and { $slice: [ [ 1, 2, 3, 4, 5 ], -2, 2 ] } gives [ 4, 5 ]. When you specify a position, though, the count must be positive, so there is no backwards slicing of the middle of arrays.
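Those rules can be sketched in plain JavaScript for intuition — this mimics $slice's behavior for the cases above, it is not MongoDB's implementation:

```javascript
// Sketch of $slice semantics: sliceArray(arr, n) or sliceArray(arr, position, n).
// Two-argument form: positive n takes from the front, negative n from the back.
// Three-argument form: position may be negative (counted from the end),
// and the count n must be positive.
function sliceArray(arr, a, b) {
  if (b === undefined) {
    var n = a;
    return n >= 0 ? arr.slice(0, n) : arr.slice(n);
  }
  var start = a >= 0 ? Math.min(a, arr.length) : Math.max(arr.length + a, 0);
  return arr.slice(start, start + b);
}
```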
The other way to cut down our arrays is to $filter them. It is easiest to look at an example:
{
    $filter: {
        input: [ 1, 2, 3, 4, 5 ],
        as: "val",
        cond: { $gte: [ "$$val", 3 ] }
    }
}
The filter has three properties: input, as and cond. The input defines an expression that has to resolve to an array; here, we have made it a literal array. The as simply defines the name of a variable. You may wonder where that variable name is used, and the answer is in the cond property, which contains a logical expression. Each value of the input array is assigned to the as variable and then tested in the cond expression.
If a value passes, it gets added to the array that will be returned; if it does not, it is simply dropped. Here our expression is { $gte: [ "$$val", 3 ] } – note the $$ used to refer to an aggregation variable. If the array value passed as val is greater than or equal to 3, it passes the filter. It should be no surprise to see this return [ 3, 4, 5 ] as its result.
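In plain JavaScript, the same bind-and-test behavior looks like an ordinary filter with a predicate — a sketch of the semantics, not MongoDB code:

```javascript
// Sketch of $filter: each element of the input array is bound to a
// variable (the predicate's parameter, playing the role of "as") and
// kept only when the condition (cond) holds.
function filterArray(input, cond) {
  return input.filter(cond);
}

var result = filterArray([1, 2, 3, 4, 5], function (val) {
  return val >= 3; // equivalent of { $gte: [ "$$val", 3 ] }
});
```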
We've sliced and filtered; now for the ultimate array cut-down, $arrayElemAt, which takes an array and a position and returns just one element. The position works like the $slice position, so { $arrayElemAt: [ [ "a", "b", "c", "d", "e" ], 2 ] } will return "c" and { $arrayElemAt: [ [ "a", "b", "c", "d", "e" ], -2 ] } will return "d".
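The indexing rule is easy to picture in plain JavaScript — again a sketch, not MongoDB itself:

```javascript
// Sketch of $arrayElemAt: a non-negative index counts from the front,
// a negative index counts back from the end of the array.
function arrayElemAt(arr, idx) {
  return idx >= 0 ? arr[idx] : arr[arr.length + idx];
}
```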
Enough of cutting arrays down; let's make them bigger. The $concatArrays operator can concatenate any number of arrays, taking all their contents in order to create a single larger array. In practice, that means:
{ $concatArrays: [ [ 1, 2, 3, 4, 5 ], [ [ "Hello", "World" ], [ "Nice", "Day" ] ], [ "a", "b", "c", "d", "e" ] ] }
would give:
[ 1, 2, 3, 4, 5, [ "Hello", "World" ], [ "Nice", "Day" ], "a", "b", "c", "d", "e" ]
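Note that only the top-level arguments are spliced together; nested arrays travel through as single elements. A plain-JavaScript sketch of that behavior (not MongoDB code):

```javascript
// Sketch of $concatArrays: append the elements of each argument array in
// order. Nested arrays stay intact as single elements; nothing is flattened.
function concatArrays() {
  return Array.prototype.concat.apply([], arguments);
}
```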
Number crunching
Aggregation had been lacking quite a few calculation operators; before 3.2 there were just $add, $divide, $mod, $multiply and $subtract. With 3.2 the list of arithmetic operators has been extended with the addition of $sqrt, $abs, $log, $log10, $ln, $pow, $exp, $trunc, $ceil and $floor.
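For intuition, most of the new arithmetic operators map onto familiar maths-library functions; here is a rough plain-JavaScript crib (a sketch, not MongoDB's implementation), with $log — which takes an explicit base — spelled out:

```javascript
// Rough JS equivalents of the new 3.2 arithmetic operators:
// $sqrt -> Math.sqrt, $abs -> Math.abs, $ln -> Math.log,
// $log10 -> Math.log10, $pow -> Math.pow, $exp -> Math.exp,
// $trunc -> Math.trunc, $ceil -> Math.ceil, $floor -> Math.floor.
// $log takes a value and a base, like this:
function logBase(value, base) {
  return Math.log(value) / Math.log(base);
}
```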
Wrapping up
That's a lot of additional power put into aggregation in MongoDB, and it should help especially where documents contain arrays loaded up with useful data. Do report back, or drop us a mail, if you have found any new aggregation feature particularly useful in production, as we will be returning to the subject of aggregation in the near future. Or, of course, you could Write Stuff about aggregation and MongoDB 3.2 for our Write Stuff programme and earn some cash and database credit in the process. All in all, the user-facing features of MongoDB 3.2 show a lot of potential for exploiting data.