Aggregation in MongoDB 2.6: Things Worth Knowing

TL;DR: The powerful aggregation framework in MongoDB is even more powerful in MongoDB 2.6

The MongoDB 2.6 release improved the aggregation framework (one of MongoDB's best features) considerably. We often hear from customers who are unaware of the aggregation framework, or unsure exactly why they should be using it. We frequently find them wrestling with unnecessarily complex and slow methods of solving problems that the aggregation framework is purpose-built to solve. With this in mind, we'll take a moment to introduce aggregation before diving into the 2.6 changes. If you already understand the framework, feel free to skip ahead; otherwise read on...

Introducing aggregation

The aggregation framework in MongoDB has become the go-to tool for a range of problems which would traditionally have been solved with the map-reduce engine. Introduced back in MongoDB 2.2, the framework distills collections down to essential information using a multi-stage pipeline of filters, groupers, sorters, transformations and other operators. The distilled set of results is produced far more efficiently than with other techniques. The set of operations is fixed, though, and does not have the flexibility of map-reduce scripts. Before investing development time in map-reduce, it is best to check whether you can achieve the same results with the aggregation framework.

Step by step

We prefer to show rather than tell, so let's look at a worked example. Say we have a simple collection of documents – first name, last name and zip code...

{
    _id: ObjectId,
    userid: 1,
    firstName: "first",
    lastName: "last",
    zip: "12345"
}

If you don't have a collection like that, try this Node.js program, which will make you a million documents:

var MongoClient = require('mongodb').MongoClient;
var Faker = require('Faker');
var MongoHQURL = "<Your MongoHQ URL>";

MongoClient.connect(MongoHQURL, function(err, db) {
    if (err) throw err;
    db.collection('zips', function(err, collection) {
        // Clear out any existing documents first
        collection.remove(function(err) {
            // Queue up a million inserts as one ordered bulk operation
            var bulkop = collection.initializeOrderedBulkOp();
            for (var i = 0; i < 1000000; i++) {
                bulkop.insert({
                    userid: i,
                    firstName: Faker.Name.firstName(),
                    lastName: Faker.Name.lastName(),
                    zip: Faker.Address.zipCode()
                });
            }
            // Send the whole batch to the server in one go
            bulkop.execute(function(err, results) {
                console.log(JSON.stringify(results));
                process.exit();
            });
        });
    });
});

(The program uses the Faker.js library to mock up records. Install it with npm install Faker ... the uppercase F in Faker is important.) After running that, you will have some data to work with.

What we want to know is how many of those records belong to the same zip code. We start the process by calling aggregate on our collection and then pass it an array that represents the pipeline...

db.zips.aggregate( [  

Now we want to use the $group operator as the first element of the pipeline. This is the core operator of the aggregation framework, as it groups documents together so that aggregate values can be calculated.

{  $group: {

For our needs, we want to group on the zip code so we want to create new documents with an _id field mapped to the zip field.

_id: "$zip",  

Now we want a count of how many times each zip code appears, which we'll put in a field called usersInZip. There's a whole set of group accumulators – $max, $min, $avg, $first, $last, $addToSet, $push and the one we're interested in, $sum, which adds a value to the field for each document matched. This value can be derived from the document, so you can create totals, or it can be a constant like 1, which effectively counts the number of documents...

usersInZip: { $sum: 1 }  
    } }
] );

If we put that together and call it from the Mongo shell, we quickly get the first selection of aggregated values.

> db.zips.aggregate( [ { $group: { _id: "$zip", usersInZip: { $sum: 1 } } } ] );
{ "_id" : "36305-8896", "usersInZip" : 1 }
{ "_id" : "47786-1882", "usersInZip" : 1 }
{ "_id" : "37395-6928", "usersInZip" : 1 }
{ "_id" : "63261-8198", "usersInZip" : 1 }
{ "_id" : "22501-2147", "usersInZip" : 1 }
{ "_id" : "55996-0238", "usersInZip" : 1 }
...
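
As an aside, the other accumulators follow the same shape. Here's a minimal sketch that also gathers the distinct first names seen in each zip code using $addToSet (the firstNames field name is our own invention):

> db.zips.aggregate( [
... { $group: {
...     _id: "$zip",
...     usersInZip: { $sum: 1 },                  // count of documents per zip
...     firstNames: { $addToSet: "$firstName" }   // distinct first names per zip
... } }
... ] );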

But we're only really interested in places where more than 10 people cite the same zip code. For this, we use the $match operator, which takes standard MongoDB query syntax as its parameter. For our needs, we just want usersInZip to be over ten...

{ $match: { usersInZip: { $gt: 10 } } }

We add this as the next element in the pipeline array and we put that into the shell...

> db.zips.aggregate( [
... { $group: { _id: "$zip", usersInZip: { $sum: 1 } } },
... { $match: { usersInZip: { $gt: 10 } } }
... ] );
{ "_id" : "53082", "usersInZip" : 11 }
{ "_id" : "49904", "usersInZip" : 12 }
{ "_id" : "24846", "usersInZip" : 11 }
{ "_id" : "49848", "usersInZip" : 11 }
{ "_id" : "74083", "usersInZip" : 11 }
...

But ideally we'd like these in descending order so we can add another element, $sort, to the pipeline array...

> db.zips.aggregate( [
... { $group: { _id: "$zip", usersInZip: { $sum: 1 } } },
... { $match: { usersInZip: { $gt: 10 } } },
... { $sort: { usersInZip: -1 } }
... ] );
{ "_id" : "31057", "usersInZip" : 17 }
{ "_id" : "70191", "usersInZip" : 17 }
{ "_id" : "91939", "usersInZip" : 16 }
{ "_id" : "16431", "usersInZip" : 16 }
...

Other operators that can be added to the pipeline include $project, which allows fields to be added, renamed, computed from values or removed as they pass through the pipeline, $skip to hop over a number of pipelined documents, and $limit to cap the number of documents passed along. More advanced operations include $unwind, which expands arrays into the pipeline, and $geoNear, which fills the pipeline with documents based on geographical proximity to a location. It's a powerful tool which is easy to access.
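
To give a flavour, here's a hedged sketch of $project and $limit slotted onto the end of our earlier pipeline, renaming the grouped _id back to zip (the zip and count field names are our own choice) and capping the output at five documents:

> db.zips.aggregate( [
... { $group: { _id: "$zip", usersInZip: { $sum: 1 } } },
... { $match: { usersInZip: { $gt: 10 } } },
... { $sort: { usersInZip: -1 } },
... { $project: { _id: 0, zip: "$_id", count: "$usersInZip" } },
... { $limit: 5 }
... ] );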

2.6 powers up the pipeline

In MongoDB 2.6, the first major change is almost invisible: the aggregate method now returns a cursor rather than a single array of documents, which means it can return any number of results rather than being constrained by the 16MB BSON document limit.
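
In the 2.6 shell, for instance, the returned cursor can be iterated like any other; a quick sketch:

> var cursor = db.zips.aggregate( [ { $group: { _id: "$zip", usersInZip: { $sum: 1 } } } ] );
> while (cursor.hasNext()) { printjson(cursor.next()); }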

$out and $redact are two new types of stage added in 2.6's aggregation pipeline. $out allows the results of an aggregation pipeline to be written to a new collection. It has to be the last stage in the pipeline and takes a collection name as a parameter. If there's no collection of that name, the collection will be created. If there is an existing collection, the new results will completely replace it. The new collection only becomes available when the aggregation has successfully completed and the results don't violate any index constraints, including the _id field. This also means there's no need to worry about the new collection having an incomplete result set.
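
So, for example, to persist our per-zip counts we could finish the pipeline with $out (the zipCounts collection name here is just an example):

> db.zips.aggregate( [
... { $group: { _id: "$zip", usersInZip: { $sum: 1 } } },
... { $out: "zipCounts" }
... ] );
> db.zipCounts.count();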

$redact strips content from the document stream based on values within each document and its sub-documents. Depending on the result of a boolean expression, a document can be pruned from the stream, included after its sub-documents have also been checked, or passed complete into the stream. The idea behind $redact is to make the removal of sensitive information from the stream easy.
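
In practice, $redact is usually driven by a $cond expression that resolves to one of the system variables $$PRUNE, $$DESCEND or $$KEEP. A minimal sketch, assuming our documents carried a hypothetical level field indicating sensitivity:

> db.zips.aggregate( [
... { $redact: {
...     $cond: {
...         if: { $gt: [ "$level", 3 ] },   // hypothetical sensitivity field
...         then: "$$PRUNE",                // drop this document or sub-document
...         else: "$$DESCEND"               // keep it, but check its sub-documents too
...     }
... } }
... ] );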

Many of the other changes have been in enhancing the operators available within stages. For example, there are now set operators: $setEquals for equality, $setIntersection, $setDifference and $setUnion for creating sets from two sets, $setIsSubset for testing whether a set is a subset of another, and $anyElementTrue and $allElementsTrue for testing whether any or all of a set's elements evaluate to true.
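
A quick sketch of a couple of these inside a $project stage, assuming hypothetical array fields tagsA and tagsB on the documents:

> db.example.aggregate( [
... { $project: {
...     common: { $setIntersection: [ "$tagsA", "$tagsB" ] },  // elements in both
...     either: { $setUnion: [ "$tagsA", "$tagsB" ] },         // elements in either
...     covered: { $setIsSubset: [ "$tagsA", "$tagsB" ] }      // is tagsA within tagsB?
... } }
... ] );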

Variables and Explaining

Variables are now supported in aggregation pipelines; they can be assigned values using $let or take a value from one of the available system variables. There's also a $map operator which can apply an expression to all the elements of an array or set, a $literal operator to stop values that include a "$" being evaluated as expressions, and a $size operator to get the length of an array.
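
A brief sketch of these in a $project stage; the scores array field and the output field names are hypothetical:

> db.example.aggregate( [
... { $project: {
...     doubled: { $map: { input: "$scores", as: "s",
...                        in: { $multiply: [ "$$s", 2 ] } } },  // double each element
...     adjusted: { $let: { vars: { bonus: 10 },
...                         in: { $add: [ "$$bonus", { $size: "$scores" } ] } } },
...     scoreCount: { $size: "$scores" },    // length of the array
...     label: { $literal: "$zip" }          // the string "$zip", not the field's value
... } }
... ] );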

Keeping the best till last, you can now get an aggregation explained by passing the option explain: true to the aggregate() call, which then returns information about how the pipeline will be processed. This addition stops aggregation from being a black box and opens the door to pipeline analysis for better performance.
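
For example, running our earlier pipeline with the option set:

> db.zips.aggregate(
... [ { $group: { _id: "$zip", usersInZip: { $sum: 1 } } } ],
... { explain: true }
... );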

Conclusion and resources

If you aren't using (or haven't looked at) aggregation in MongoDB, hopefully we've whetted your appetite and set you on a path to exploit the framework's potential. You'll find MongoDB's documentation has a section which introduces the features and a quick reference page for the pipeline stages and operators. If you are coming from an SQL background, check out the SQL to aggregation chart too.