Aggregations in MongoDB by Example

In this second half of MongoDB by Example, we'll explore the MongoDB aggregation pipeline. The first half of this series covered MongoDB Validations by Example.

When it's time to gather metrics from MongoDB, there's no better tool than MongoDB aggregations. Aggregations are a set of functions that allow you to manipulate the data being returned from a MongoDB query, and in this article, we'll explore MongoDB aggregations by demonstrating a few. In particular, we'll take a look at how to create basic data transformations using aggregations, and then explore how to create more complex queries by chaining multiple transformations together. Finally, we'll demonstrate how to use those transformations to extract insights from our data.

Starting an Aggregation Pipeline

Getting an aggregation pipeline started is a simple affair: just call the aggregate function on any collection. Let's start by using the customer documents from the previous article on validations:

{
  "id": "1",
  "firstName": "Jane",
  "lastName": "Doe",
  "phoneNumber": "555-555-1212",
  "city": "Beverly Hills",
  "state": "CA",
  "zip": 90210,
  "email": "Jane.Doe@compose.io"
}

Our customers collection contains any number of customers, each with a similar data format. To begin an aggregation, we call the aggregate function on the customers collection:

> db.customers.aggregate([ ... aggregation steps go here ...]);

The aggregate function accepts an array of data transformations which are applied to the data in the order they're defined. This makes aggregation a lot like other data flow pipelines: the transformations that are defined first will be executed first and the result will be used by the next transformation in the sequence.

In the next sections, we'll explore the types of data transformations you can perform, in an order that's typical of how you'll want to perform them. The order in which you perform transformations can make a big difference, since you'll want to filter out any documents you don't want to manipulate before you do any complex or intensive operations on them.
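As a quick preview of that shape, here's a sketch of a two-stage pipeline that filters before it aggregates, using fields from the customer model above; we'll build up real versions of both stages in the sections that follow:

> db.customers.aggregate([
  { $match: { "state": "CA" }},
  { $group: { _id: "$city", count: { $sum: 1 }}}
]);

The $match stage runs first and reduces the number of documents the $group stage has to process.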

Matching Documents

The first stage of the pipeline is matching, which allows us to filter documents so that we're only manipulating the ones we care about. The matching expression looks and acts much like the MongoDB find function or a SQL WHERE clause. To find all customers that live in Beverly Hills (or, more precisely, the 90210 zip code), we'll add a match stage to our aggregation pipeline:

> db.customers.aggregate([ 
  { $match: { "zip": 90210 }}
]);

This will return the array of customers that live in the 90210 zip code. Using the match stage in this way is no different from using the find method on a collection.
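For comparison, here's the equivalent find query, which returns the same documents:

> db.customers.find({ "zip": 90210 });

The difference is that aggregate can hand its results to further pipeline stages, while find cannot. Let's see what kind of insights we can gather by adding some stages to the pipeline.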

Grouping Documents

Once we've filtered out the documents we don't want, we can start grouping together the ones that we do into useful subsets. We can also use groups to perform operations across a common field in all documents, such as calculating the sum of a set of transactions and counting documents.

Before we dive into more complex operations, let's start with something simple: counting the documents we matched in the previous section:

> db.customers.aggregate([ 
  { $match: {"zip": "90210"}}, 
  { 
    $group: {
      _id: null, 
      count: {
        $sum: 1
      }
    }
  }
]);

The $group transformation allows us to group documents together and perform transformations or operations across all of those grouped documents. In this case, we're creating a new field in the results called count, which adds 1 to a running sum for every document matched. The _id field is required for grouping and normally contains the field or fields we'd like to group on (e.g. phoneNumber). Since we're just looking for a count across every document, we can set it to null here.

{ "_id" : null, "count" : 24 }

Here we saw the $sum operator, which adds up a value across all of the documents in a group. We can group our documents together on any fields we'd like and perform other types of computations as well.
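For example, to count customers per state rather than across the whole collection, we could group on the state field by referencing it in _id with a $ prefix (a sketch using the customer fields from earlier):

> db.customers.aggregate([
  { $group: { _id: "$state", count: { $sum: 1 }}}
]);

Each document in the result has a distinct state value as its _id, along with the number of customers in that state. Let's take a look at some of the other operators we can use and how we can use them.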

Gaining Insights with Sum, Min, Max, and Avg

Transactions are a great place to stretch our proverbial legs when it comes to mathematical operations. Let's analyze our transactions and see if we can gain some insights into our product catalog.

Our transaction model looks like this:

{
  "id": "1",
  "productId": "1",
  "customerId": "1",
  "amount": 20.00,
  "transactionDate": ISODate("2017-02-23T15:25:56.314Z")
}

Let's start by calculating the total amount of sales made for the month of January. We'll start by matching only transactions that occurred between January 1 and January 31.

{
   $match: {
    transactionDate: {
        $gte: ISODate("2017-01-01T00:00:00.000Z"),
        $lt: ISODate("2017-02-01T00:00:00.000Z")
    }
  }
}

The $gte and $lt operators are comparators that allow us to limit our dates to a specific range. In our case, we want the transactionDate to be greater than or equal to the first day of January and less than the first day of February. Notice also that we're using Zulu (UTC) time here; you'll want to make sure your transaction dates are stored consistently for this to work, and for this example we'll assume that all dates are stored in UTC.
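MongoDB stores dates internally as UTC, so the easiest way to stay consistent is to insert Date objects rather than pre-formatted strings. Here's a minimal sketch of inserting a transaction this way (the field values are illustrative):

> db.transactions.insert({
  "productId": "1",
  "customerId": "1",
  "amount": 20.00,
  "transactionDate": new Date()  // Date objects are stored as UTC milliseconds
});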

The next stage of the pipeline is summing the transaction amounts and putting that amount in a new field called total:

{
  $group: {
    _id: null, 
    total: {
      $sum: "$amount"
    }
  }
}

The final query looks something like this:

> db.transactions.aggregate([
  { 
    $match: {
      transactionDate: {
        $gte: ISODate("2017-01-01T00:00:00.000Z"),
        $lt: ISODate("2017-01-31T23:59:59.000Z")
      }    
    }
  }, {
    $group: {
      _id: null,
      total: {
        $sum: "$amount"
      }
    }
  }
]);

The final result is a total transaction amount that looks like the following:

{ _id: null, total: 20333.00 }

Some other helpful monthly metrics we might want are the average transaction amount, along with the minimum and maximum transactions for the month. Let's add those to the same group stage, combined with the $match stage from before, so we can get a single picture of the entire month:

> db.transactions.aggregate([
  { 
    $match: {
      transactionDate: {
        $gte: ISODate("2017-01-01T00:00:00.000Z"),
        $lt: ISODate("2017-01-31T23:59:59.000Z")
      }    
    }
  }, {
    $group: {
      _id: null,
      total: {
        $sum: "$amount"
      },
      average_transaction_amount: {
        $avg: "$amount"
      },
      min_transaction_amount: {
        $min: "$amount"
      },
      max_transaction_amount: {
        $max: "$amount"
      }
    }
  }
]);

Our final result gives us an interesting picture of what monthly sales looked like in our fictitious cookie shop:

{ 
  _id: null, 
  total: 20333.00, 
  average_transaction_amount: 8.50,
  min_transaction_amount: 2.99,
  max_transaction_amount: 347.22
}

Using these calculations, we can see that the average transaction of $8.50 sits far below the $347.22 maximum, which suggests that most transactions are small and the average is being skewed by a few outliers (bulk cookie orders). We could take this a step further by calculating the standard deviation across all transactions using the $stdDevPop operator, or modify our groupings to exclude outliers by changing our $match conditions.
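As a sketch, adding a population standard deviation to the same group stage would look like this ($stdDevPop is available in MongoDB 3.2 and later):

{
  $group: {
    _id: null,
    total: { $sum: "$amount" },
    stddev_transaction_amount: { $stdDevPop: "$amount" }
  }
}

A large standard deviation relative to the average would confirm that a few big orders are spreading out the distribution.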

Wrapping Up

Now that we know how to use the aggregation pipeline, we can start to combine groups, operations, and even other matching parameters to gain deep and detailed insights into any business. While there are plenty of other operations that weren't covered in this article, this should provide a good launching point for anyone interested in analyzing data stored in MongoDB.


If you have any feedback about this or any other Compose article, drop the Compose Articles team a line at articles@compose.com. We're happy to hear from you.
