Mongo Metrics: Finding a Happy Median

Published

In this second entry in our new "Mongo Metrics" series, we'll take a look at using the MongoDB aggregations pipeline to compute the MEDIAN of a set of data. Following up on our previous article on calculating the MEAN, we'll take a look at how computing the MEDIAN can help us reduce the effect of outliers on the metrics we gather from our data and get a sense of how much our MEAN is misleading us.

What's Up With Median?

One of the main weaknesses with the MEAN, or average, is its susceptibility to being skewed by a few values that lie far outside of where most of the other values are. These outliers on either the high or low side of our data can cause the MEAN to skew greatly in the direction of the outlier. To get a more accurate view of our data, we'll want to reduce the impact of these outliers.

There are a few ways to compute the MEDIAN, and all of them require your data to be in a ordered set. In this article, we'll start by removing all non-conforming data (e.g. NULL or empty values) and then sorting the remaining data in ascending order. By sorting the data, our outliers will be pushed to the edges of our data set, making the center elements more representative of what a typical data point might look like, so our MEDIAN calculation will return those elements.

Since the MEDIAN calculation requires us to sort our data set, it's not going to be as fast or efficient as the MEAN calculation. Also, as is the case with many database systems, MongoDB does not have a $median operator so we'll have to compute the MEDIAN ourselves. While many developers choose to compute the MEDIAN on the application layer, the aggregation pipeline in MongoDB can be used to perform this sort of calculation in the data layer as well.

Entering the MEDIAN

Before we get started, make sure you have a foundational understanding of the $match and $group operators in the MongoDB aggregation pipeline. If you need some background, you can check out our previous article on MongoDB aggregations by example.

In the general sense, there are a few different ways to compute the MEDIAN value of an ordered data set. For our purposes, we'll compute the MEDIAN by using the following steps:

  1. Count the number of items in the data set
  2. Sort the data from smallest to largest
  3. Find the middle elements of the data using the following algorithm:
    1. Divide the count of the elements in half (n / 2).
    2. Get the floor and ceiling of the value in step 1. We'll call these the high and low position.
      • If you have an even number of elements, this will be a positive integer so the floor and ceiling will return the same value. Add the values at the floor and floor + 1 positions together and divide the result by 2 to get the median.
      • If you have an odd number of elements, the floor and ceiling will return two different values. The value at the ceiling location will be our median value.

Finding the MEDIAN

Let's take a look at an example using the transactions data from a fictitious pet supply company used in our previous article on MEAN:

order_id | date       | item_count | order_value  
------------------------------------------------
50005    | 2016-09-02 | 0          | (NULL)  
50002    | 2016-09-02 | 1          | 5.99  
50003    | 2016-09-02 | 1          | 4.99  
50010    | 2016-09-02 | 1          | 20.99  
50006    | 2016-09-02 | 1          | 5.99  
50008    | 2016-09-02 | 1          | 5.99  
50009    | 2016-09-02 | 2          | 12.98  
50007    | 2016-09-02 | 2          | 19.98  
50001    | 2016-09-02 | 2          | 7.98  
50000    | 2016-09-02 | 3          | 35.97  
50004    | 2016-09-02 | 7          | 78.93  

That data will be contained in MongoDB as an array of documents that looks like the following:

{ 
  _id: ObjectID("589cd56b6ca2eef0f7737b0a"), 
  orderId: 50001, 
  date: ISODATE("2016-09-02"), 
  itemCount: 3, 
  orderValue: 35.97 
}

Now that we have some data to work with, we can begin analyzing that data. Analyzing data starts with figuring out what questions we're trying to answer. In this case, we'd like to know what the MEDIAN order value across all of our transactions.

First, we'll need to clean up our data. Looking at our transaction table, we'll see that one of our orders contains no items and has an orderValue of NULL. In MongoDB, this corresponds to a missing order_value field. We can start our aggregation pipeline by matching only the documents that have a value in the order_value field:

  {
    $match: {
      order_value: {
        $exists: true
      }
    }
  }

Note that you can use matching to filter out all kinds of information - feel free to use multiple matches if it makes sense to filter data out of your calculations.

Now that we've filtered out the documents we don't want to include, let's count the number of transactions that we do want to include. We can do this by using the group aggregation and creating a sum calculation, adding 1 for each document we encounter:

{ 
    $group: {
      _id: null, 
      count: { 
        $sum: 1 
      }, 
      values: { 
        $push: "$order_value" 
      } 
    } 
  }

Notice the values part of the document. This allows us to create an array called "values" which contains one entry for each transaction document we encounter. In particular, it pushes the order_value field from each transaction into the values array. We'll use the values array to pass the order values along the aggregation pipeline.

Running the aggregation at this stage results in something like this:

{ 
  "_id" : null, 
  "count" : 9, 
  "values" : [ 
     5.99, 
     4.99, 
     20.99, 
     5.99, 
     5.99, 
     12.98, 
     19.98, 
     35.97, 
     78.93 
] }

Our next step is to sort the values in order, but we have a problem. MongoDB comes built-in with a very handy $sort operator that we can use to sort data. However, that $sort operator can only be used on documents in a result set, not on values inside of an array. Basically, we need to turn our current single-document result into an array of documents, one for each value. Luckily for us, MongoDB has the $unwind operator which lets us do exactly that. Using $unwind, we can take the values in an array and create one document from each of them:

  { 
    "$unwind": "$values" 
  }

Unwinding will reverse the grouping instruction by expanding our array out into multiple results, but it will also preserve our new count field by including it in every new document.

BEFORE the unwind, our data looks like the following:

{ 
  "_id" : null, 
  "count" : 9, 
  "values" : [ 
    5.99, 
    4.99, 
    20.99, 
    5.99, 
    5.99, 
    12.98, 
    19.98, 
    35.97, 
    78.93 
 ] }

After unwinding, it has the following format:

{ "_id" : null, "count" : 9, "values" : 5.99 }
{ "_id" : null, "count" : 9, "values" : 4.99 }
{ "_id" : null, "count" : 9, "values" : 20.99 }
{ "_id" : null, "count" : 9, "values" : 5.99 }
{ "_id" : null, "count" : 9, "values" : 5.99 }
{ "_id" : null, "count" : 9, "values" : 12.98 }
{ "_id" : null, "count" : 9, "values" : 19.98 }
{ "_id" : null, "count" : 9, "values" : 35.97 }
{ "_id" : null, "count" : 9, "values" : 78.93 }

This allows us to go back to using the MongoDB aggregations operators on our data.

Now that we have our data back in a usable format, let's sort the data across all documents using the values field:

  { 
    "$sort": { 
      values: 1
    } 
  }

Sorting using MongoDB aggregations is pretty straight-forward - we pass the name of the field as the key to the $sort operator, along with a positive value for a ascending sort and a negative value for a descending sort.

Whoooaaaa - We're Halfway There

At this point, we now have a count of our documents and have them in sorted order, which are the first and second steps of our MEDIAN calculation. Now, we're on to the third step which is computing the high and low midpoints, retrieving the elements there, and averaging them together.

At this stage, we're going to use a new operator that we haven't seen yet in this series: $project. The $project operator creates a projection, which is a custom data structure that can be passed down the aggregation pipeline. These projections are very versatile: they can contain data computed using aggregation operators, results from previous stages in the pipeline, and individual data fields from a collection. We'll use $project to gradually build up the information we'll need to compute our MEDIAN.

The first projection will calculate the midpoint of the data set using the $divide operator:

  {
    $project: { 
      "count": 1, 
      "values": 1, 
      "midpoint": { 
        $divide: [
          "$count", 
          2 
        ] 
      }
    }
  }

The first two keys have a value of 1, which has the effect of preserving the count and values fields from the previous step. Next, the midpoint is calculated by dividing the count by 2 using the $divide operator.

The next projection step will compute those two indices by using the $floor operator for the low-side index and the $ceil operator for the high-side index. This will return two different numbers if our original count was odd, but a single number (the MEDIAN index) if the count was even.

{
    $project: {
      "count": 1,
      "values": 1,
      "midpoint": 1,
      "high": { 
        $ceil: "$midpoint"
      },
      "low": {
        $floor: "$midpoint"
      }
    }
  }

Running the aggregation so far will give us results that look like the following:

{ "_id" : null, "count" : 9, "values" : 4.99, "midpoint" : 4.5, "high" : 5, "low" : 4 }
{ "_id" : null, "count" : 9, "values" : 5.99, "midpoint" : 4.5, "high" : 5, "low" : 4 }
{ "_id" : null, "count" : 9, "values" : 5.99, "midpoint" : 4.5, "high" : 5, "low" : 4 }
{ "_id" : null, "count" : 9, "values" : 5.99, "midpoint" : 4.5, "high" : 5, "low" : 4 }
{ "_id" : null, "count" : 9, "values" : 12.98, "midpoint" : 4.5, "high" : 5, "low" : 4 }
{ "_id" : null, "count" : 9, "values" : 19.98, "midpoint" : 4.5, "high" : 5, "low" : 4 }
{ "_id" : null, "count" : 9, "values" : 20.99, "midpoint" : 4.5, "high" : 5, "low" : 4 }
{ "_id" : null, "count" : 9, "values" : 35.97, "midpoint" : 4.5, "high" : 5, "low" : 4 }
{ "_id" : null, "count" : 9, "values" : 78.93, "midpoint" : 4.5, "high" : 5, "low" : 4 }

We now know what the midpoint, as well as the high and low indices of our MEDIAN items will be. However, the high, low, and midpoint are scattered throughout multiple documents. To move onto the next step, we'd like to condense all of these into a single document containing the high-side index, the low-side index, and the values. Since we have those items already, we no longer need to preserve the midpoint.

We can bring all of these values into a single document using the $group operator.

{ 
    $group: {
      _id: null,
      values: {
        $push: "$values"
      }, 
      high: {
        $avg: "$high"
      },
      low: {
        $avg: "$low"
      }
    }
  }

The $avg operator on the high and low values aren't meant to do anything productive at this point - it just allows us to collapse our high and low indices across all documents into a single result once the grouping is done. You can see that by taking a peek at the data after this operation is added to the pipeline:

{ 
  "_id" : null, 
  "values" : [ 
    4.99, 
    5.99, 
    5.99, 
    5.99, 
    12.98, 
    19.98, 
    20.99, 
    35.97, 
    78.93
  ], 
  "high" : 5, 
  "low" : 4 
 }

Notice that the high and low indices contain just a single value now.

The Home Stretch

We're almost there - just a couple more steps.

Now that we have the high and low indices for the MEDIAN values, as well as a sorted array containing those values, we can retrieve the middle items out of the array. We'll do this by creating a projection for the two middle values, calling them beginValue and endValue:

{
    $project: {
      "beginValue": {
        "$arrayElemAt": ["$values" , "$high"]
        } ,
      "endValue": {
         "$arrayElemAt": ["$values" , "$low"]
      }
    }
  }

Now we finally have our MEDIAN values:

{ "_id" : null, "beginValue" : 19.98, "endValue" : 12.98 }

We don't really need to provide any conditional logic here; this will retrieve two different elements if our original data set count was odd or the same element if our original data set was even. If the elements are different, we can average them together to get the MEDIAN value and if the elements are the same, the average acts as an identity function that ends up returning the correct value anyway:

{
    $project: {
      "median": {
        "$avg": ["$beginValue" , "$endValue"]
      }      
    }
  }

Our final MEDIAN value has now been computed, and since we're no longer interested in preserving any of the original data in this query, we return only the median in our projection.

The Final Result

Phew - what a ride!

After everything is all said and done, the final MongoDB aggregation looks like the following:

> db.transactions.aggregate([ 
  {
    $match: {
      order_value: {
        $exists: true
      }
    }
  },
  { 
    $group: {
      _id: null, 
      count: { 
        $sum: 1 
      }, 
      values: { 
        $push: "$order_value" 
      } 
    } 
  }, 
  { 
    "$unwind": "$values" 
  },
  { 
    "$sort": { 
      values: 1
    } 
  }, 
  {
    $project: { 
      "count": 1, 
      "values": 1, 
      "midpoint": { 
        $divide: [
          "$count", 
          2 
        ] 
      }
    }
  }, {
    $project: {
      "count": 1,
      "values": 1,
      "midpoint": 1,
      "high": { 
        $ceil: "$midpoint"
      },
      "low": {
        $floor: "$midpoint"
      }
    }
  },
  { 
    $group: {
      _id: null,
      values: {
        $push: "$values"
      }, 
      high: {
        $avg: "$high"
      },
      low: {
        $avg: "$low"
      }
    }
  }, {
    $project: {
      "beginValue": {
        "$arrayElemAt": ["$values" , "$high"]
        } ,
      "endValue": {
         "$arrayElemAt": ["$values" , "$low"]
      }
    }
  }, {
    $project: {
      "median": {
        "$avg": ["$beginValue" , "$endValue"]
      }      
    }
  }
]);

You can use this solution on a data set with either odd or even counts and it will return the correct MEDIAN values.

Exit Stage Left

The MEDIAN calculation is a tricky one that requires more steps than the MEAN to compute, but MEDIAN provides a valuable benchmark upon which we can measure the accuracy of our MEAN. The MongoDB aggregations pipeline proves its power and versatility by making the MEDIAN calculation possible. There's still one more classic metric we can use, MODE, and in the next article in this series, we'll explore how to use the MongoDB aggregations pipeline to compute it.


If you have any feedback about this or any other Compose article, drop the Compose Articles team a line at articles@compose.com. We're happy to hear from you.

attributionLuis Llerena

John O'Connor
John O'Connor is a code junky, educator, and amateur dad that loves letting the smoke out of gadgets, turning caffeine into code, and writing about it all. Love this article? Head over to John O'Connor’s author page and keep reading.