Mongo Metrics: Calculating the Mode

Published

In this third entry in our Mongo Metrics series, we'll round out the "top 3" classical analytics methods by taking a look at mode. Check out our previous articles in this series to learn more about computing and using the mean and median in MongoDB.

We've seen how the mean and median can provide us different perspectives on what a "typical" order might look like. We've also seen how it's important to have multiple different "angles" on your data to gain a full understanding of what a typical order might look like. Now, let's take a look at one more angle: the mode.

What's in a Mode?

Mode is one of the simpler of the classic methods to understand. Simply put, the mode is the most common item, or the one occurring most frequently, in a set of data. Unlike the mean or median, we may not necessarily obtain a useful result with mode. For example, if all of the items in our dataset are encountered exactly once, then mode won't give us a useful result.

Let's take a look at a data set where the mode has great value: determining which products or price points are popular. Mode is great for this because stores will often price many items at the same price points. By analyzing how well products do at various price points, stores can determine more efficient pricing and improve their overall sales.

For this example, we'll borrow the pet store product catalog from our Metrics Maven's article on mode in PostgreSQL:

order_id | date       | item_count | order_value  
------------------------------------------------
50000    | 2016-09-02 | 3          | 35.97  
50001    | 2016-09-02 | 2          | 7.98  
50002    | 2016-09-02 | 1          | 5.99  
50003    | 2016-09-02 | 1          | 4.99  
50004    | 2016-09-02 | 7          | 78.93  
50005    | 2016-09-02 | 0          | (NULL)  
50006    | 2016-09-02 | 1          | 5.99  
50007    | 2016-09-02 | 2          | 19.98  
50008    | 2016-09-02 | 1          | 5.99  
50009    | 2016-09-02 | 2          | 12.98  
50010    | 2016-09-02 | 1          | 20.99  

Which, stored in JSON format in MongoDB, looks like the following:

{ "_id" : ObjectId("58db58313b9bbe23a46e91af"), "order_id" : 50005, "date" : ISODate("2016-09-02T00:00:00Z"), "item_count" : 0 }
{ "_id" : ObjectId("58db58873b9bbe21cb6e91b1"), "order_id" : 50002, "date" : ISODate("2016-09-02T00:00:00Z"), "item_count" : 1, "order_value" : 5.99 }
{ "_id" : ObjectId("58db58d33b9bbe1f886e91b0"), "order_id" : 50003, "date" : ISODate("2016-09-02T00:00:00Z"), "item_count" : 1, "order_value" : 4.99 }
{ "_id" : ObjectId("58db58fc3b9bbe21cb6e91b2"), "order_id" : 50010, "date" : ISODate("2016-09-02T00:00:00Z"), "item_count" : 1, "order_value" : 20.99 }
{ "_id" : ObjectId("58db591d3b9bbe21cb6e91b3"), "order_id" : 50006, "date" : ISODate("2016-09-02T00:00:00Z"), "item_count" : 1, "order_value" : 5.99 }
{ "_id" : ObjectId("58db59403b9bbe21cb6e91b4"), "order_id" : 50008, "date" : ISODate("2016-09-02T00:00:00Z"), "item_count" : 1, "order_value" : 5.99 }
{ "_id" : ObjectId("58db596a3b9bbe21cb6e91b5"), "order_id" : 50009, "date" : ISODate("2016-09-02T00:00:00Z"), "item_count" : 2, "order_value" : 12.98 }
{ "_id" : ObjectId("58db598c3b9bbe240a6e91ae"), "order_id" : 50007, "date" : ISODate("2016-09-02T00:00:00Z"), "item_count" : 2, "order_value" : 19.98 }
{ "_id" : ObjectId("58db59ac3b9bbec65f6e91c0"), "order_id" : 50000, "date" : ISODate("2016-09-02T00:00:00Z"), "item_count" : 3, "order_value" : 35.97 }
{ "_id" : ObjectId("58db59d43b9bbe21cb6e91b6"), "order_id" : 50004, "date" : ISODate("2016-09-02T00:00:00Z"), "item_count" : 7, "order_value" : 78.93 }

Unlike PostgreSQL, MongoDB doesn't have a MODE keyword so we'll have to compute it ourselves. Luckily, the MongoDB aggregations pipeline comes to the rescue yet again. Let's take a look at how we can use it to compute the mode of our data set.

Getting in the Mode

Before we get started, make sure you have a foundational understanding of the $match and $group operators in the MongoDB aggregation pipeline. If you need some background, you can check out our previous article on MongoDB aggregations by example.

For our first step, we need to figure out which fields we want to calculate the mode on. Let's start by getting the mode of the order_value field so we can get a better picture of what a typical order value might be.

Mode is calculated by grouping the data in the data set together based on order_value, counting the number of items in each group, and finding the group with the highest count. We can do that using the $group and $sum aggregation operators and filtering out any NULL or invalid fields by first running it through a $match operation. Then, we'll sort the results in descending order using the $sort aggregation operator. Finally, we'll return only the first document in the sort by using the $limit operator.

Our aggregation starts with the $match operator to filter out our NULL values:

  {
    $match: {
      order_value: {
        $exists: true
      }
    }
  }

Next, let's run our $group query to group all of our order values into distinct groups. We'll also need to count the number of times an order_value occurs so we can sort it later. We can do that all in one shot with the following query:

{ 
  $group: { 
    _id: "$order_value", 
    count: { 
      $sum: 1 
    }
  }
}

Once this stage of the pipeline is reached, our data should now look like the following:

{ "_id" : 78.93, "count" : 1 }
{ "_id" : 35.97, "count" : 1 }
{ "_id" : 19.98, "count" : 1 }
{ "_id" : 20.99, "count" : 1 }
{ "_id" : 12.98, "count" : 1 }
{ "_id" : 4.99, "count" : 1 }
{ "_id" : 5.99, "count" : 3 }

The last step now is to find the order_values with the maximum count. There are a few ways we can do this, but one of the simplest is to sort the data by the count field and then just return the top result.

First, let's sort the data using the $sort aggregation. We'll sort on the count field, and sort in descending order:

{ 
  $sort: { 
    "count": -1
  }
}  

This should give us the following result:

{ "_id" : 5.99, "count" : 3 }
{ "_id" : 78.93, "count" : 1 }
{ "_id" : 35.97, "count" : 1 }
{ "_id" : 19.98, "count" : 1 }
{ "_id" : 20.99, "count" : 1 }
{ "_id" : 12.98, "count" : 1 }
{ "_id" : 4.99, "count" : 1 }

Finally, we'll use the $limit aggregation to simply limit the return values to only the first one:

{ $limit: 1 }

This should return only the first document that we matched:

{ "_id" : 5.99, "count" : 3 }

And there's our mode - our most common order is one with a value of $5.99, and it was encountered 3 times. Our completed query looks like the following:

> db.transactions.aggregate([
  { 
    $match: { 
      order_value: { $exists: true } 
    }
  }, { 
    $group: { 
      _id: "$order_value", 
      count: { $sum: 1 } 
    } 
  }, { 
    $sort: { "count": -1} 
  } , { 
    $limit: 1 
  } 
])

We can also calculate the mode for the number of items purchase in a transaction by performing the same calculation on the item_count field:

> db.transactions.aggregate([
  { 
    $match: { 
      item_count: { $exists: true } 
    }
  }, { 
    $group: { 
      _id: "$item_count", 
      count: { $sum: 1 } 
    } 
  }, { 
    $sort: { "count": -1} 
  } , { 
    $limit: 1 
  } 
])

Which gives us the following:

{ "_id" : 1, "count" : 5 }

This means that the most common number of items in a transaction is 1, and it occured in 5 transactions.

Why Should I Care?

That's a great question - with all of those wonderful metrics out there, why should you care about the mode? Like always, it comes down to giving you a different perspective on your data. You can read an excellent writeup about the differences in the Metrics Maven article on Mode, and the following table from that article is perhaps the most insightful:

Mean item count = 2.10  
Median item count = 1.5  
Mode item count = 1

Mean order value = $19.98  
Median order value = $10.48  
Mode order value = $5.99  

When you have data that's likely to repeat itself (ie: repeated transactions), the mode can show you details that mean and median don't. If we expected the mean or even the median to help us determine what to expect from a typical order, we might be very surprised when our projections were substantially off. Our median order value of $10.48 is almost double what our most frequent order price actually is. Mean here is almost completely useless as it is heavily skewed by a few outliers.

Wrapping It Up

While these are certainly not the only way to compute metrics across our MongoDB data sets, the three "classic" statistical methods are a great starting point for analyzing your data. We'll continue this series at a later date by exploring more ways we can analyze and gain insights from our MongoDB data.


If you have any feedback about this or any other Compose article, drop the Compose Articles team a line at articles@compose.com. We're happy to hear from you.

John O'Connor
John O'Connor is a code junky, educator, and amateur dad that loves letting the smoke out of gadgets, turning caffeine into code, and writing about it all. Love this article? Head over to John O'Connor’s author page to keep reading.

Conquer the Data Layer

Spend your time developing apps, not managing databases.