## Mongo Metrics: Calculating the Mode

Published*In this third entry in our Mongo Metrics series, we'll round out the "top 3" classical analytics methods by taking a look at mode. Check out our previous articles in this series to learn more about computing and using the mean and median in MongoDB.*

We've seen how the mean and median can provide us different perspectives on what a "typical" order might look like. We've also seen how it's important to have multiple different "angles" on your data to gain a full understanding of what a typical order might look like. Now, let's take a look at one more angle: the mode.

# What's in a Mode?

Mode is one of the simpler of the classic methods to understand. Simply put, the *mode* is the most common item, or the one occurring most frequently, in a set of data. Unlike the *mean* or *median*, we may not necessarily obtain a useful result with mode. For example, if all of the items in our dataset are encountered exactly once, then *mode* won't give us a useful result.

Let's take a look at a data set where the *mode* has great value: determining which products or price points are popular. *Mode* is great for this because stores will often price many items at the same price points. By analyzing how well products do at various price points, stores can determine more efficient pricing and improve their overall sales.

For this example, we'll borrow the pet store product catalog from our Metrics Maven's article on *mode* in PostgreSQL:

```
order_id | date | item_count | order_value
------------------------------------------------
50000 | 2016-09-02 | 3 | 35.97
50001 | 2016-09-02 | 2 | 7.98
50002 | 2016-09-02 | 1 | 5.99
50003 | 2016-09-02 | 1 | 4.99
50004 | 2016-09-02 | 7 | 78.93
50005 | 2016-09-02 | 0 | (NULL)
50006 | 2016-09-02 | 1 | 5.99
50007 | 2016-09-02 | 2 | 19.98
50008 | 2016-09-02 | 1 | 5.99
50009 | 2016-09-02 | 2 | 12.98
50010 | 2016-09-02 | 1 | 20.99
```

Which, stored in JSON format in MongoDB, looks like the following:

```
{ "_id" : ObjectId("58db58313b9bbe23a46e91af"), "order_id" : 50005, "date" : ISODate("2016-09-02T00:00:00Z"), "item_count" : 0 }
{ "_id" : ObjectId("58db58873b9bbe21cb6e91b1"), "order_id" : 50002, "date" : ISODate("2016-09-02T00:00:00Z"), "item_count" : 1, "order_value" : 5.99 }
{ "_id" : ObjectId("58db58d33b9bbe1f886e91b0"), "order_id" : 50003, "date" : ISODate("2016-09-02T00:00:00Z"), "item_count" : 1, "order_value" : 4.99 }
{ "_id" : ObjectId("58db58fc3b9bbe21cb6e91b2"), "order_id" : 50010, "date" : ISODate("2016-09-02T00:00:00Z"), "item_count" : 1, "order_value" : 20.99 }
{ "_id" : ObjectId("58db591d3b9bbe21cb6e91b3"), "order_id" : 50006, "date" : ISODate("2016-09-02T00:00:00Z"), "item_count" : 1, "order_value" : 5.99 }
{ "_id" : ObjectId("58db59403b9bbe21cb6e91b4"), "order_id" : 50008, "date" : ISODate("2016-09-02T00:00:00Z"), "item_count" : 1, "order_value" : 5.99 }
{ "_id" : ObjectId("58db596a3b9bbe21cb6e91b5"), "order_id" : 50009, "date" : ISODate("2016-09-02T00:00:00Z"), "item_count" : 2, "order_value" : 12.98 }
{ "_id" : ObjectId("58db598c3b9bbe240a6e91ae"), "order_id" : 50007, "date" : ISODate("2016-09-02T00:00:00Z"), "item_count" : 2, "order_value" : 19.98 }
{ "_id" : ObjectId("58db59ac3b9bbec65f6e91c0"), "order_id" : 50000, "date" : ISODate("2016-09-02T00:00:00Z"), "item_count" : 3, "order_value" : 35.97 }
{ "_id" : ObjectId("58db59d43b9bbe21cb6e91b6"), "order_id" : 50004, "date" : ISODate("2016-09-02T00:00:00Z"), "item_count" : 7, "order_value" : 78.93 }
```

Unlike PostgreSQL, MongoDB doesn't have a `MODE`

keyword so we'll have to compute it ourselves. Luckily, the MongoDB aggregations pipeline comes to the rescue yet again. Let's take a look at how we can use it to compute the *mode* of our data set.

# Getting in the Mode

*Before we get started, make sure you have a foundational understanding of the $match and $group operators in the MongoDB aggregation pipeline. If you need some background, you can check out our previous article on MongoDB aggregations by example.*

For our first step, we need to figure out which fields we want to calculate the *mode* on. Let's start by getting the *mode* of the *order_value* field so we can get a better picture of what a typical order value might be.

Mode is calculated by grouping the data in the data set together based on *order_value*, counting the number of items in each group, and finding the group with the highest count. We can do that using the `$group`

and `$sum`

aggregation operators and filtering out any `NULL`

or invalid fields by first running it through a `$match`

operation. Then, we'll sort the results in descending order using the `$sort`

aggregation operator. Finally, we'll return only the first document in the sort by using the `$limit`

operator.

Our aggregation starts with the `$match`

operator to filter out our `NULL`

values:

```
{
$match: {
order_value: {
$exists: true
}
}
}
```

Next, let's run our `$group`

query to group all of our order values into distinct groups. We'll also need to count the number of times an *order_value* occurs so we can sort it later. We can do that all in one shot with the following query:

```
{
$group: {
_id: "$order_value",
count: {
$sum: 1
}
}
}
```

Once this stage of the pipeline is reached, our data should now look like the following:

```
{ "_id" : 78.93, "count" : 1 }
{ "_id" : 35.97, "count" : 1 }
{ "_id" : 19.98, "count" : 1 }
{ "_id" : 20.99, "count" : 1 }
{ "_id" : 12.98, "count" : 1 }
{ "_id" : 4.99, "count" : 1 }
{ "_id" : 5.99, "count" : 3 }
```

The last step now is to find the *order_value*s with the maximum count. There are a few ways we can do this, but one of the simplest is to sort the data by the *count* field and then just return the top result.

First, let's sort the data using the `$sort`

aggregation. We'll sort on the *count* field, and sort in descending order:

```
{
$sort: {
"count": -1
}
}
```

This should give us the following result:

```
{ "_id" : 5.99, "count" : 3 }
{ "_id" : 78.93, "count" : 1 }
{ "_id" : 35.97, "count" : 1 }
{ "_id" : 19.98, "count" : 1 }
{ "_id" : 20.99, "count" : 1 }
{ "_id" : 12.98, "count" : 1 }
{ "_id" : 4.99, "count" : 1 }
```

Finally, we'll use the `$limit`

aggregation to simply limit the return values to only the first one:

```
{ $limit: 1 }
```

This should return only the first document that we matched:

```
{ "_id" : 5.99, "count" : 3 }
```

And there's our mode - our most common order is one with a value of $5.99, and it was encountered 3 times. Our completed query looks like the following:

```
> db.transactions.aggregate([
{
$match: {
order_value: { $exists: true }
}
}, {
$group: {
_id: "$order_value",
count: { $sum: 1 }
}
}, {
$sort: { "count": -1}
} , {
$limit: 1
}
])
```

We can also calculate the mode for the number of items purchase in a transaction by performing the same calculation on the *item_count* field:

```
> db.transactions.aggregate([
{
$match: {
item_count: { $exists: true }
}
}, {
$group: {
_id: "$item_count",
count: { $sum: 1 }
}
}, {
$sort: { "count": -1}
} , {
$limit: 1
}
])
```

Which gives us the following:

```
{ "_id" : 1, "count" : 5 }
```

This means that the most common number of items in a transaction is 1, and it occured in 5 transactions.

# Why Should I Care?

That's a great question - with all of those wonderful metrics out there, why should you care about the *mode*? Like always, it comes down to giving you a different perspective on your data. You can read an excellent writeup about the differences in the Metrics Maven article on Mode, and the following table from that article is perhaps the most insightful:

```
Mean item count = 2.10
Median item count = 1.5
Mode item count = 1
Mean order value = $19.98
Median order value = $10.48
Mode order value = $5.99
```

When you have data that's likely to repeat itself (ie: repeated transactions), the mode can show you details that mean and median don't. If we expected the *mean* or even the *median* to help us determine what to expect from a typical order, we might be very surprised when our projections were substantially off. Our median order value of $10.48 is almost double what our most frequent order price actually is. Mean here is almost completely useless as it is heavily skewed by a few outliers.

# Wrapping It Up

While these are certainly not the only way to compute metrics across our MongoDB data sets, the three "classic" statistical methods are a great starting point for analyzing your data. We'll continue this series at a later date by exploring more ways we can analyze and gain insights from our MongoDB data.

If you have any feedback about this or any other Compose article, drop the Compose Articles team a line at articles@compose.com. We're happy to hear from you.