Metrics Maven: Mode d'Emploi - Finding the Mode in PostgreSQL

Published

In our Metrics Maven series, Compose's data scientist shares database features, tips, tricks, and code you can use to get the metrics you need from your data. In this article, we'll have a look at mode to round out our series on mean, median, and mode.

Mode is the simplest to understand of the three metrics we've been looking at (mean, median, and mode) so we'll keep this article short and sweet and get straight to it. If you want to start with a review of mean or median before looking at mode, then have a look at Calculating a Mean or A Look at Median for a refresher.

For our examples in this article, we'll continue to use the orders data from our dog products catalog that we've used in the previous articles:

order_id | date       | item_count | order_value  
------------------------------------------------
50000    | 2016-09-02 | 3          | 35.97  
50001    | 2016-09-02 | 2          | 7.98  
50002    | 2016-09-02 | 1          | 5.99  
50003    | 2016-09-02 | 1          | 4.99  
50004    | 2016-09-02 | 7          | 78.93  
50005    | 2016-09-02 | 0          | (NULL)  
50006    | 2016-09-02 | 1          | 5.99  
50007    | 2016-09-02 | 2          | 19.98  
50008    | 2016-09-02 | 1          | 5.99  
50009    | 2016-09-02 | 2          | 12.98  
50010    | 2016-09-02 | 1          | 20.99  

Mode

The mode of a series is the most frequently occurring value. In some series this may indicate popularity. In others, it is an indication of commonality, more conspicuous than the average or the median.

For our use case, together with mean and median, the mode can help us really zero in on why we're seeing the results that we do from our hypothetical pet supply business.

Unlike median, for which we covered 4 different query options in our previous article, PostgreSQL offers a built-in function starting in the 9.4 version to find the mode in a series: MODE(). Let's dive right into some examples.

We'll start by finding the mode for item_count with this query:

SELECT MODE() WITHIN GROUP (ORDER BY item_count) AS item_count_mode  
FROM orders;  

As you can see, the syntax for MODE() looks a little awkward. You use the WITHIN GROUP (ORDER BY ...) clause to indicate the field you want to get the mode of. We encountered this clause when finding the median in option 4 of our previous article. This clause is used with the ordered set aggregates introduced in PostgreSQL 9.4, such as PERCENTILE_CONT and RANK. Once you start to use these aggregate functions, you'll easily get the hang of it.

Now back to what we were doing... Our result from the query above is 1. Orders from our dog products catalog contain only 1 item most frequently. That's disappointing for the business. Secretly, we'd hoped customers would buy whole product lines of items for their pooches!

Zeroes and NULLs

You may be wondering right about now how MODE() handles zeroes and NULLs since one of our orders has a "0" item count and a NULL order value. From our previous articles, we know that this is an important aspect to consider for obtaining the best metrics for the use case.

MODE() and the other ordered set aggregates ignore NULL values by default. That's good news because we determined previously that we should be ignoring orders that have a 0 item_count or a NULL order_value. Those would clearly be invalid orders. MODE() does not, however, ignore zeroes. In our case, it does not matter much since we have only one zero value in our orders, but if we didn't know that, we would actually want to write the query including a WHERE condition for the item_count to not be zero, like so:

SELECT MODE() WITHIN GROUP (ORDER BY item_count) AS item_count_mode  
FROM orders  
WHERE item_count <> 0;  

What mode can tell us about our business

Now that we've got the handling of zeroes and NULLs squared away, let's look at the mode for order_value to get more insight into orders:

SELECT MODE() WITHIN GROUP (ORDER BY order_value) AS order_value_mode  
FROM orders;  

The result we get back is $5.99.

Hmmmm.... these are pretty strong indicators for why our business isn't performing as well as we want it to be. Customers are most frequently only purchasing 1 item at a time with a value of $5.99. If we look back at the values we got from mean and median for each of these fields, the story becomes clearer with each metric:

Mean item count = 2.10  
Median item count = 1.5  
Mode item count = 1

Mean order value = $19.98  
Median order value = $10.48  
Mode order value = $5.99  

If we were relying on just the mean (or even the median) to get a sense of our business performance, we would have inadvertently been believing we were doing much better than we actually are. Now, in full recognition of the reality that our orders are not where we want them to be, we can take action. We might offer a discount for customers who purchase multiple items in one order or we might promote higher-priced items more strongly than lower-priced ones. Armed with these metrics, we can decide how to increase orders and improve our business.

Wrapping Up

This concludes our look at mean, median, and mode and why each of them are important metrics to get a handle on. As we've seen, they each provide a slightly different perspective on the data. By using all of them together, we can do a much better job of understanding how our business is doing (and then determining the actions we should take) than by using just one of them alone.

This also concludes 2016 for the Metrics Maven series! Join us next year as we go even deeper into metrics - how to calculate and apply them to get the most from your data. Until then, wishing you all happy holidays!

Image by: Peggy_Marco