In our Metrics Maven series, Compose's data scientist shares database features, tips, tricks, and code you can use to get the metrics you need from your data. Over the next few articles, we'll get cozy with mean, median, and mode in PostgreSQL.
Mean, median and mode - these three metrics can provide valuable insights from your data. Over the next few articles we'll get deeply familiar with each of them and also highlight some key considerations for understanding what's really contributing to the numbers we report.
For our examples, we'll use the following orders data based on the dog products catalog we used in our previous article on crosstab:
order_id | item_count | order_value ------------------------------------------------ 50000 | 3 | 35.97 50001 | 2 | 7.98 50002 | 1 | 5.99 50003 | 1 | 4.99 50004 | 7 | 78.93 50005 | 0 | (NULL) 50006 | 1 | 5.99 50007 | 2 | 19.98 50008 | 1 | 5.99 50009 | 2 | 12.98 50010 | 1 | 20.99
You're probably already familiar with how to calculate a mean, which usually goes by its more common name — average. Here's a quick refresher, though... To find the mean of a set of numbers, you add them all up and then divide by the count. PostgreSQL provides the
AVG aggregate function to compute this metric for you. We've used the
AVG function a few times for various examples in other articles. In the example below, you'll see our use of the
ROUND function with
AVG, which we covered in our Making Data Pretty article:
SELECT ROUND(AVG(item_count),2) AS avg_items, ROUND(AVG(order_value),2) AS avg_value FROM orders;
The mean values (a.k.a. "averages") we get back look like this:
avg_items | avg_value ---------------------- 1.91 | 19.98
Now, in looking at the data set for this query, you may have noticed that order 50005 has 0 items and has a NULL order value, so you may be wondering how it's treated in the average calculation. Let's look.
Nulls and Zeroes
Understanding how NULL values and 0 values are handled is something to be aware of when reporting metrics so that you can make sure you're getting the numbers you expect.
Let's look at the avg_items and avg_value metrics we calculated above. In the case of the avg_items, the average is calculated using all 11 values (0 counts as a value by default). The
AVG function is performing this calculation:
-- avg_items calculation (3 + 2 + 1 + 1 + 7 + 0 + 1 + 2 + 1 + 2 + 1) / 11 = 1.91
In the case of avg_value, however, only 10 values are used in the calculation since NULL values are ignored. That calculation looks like this:
-- avg_value calculation (35.97 + 7.98 + 5.99 + 4.99 + 78.93 + 5.99 + 19.98 + 5.99 + 12.98 + 20.99) / 10 = 19.98
To alter this default behavior (either to ignore 0 values or to include NULL values), we can add some conditions to act on the
AVG function. In this next example, we've chosen to ignore the 0 value for avg_items and to include the NULL value for the avg_value:
SELECT ROUND(AVG(NULLIF(item_count,0)),2) AS avg_items, ROUND(AVG(CASE WHEN order_value IS NULL THEN 0 ELSE order_value END),2) AS avg_value FROM orders;
In the above query, for avg_items, we're using
NULLIF to set 0 values to NULL for the calculation so that they will be ignored by the default behavior. For avg_value, we're using a
CASE WHEN conditional to convert NULL values to 0 so that they will be included by the default behavior. Here's our new result:
avg_items | avg_value ---------------------- 2.10 | 18.16
What we actually want for our report, however, is for any orders with 0 or NULL values to be ignored so our query will end up looking like this:
SELECT ROUND(AVG(NULLIF(item_count,0)),2) AS avg_items, ROUND(AVG(order_value),2) AS avg_value FROM orders;
And we'll have the following values for our final report:
avg_items | avg_value ---------------------- 2.10 | 19.98
Since we've also got quite a few orders with the same item counts and order values, we may want to get an idea of what the means would be if we average only the unique values instead.
By adding a
DISTINCT clause to our query, we can calculate the mean using only the unique values. Just as easy as this:
SELECT ROUND(AVG(DISTINCT NULLIF(item_count,0)),2) AS avg_items, ROUND(AVG(DISTINCT order_value),2) AS avg_value FROM orders;
Our result shows the following:
avg_items | avg_value ---------------------- 3.25 | 23.48
As we can see, the values are quite a bit higher when we remove duplicate values for item count and order value. These would be the values we might expect if our orders were spread more evenly. What this tells us is that our orders tend more toward the lower end of the range for both the item count and the order value, which brings our average values down in our final report. This indicates that we're probably not getting the coverage across our catalog that we would want. If we saw lower values here than we got from our final report, then we'd know that our orders tended more toward the higher end. It's always good to get a sense of whether your data is on the lower or higher end so that your business can decide how (or if) it wants to try to shift that data in the future. In this case, if we wanted to shift our orders more toward the higher end, we might do some kind of promotion to increase the number of items purchased per order and focus an advertising campaign on higher-priced items.
In this article, we've taken the commonly-used
AVG function and explored it a little more deeply to get an understanding of how NULL and 0 values are handled (and how the default behavior can be altered). We also touched on using
DISTINCT to get a sense of whether our data tended to be more towards the higher end or lower end of our spectrum. In future articles, we'll get into cumulative and weighted averages, but before we go there, our next article will take a close look at how to calculate a median and what it can tell us about our data. In the meantime, you may also be interested in learning about how to calculate a moving average.