In our Metrics Maven series, Compose's data scientist shares database features, tips, tricks, and code you can use to get the metrics you need from your data. In this article we'll cozy-up to calculating a median in PostgreSQL.
To set the stage for median, in our previous article we learned about calculating a mean -- what the default settings are for the
AVG function, how to change those as appropriate for our analytic needs, and using
AVG to get a sense of whether our data was on the higher end or the lower end. In this article we're going to build on what we previously learned by turning our attention to median. Calculating a median can be a bit trickier than it sounds, especially since there's not a built-in function for it, so we'll look at some different options for the calculation so that you can get familiar with them and choose the one that works best for your situation.
The median is the middle value in an ordered series. If there are an odd number of values in the ordered series, then the middle value is the median. If there are an even number of values, then the mean of the two middle values is the median.
The median value can be a more representative metric than the mean if the data contains outliers that may pull the data one way versus another. Finding the middle shows you just how much your mean value may be misleading you. It's always good to look at the median and mean together with the mode (which we'll look at in our next article) to get the best sense of your data. Companies that rely only on averages for their metrics are walking a dangerous path if they're basing key decisions on those averages. Incorporating median and mode metrics along with mean can provide a much more solid foundation for decision-making.
Carrying on from our previous article on mean, we'll use the orders data we used there for our examples in this article as well. Before we get into our SQL query options for the median calculation, let's go through the data and eyeball what the median should be so that we get a handle on a couple considerations we'll need to make.
In our orders table, we have an odd number of rows so we can identify the middle row for the median value. First we need to order the values we want to consider, however. Let's start with finding the median item count.
Here's our orders table ordered by item_count:
order_id | date | item_count | order_value ------------------------------------------------ 50005 | 2016-09-02 | 0 | (NULL) 50002 | 2016-09-02 | 1 | 5.99 50003 | 2016-09-02 | 1 | 4.99 50010 | 2016-09-02 | 1 | 20.99 50006 | 2016-09-02 | 1 | 5.99 50008 | 2016-09-02 | 1 | 5.99 50009 | 2016-09-02 | 2 | 12.98 50007 | 2016-09-02 | 2 | 19.98 50001 | 2016-09-02 | 2 | 7.98 50000 | 2016-09-02 | 3 | 35.97 50004 | 2016-09-02 | 7 | 78.93
The middle row is order_id 50008 so our median value for the item_count is 1. The 0 item_count value in the first row is a valid value for the median calculation. If we wanted to exclude it, we'd need to add logic to our calculation to include values only greater than 0. We'll actually do that in just a little bit.
Let's find the median for the order_value now:
order_id | date | item_count | order_value ------------------------------------------------ 50003 | 2016-09-02 | 1 | 4.99 50006 | 2016-09-02 | 1 | 5.99 50002 | 2016-09-02 | 1 | 5.99 50008 | 2016-09-02 | 1 | 5.99 50001 | 2016-09-02 | 2 | 7.98 50009 | 2016-09-02 | 2 | 12.98 50007 | 2016-09-02 | 2 | 19.98 50010 | 2016-09-02 | 1 | 20.99 50000 | 2016-09-02 | 3 | 35.97 50004 | 2016-09-02 | 7 | 78.93 50005 | 2016-09-02 | 0 | (NULL)
Here the middle row is order_id 50009 so the median order_value is $12.98.
But wait! What about that NULL value sorted to the bottom there? It will be considered part of the series unless we explicitly exclude NULL values in our calculation or unless our calculation uses a function that ignores NULL values by default. Since in our case we do want to exclude it, we'll now have an even number of rows:
order_id | date | item_count | order_value ------------------------------------------------ 50003 | 2016-09-02 | 1 | 4.99 50006 | 2016-09-02 | 1 | 5.99 50002 | 2016-09-02 | 1 | 5.99 50008 | 2016-09-02 | 1 | 5.99 50001 | 2016-09-02 | 2 | 7.98 50009 | 2016-09-02 | 2 | 12.98 50007 | 2016-09-02 | 2 | 19.98 50010 | 2016-09-02 | 1 | 20.99 50000 | 2016-09-02 | 3 | 35.97 50004 | 2016-09-02 | 7 | 78.93
Now we've got two middle rows, order_id 50001 and order_id 50009. We'll need to find the mean of the two order values:
-- average the two middle values (7.98 + 12.98) / 2 = 10.48
Our new median order_value is $10.48.
In our case, we also want to exclude orders from our metrics that have a 0 item_count. Here's our reasoning: an order with a 0 item_count or a NULL order_value is clearly an invalid order. We excluded 0 values and NULL values in our metrics for mean in the previous article and we need to do the same here. Let's re-calculate the median for item_count, then, this time excluding the 0 value:
order_id | date | item_count | order_value ------------------------------------------------ 50002 | 2016-09-02 | 1 | 5.99 50003 | 2016-09-02 | 1 | 4.99 50010 | 2016-09-02 | 1 | 20.99 50006 | 2016-09-02 | 1 | 5.99 50008 | 2016-09-02 | 1 | 5.99 50009 | 2016-09-02 | 2 | 12.98 50007 | 2016-09-02 | 2 | 19.98 50001 | 2016-09-02 | 2 | 7.98 50000 | 2016-09-02 | 3 | 35.97 50004 | 2016-09-02 | 7 | 78.93
Now we've got an even number of rows for the item_count calculation, too, so we'll find the mean of the two middle values. In this case our middle rows are order_id 50008 and order_id 50009 so here's our calculation:
-- average the two middle values 1 + 2 / 2 1.5
Our median item_count is now 1.5.
While the differences from our dataset are not staggering when we exclude 0 and NULL values, in other datasets they very well may be. Consider carefully how you want these value types to be treated for your median calculations. Also, note that if you want to include 0 values and you change any NULL values to 0 in your calculation, the 0 value will sort to the top of the series whereas the NULL value will sort to the bottom. This shift can also significantly affect your median.
If we compare our medians to the means we calculated in the last article, we now have a better understanding of why our orders were on the lower side when we compared the averages we calculated across our dataset to the averages we determined while using
DISTINCT values only ... the medians are much lower.
Mean item count = 2.10 Median item count = 1.5 Mean order value = $19.98 Median order value = $10.48
This means that we have some outliers in the orders that are actually pulling our average values higher than the median. So, even though in the previous article we got a sense that orders were on the lower side than we wanted them to be (by using
AVG), our medians now confirm that. The mean values by themselves might have led us to believe we were doing better on orders than we actually are. Using the median to complement the mean helps us get a fuller picture of how our business is doing. Orders are indeed lower, both in item count and order value, than we would like. Now that we know this, we can decide what our business should do about it.
Query options for median
OK, now that we know what our median values are, how can we formulate a query to calculate the median for us for ongoing reporting? Since PostgreSQL does not come with a built-in function for median, we've got a handful of options. Let's go through each of them. We're going to start with the simplest option to understand, but the most complicated to write out and work toward the option that is the simplest to write out, but can be the most difficult to understand. If you want to jump straight to that one, check out Option 4 below.
Option 1 - Get to the middle using max of the top 50% and min of the bottom 50%
One method involves taking the maximum value from the top 50% and the maximum value from the bottom 50% of ordered rows then taking the average of the two results. Here's what that looks like:
SELECT ROUND((t.middle1 + b.middle2) / 2, 2) AS median_item_count FROM ( SELECT MAX(item_count) AS middle1 FROM ( SELECT item_count, COUNT(*) OVER() AS row_count, ROW_NUMBER() OVER (ORDER BY item_count) AS row_number FROM orders WHERE item_count <> 0 ) top WHERE (row_number::float / row_count) <= 0.50 ) t, ( SELECT MIN(item_count) AS middle2 FROM ( SELECT item_count, COUNT(*) OVER() AS row_count, ROW_NUMBER() OVER (ORDER BY item_count DESC) AS row_number FROM orders WHERE item_count <> 0 ) bottom WHERE (row_number::float / row_count) <= 0.50 ) b ;
Queries within queries wrapped up in a query - that's an eye-full! We could use CTEs (common table expressions) here using the
WITH clause to make this a little more readable, but it would still be a lengthy and complex query, though the concept behind it is simple.
The way it works is that it orders the rows by item_count in ascending order, counts and numbers the rows which do not have an item_count of 0 (as we determined we needed to do in the section above), then selects the maximum item_count value from the first 50% of rows in the table. That maximum value in our case would be 1.
Here's a way to look at the top 50% of rows that might help this make more sense:
order_id | item_count | row_count | row_number | percentile ----------------------------------------------------------- 50010 | 1 | 10 | 1 | 0.1 50002 | 1 | 10 | 2 | 0.2 50003 | 1 | 10 | 3 | 0.3 50006 | 1 | 10 | 4 | 0.4 50008 | 1 | 10 | 5 | 0.5
The maximum item_count has to be 1 since that's the only value we see in this set.
The same exact query is performed to get the bottom 50% of rows except we put them in descending order and then we take the minimum value instead of the maximum one. That value would be 2 in this case. Here's a way to look at the bottom 50% from our dataset to understand why:
order_id | item_count | row_count | row_number | percentile ----------------------------------------------------------- 50004 | 7 | 10 | 1 | 0.1 50000 | 3 | 10 | 2 | 0.2 50001 | 2 | 10 | 3 | 0.3 50007 | 2 | 10 | 4 | 0.4 50009 | 2 | 10 | 5 | 0.5
2 is the minimum item_count value from this set.
Instead of using
DESC for the bottom 50%, you could also alternatively change the "<= 0.50" condition to "> 0.50". That will produce the same result. Just a slightly different way of pulling the bottom 50% set.
Finally, those two middle values (1 from the top 50% and 2 from the bottom 50%) are averaged to produce the median of 1.50.
As you can tell by the
OVER clauses in each of the subqueries, this query uses window functions for the row count and row number. Note, too, the use of
ROUND to 2 decimal places, which we previously reviewed in Making Data Pretty to round to the specified number of decimal points, and the use of the
::float cast so that our percent calculation will actually be returned as a decimal (performing math with two integers returns an integer by default, which isn't granular enough for us to use for the percentile so we need to cast the first integer row_number as a float to return a result with a decimal point).
Let's move on to a less complicated-looking query.
Option 2 - Finding row numbers in the middle of row counts
A similar, but slightly different, way to do the same thing is by identifying the two middle rows as a list to be passed to the
SELECT ROUND(AVG(item_count), 2) AS median_item_count FROM ( SELECT item_count FROM ( SELECT item_count, COUNT(*) OVER() AS row_count, ROW_NUMBER() OVER(ORDER BY item_count) AS row_number FROM orders WHERE item_count <> 0 ) x WHERE row_number IN ((row_count + 1)/2, (row_count + 2)/2) ) y ;
Still a query within a query within a query, but less overwhelming. This is essentially the same thing as the previous query, but a little more compact. Instead of separating the values into the top and bottom 50% and then taking the minimum and maximum values, it checks if the row number is one of the middle rows by using row count.
In this case, we specifically don't use the
::float cast since we don't want the row count calculations to resolve to a decimal value. We want to keep the results there as integers to be able to be compared with the row number. For example, in our case, the row count is 10. So, if we perform the first row count calculation (10 + 1)/2 we'd get 5.5 if we specified a
::float cast. By not specifying it, the result resolves to just a 5 (because math operations with integers will produce an integer by default in PostgreSQL), which is the level of granularity we want here. In our case, this query will return row numbers 5 and 6 as the middle rows.
Once the middle rows are identified, the middle values for item_count are averaged. We'll get 1.50 from this query for our dataset.
Option 3 - Where ascending and descending rows meet in the middle
The next option is a twist on the previous one and also uses window functions with
ROW_NUMBER() like the previous two options, but it doesn't use
COUNT(). In this option, all the rows are ordered in ascending and descending order. Then the ascending order row numbers are compared to the descending order rows. Where the ascending order row number is between the descending order row numbers plus or minus 1, those are the ones in the middle. The middle values identified are averaged just like in the previous query. The result from this query for our dataset will be 1.50.
SELECT ROUND(AVG(item_count), 2) as median_item_count FROM ( SELECT item_count, ROW_NUMBER() OVER (ORDER BY item_count, order_id) AS rows_ascending, ROW_NUMBER() OVER (ORDER BY item_count DESC, order_id DESC) AS rows_descending FROM orders WHERE item_count <> 0 ) AS x WHERE rows_ascending BETWEEN rows_descending - 1 AND rows_descending + 1;
Here we've only got one query inside a query.
Note that this option requires your data to have ordered ids to line up the rows the right way if you have duplicate values in the field you are trying to find the median of. In our case, our item_count field has a few duplicate values so we use order_id to line up the rows secondary to the item_count itself.
These were the ways to calculate a median until PostgreSQL 9.4 when the ordered set aggregate functions were introduced. They still work great, but you may find this last option easiest to work with.
Option 4 - Continuous percentile
PERCENTILE_CONT is one of the ordered set aggregate functions that was introduced in PostgreSQL 9.4 and it can help us calculate a median value without any fuss. It uses linear interpolation to find the requested value in a continuous distribution. The wikipedia link is provided for reference if you want to learn more, but basically this just means that the median is found by finding the "between" value of two other values. Exactly what we want.
SELECT ROUND(PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY item_count)::numeric, 2) AS median_item_count FROM orders WHERE item_count <> 0;
Because we're looking for the median, we use "0.50" for
WITHIN GROUP clause with the
ORDER BY clause creates the ordered subset that
PERCENTILE_CONT will operate across. The "between" value that this function will arrive at for our dataset is 1.50 — the median. Note that we've cast the result from
PERCENTILE_CONT as numeric using
::numeric. We've done this so that we cn use the
ROUND function to get 2 decimal places as we have in the previous options. Since
PERCENTILE_CONT returns a double precision result by default, it would just resolve to 1.5 if we did not use
ROUND with the
We can use any of these queries to find the median for the order_value as well. They're return $10.48. To do so, we'd alter the field names and ordering, of course, and we'd replace the "WHERE item_count <> 0" with "WHERE order_value IS NOT NULL" in options 1-3. Even though options 2 and 3 use the
AVG function (which as we learned in the last article ignores NULL values by default), we need to remove the NULL value row before our query even gets to the point where it's performing the
AVG. Note that for
PERCENTILE_CONT, NULL values are automatically ignored so we do not need to include the "WHERE order_value IS NOT NULL" in that case. This is an important difference between it and the previous options as well.
In this article we learned how median can give us a fuller picture of how our business is doing than mean can by itself. To calculate the median, we looked at four different options, each with its own level of complexity and compactness. Depending on the size of your dataset, you may also notice performance differences as well. Play around with each of them with your data to determine which one works best for your use case.
In our next article, we'll get into mode to complete our overview of mean, median, and mode metrics.
If you have any feedback about this or any other Compose article, drop the Compose Articles team a line at firstname.lastname@example.org. We're happy to hear from you.