In our Metrics Maven series, Compose's data scientist shares database features, tips, tricks, and code you can use to get the metrics you need from your data. In this article, we'll look at how to calculate a weighted average.
Building on one of our previous articles about calculating a mean, in this article we'll look at how to calculate a weighted average and apply that to a group of products from our hypothetical pet supply company.
If you recall from the previous article, these are the orders we received for a variety of dog products that we sell:
order_id | item_count | order_value ------------------------------------------------ 50000 | 3 | 35.97 50001 | 2 | 7.98 50002 | 1 | 5.99 50003 | 1 | 4.99 50004 | 7 | 78.93 50005 | 0 | (NULL) 50006 | 1 | 5.99 50007 | 2 | 19.98 50008 | 1 | 5.99 50009 | 2 | 12.98 50010 | 1 | 20.99
In the previous article, we determined our mean order value to be 19.98 with an average number of items per order of 2.10. In subsequent articles, we also looked at the median and the mode of the orders so that we could get a more complete picture of how our business was doing. Let's dig even a bit deeper now to see which products these orders consisted of so that we can get more insight into how different products are performing.
Here are the details of the orders that we'll be using in this article:
order_id | date | customer_id | product_id | product | category | product_line | price -------------------------------------------------------------------------------------------------- 50000 | 2016-09-02 | 667325 | 1 | leash | dog wear | Bowser | 15.99 50000 | 2016-09-02 | 667325 | 2 | collar | dog wear | Bowser | 10.99 50000 | 2016-09-02 | 667325 | 6 | plushy | dog toys | Bowser | 8.99 50001 | 2016-09-02 | 667326 | 8 | rubber bone | dog toys | Tippy | 4.99 50001 | 2016-09-02 | 667326 | 10 | ball | dog toys | Tippy | 2.99 50002 | 2016-09-02 | 667327 | 13 | name tag | dog wear | Tippy | 5.99 50003 | 2016-09-02 | 667328 | 8 | rubber bone | dog toys | Tippy | 4.99 50004 | 2016-09-02 | 667329 | 1 | leash | dog wear | Bowser | 15.99 50004 | 2016-09-02 | 667329 | 2 | collar | dog wear | Bowser | 10.99 50004 | 2016-09-02 | 667329 | 3 | name tag | dog wear | Bowser | 5.99 50004 | 2016-09-02 | 667329 | 4 | jacket | dog wear | Bowser | 24.99 50004 | 2016-09-02 | 667329 | 5 | ball | dog toys | Bowser | 6.99 50004 | 2016-09-02 | 667329 | 6 | plushy | dog toys | Bowser | 8.99 50004 | 2016-09-02 | 667329 | 7 | rubber bone | dog toys | Bowser | 4.99 50005 | 2016-09-02 | (NULL) | (NULL) | (NULL) | (NULL) | (NULL) | (NULL) 50006 | 2016-09-02 | 667330 | 13 | name tag | dog wear | Tippy | 5.99 50007 | 2016-09-02 | 667331 | 11 | leash | dog wear | Tippy | 12.99 50007 | 2016-09-02 | 667331 | 12 | collar | dog wear | Tippy | 6.99 50008 | 2016-09-02 | 667327 | 13 | name tag | dog wear | Tippy | 5.99 50009 | 2016-09-02 | 667332 | 12 | collar | dog wear | Tippy | 6.99 50009 | 2016-09-02 | 667332 | 13 | name tag | dog wear | Tippy | 5.99 50010 | 2016-09-02 | 667333 | 14 | jacket | dog wear | Tippy | 20.99
The order detail shown here is derived from combining our orders data with our product catalog data. We first looked at the dog products catalog in our article about pivoting using
CROSSTAB. In case you aren't familiar with the products, here's what the catalog looks like:
id | product | category | product_line | price ---------------------------------------------------------- 1 | leash | dog wear | Bowser | 15.99 2 | collar | dog wear | Bowser | 10.99 3 | name tag | dog wear | Bowser | 5.99 4 | jacket | dog wear | Bowser | 24.99 5 | ball | dog toys | Bowser | 6.99 6 | plushy | dog toys | Bowser | 8.99 7 | rubber bone | dog toys | Bowser | 4.99 8 | rubber bone | dog toys | Tippy | 4.99 9 | plushy | dog toys | Tippy | 6.99 10 | ball | dog toys | Tippy | 2.99 11 | leash | dog wear | Tippy | 12.99 12 | collar | dog wear | Tippy | 6.99 13 | name tag | dog wear | Tippy | 5.99 14 | jacket | dog wear | Tippy | 20.99
But before we go further, let's backtrack a little bit to get you caught up. By running the following query on the orders detail table above, we can get the orders summary table that we worked with in the previous article about mean:
SELECT order_id, COUNT(product_id) as item_count, SUM(price) as order_value FROM orders_detail GROUP BY order_id;
If we use that query with the
WITH clause to make it a CTE (common table expression), we can calculate the mean values we got originally:
WITH orders as ( SELECT order_id, COUNT(product_id) as item_count, SUM(price) as order_value FROM orders_detail GROUP BY order_id ) SELECT ROUND(AVG(NULLIF(item_count,0)),2) AS avg_items, ROUND(AVG(order_value),2) AS avg_value FROM orders;
We'll get a mean value for orders of 19.98 with an average number of items per order of 2.10. Just as we mentioned above.
Because of the one invalid order (50005) that does not have items, we are using the
NULLIF() function for the item count calculation so that the 0 count for that order will be ignored. Note also our use of the
ROUND() function that we learned about in Make Data Pretty to round the results to only 2 decimal places. If you'd like to refresh your memory about any of these functions and calculations before diving into weighted averages with us, check out the article about calculating a mean.
Now let's build on what we learned previously to get even more insight into our business by calculating a weighted average for ordered products.
Using a weighted average calculation allows us to assign more value to certain data elements than others based on some additional criteria when arriving at our final mean value. Looking at examples is the best way to wrap your head around the concept so let's dive right in.
For this article, we're going to look at the average price for ordered products to get more insight into how different products are performing and we're going to use a weighted average price as a proxy for how our business is doing. The higher the average and the weighted average become, the better we're doing at selling our higher-priced items and getting repeat customers. More on this below. First, let's get the average order price for each of the products that were ordered:
SELECT product, ROUND(AVG(price),2) as avg_item_price FROM orders_detail GROUP BY product;
Here we're using the
ROUND() function to round our mean product value to 2 decimal places. Our result looks like this:
product | avg_item_price ------------------------------ (NULL) | (NULL) rubber bone | 4.99 collar | 8.99 jacket | 22.99 leash | 14.99 ball | 4.99 name tag | 5.99 plushy | 8.99
While some of the products from the catalog are the same price regardless of product line, others differ in price. The Bowser line is more expensive for most items than the Tippy line. For example, a Bowser jacket costs 24.99 while a Tippy jacket costs only 20.99. The average price per product that we just returned will accommodate for this since it's calculating every item at each price. For example, 1 Bowser jacket was ordered at 24.99 and 1 Tippy jacket was ordered at 20.99 so the rounded average for those two items is 22.99. If the average order price for jackets begins to go up from this point forward, then we'll know we're selling more Bowser jackets. If it begins to go down, then we're selling more Tippy jackets.
Beware of averaging averages!
We might be tempted now to average the average product prices to get an overall average item price of 10.28:
(4.99 + 8.99 + 22.99 + 14.99 + 4.99 + 5.99 + 8.99) / 7 = 10.28
The problem with averaging our averages is that we lose the impact of the amount of the items that were ordered for each product. In this case, to get an overall average item price, it would be better to go back to square one and just do an average across all the items that were ordered. Also, since we know that one order without any items is invalid, let's just exclude it from our consideration. Here's our query to get the overall average item price (excluding the invalid order that has only NULL values):
SELECT ROUND(AVG(price),2) as avg_item_price FROM orders_detail WHERE price IS NOT NULL ;
Now we can see that our overall average item price is 9.51:
(15.99 + 10.99 + 8.99 + 4.99 + 2.99 + 5.99 + 4.99 + 15.99 + 10.99 + 5.99 + 24.99 + 6.99 + 8.99 + 4.99 + 5.99 + 12.99 + 6.99 + 5.99 + 6.99 + 5.99 + 20.99) / 21 = 9.51
In the future, if we see this overall average price increase, then we'll know that more higher-priced items are being ordered.
The weighting factor
Now let's weight the average prices for each product ordered according to the number of repeat customers represented for it. Our working business assumption is that the more repeat customers a product brings in, the higher that product's importance is to our business. Let's see that in practice. Here's our query:
WITH repeat_customers as ( SELECT customer_id FROM orders_detail WHERE customer_id IS NOT NULL GROUP BY customer_id HAVING COUNT(distinct order_id) > 1 ), products AS ( SELECT od.product, ROUND(AVG(od.price),2) as avg_item_price, COUNT(distinct rc.customer_id) + 1 as weight FROM orders_detail od LEFT JOIN repeat_customers rc ON rc.customer_id = od.customer_id WHERE od.price IS NOT NULL GROUP BY od.product ) SELECT ROUND(SUM(avg_item_price * weight) / SUM(weight),2) as weighted_avg_item_price FROM products;
When we run this, we come back with a weighted average item price of 9.74. Weighting by the number of repeat customers represented for each product has increased our average item price by 0.23. We'd like to see this go even higher in the future because then that would mean we're getting more repeat customers, but it's a place to start. Let's walk through how we got there because the details here are key for us to understand.
The query uses two CTEs.
The first CTE, "repeat_customers", uses
GROUP BY and 'HAVING` to identify only customer IDs with more than one order. We could get more sophisticated with this in time and maybe add a conditional for amount of time that has passed between orders, but for now, we only care that a customer has made more than one order with us.
In the "products" CTE, the query is exactly the same as the one we used to calculate the average above, except that we've added a
COUNT() aggregation by distinct customer ID based on doing a
LEFT JOIN to our repeat customers list. We're also adding a 1 to the final value of that aggregation so that every product will at least have a weight of 1. The results of the CTE look like this:
product | avg_item_price | weight -------------------------------------- rubber bone | 4.99 | 1 collar | 8.99 | 1 jacket | 22.99 | 1 leash | 14.99 | 1 ball | 4.99 | 1 name tag | 5.99 | 2 plushy | 8.99 | 1
As we can see, none of the products have so far gotten repeat customers except for the name tag product. As more repeat customers come back to us and place more orders the weights of the different products will begin to emerge telling us which products are more likely to generate repeat business for us. With this valuable information, we may want to promote those products more.
By using these weights in our calculation for weighted average item price, we are suggesting that the products that have more repeat customers, and thus higher weights, are more valuable to our business regardless of the average order price each product may have. Now we have two ways of seeing improvement in the quality of our orders -- increased average order price for each product and how much weight that product has due to its ability to drive repeat customers.
The final part of our query is where we put the two components together. Here we run the weighted average calculation. Basically it multiplies each value and the determined weight for each product, sums them together, and then divides by the sum of the weights to get the weighted average item price. Take a look at the math for the weighted average:
((4.99 * 1) + (8.99 * 1) + (22.99 * 1) + (14.99 * 1) + (4.99 * 1) + (5.99 * 2) + (8.99 * 1)) / 7 = 9.74
This tells us more about our orders and products and how they're performing than the average product order price alone.
There are many different scenarios and you'll need to decide which criteria make sense for using as weights with your data (or if calculating a weighted average even makes sense for what you want to learn). You may consider ratings or social mentions or any of a number of different options as weighting factors. Just by determining what the weighting factors should be, you've already decided what's important to your business. Now you can apply those to see how your business is performing according to those factors.
While it's easy to rely on a simple mean or to take the average of averages for some of your business metrics, stop and think about whether the metric accurately represents what you're trying to convey. You might be missing some key insights. Sometimes one value should count more than another value and you now know how to use the weighted average calculation to make that happen.
In our next article, we'll take what we've learned here and what we learned previously in our article on moving averages and apply that toward calculating a weighted moving average. Until then...!
If you have any feedback about this or any other Compose article, drop the Compose Articles team a line at firstname.lastname@example.org. We're happy to hear from you.