In our Metrics Maven series, Compose's data scientist shares database features, tips, tricks, and code you can use to get the metrics you need from your data. In this article, we'll look at the crosstab function in PostgreSQL to create a pivot table of our data with aggregate values.
If you've used spreadsheet software, then you're probably familiar with pivot tables since they're one of the key features of those applications. The same pivot functionality can be applied to data in your database tables.
Typical relational database tables will contain multiple rows, often with repeating values in some columns. In this way, the data extends downward through the table. Aggregate functions and group by options can be applied at query time to determine metrics like count, sum, and average for categories of the data. It's all pretty straightforward, but sometimes having a pivot table that extends the data across, rather than downward, with those metrics at-the-ready makes it easier to do comparisons or to filter on certain attributes. Luckily PostgreSQL has a function for creating pivot tables. It's called
In this article we're going to look at how to use the
crosstab function to output a result set of aggregate values pivoted by category. In our examples below, we'll pivot data from a product catalog, but you'll be able to see how it can be applied to a variety of data situations.
First things first. To run
crosstab we'll need to enable the tablefunc module. Besides
tablefunc module also contains functions for generating random values as well as creating a tree-like hierarchy from table data.
On Compose PostgreSQL, we enable
tablefunc in the Compose administrative console for the Postgres database where we'll run
crosstab. We do this in the data browser by navigating to our database then clicking on the "Extensions" option on the left side:
Once we're on the Extensions page, we just scroll down to "tablefunc" and select "install" from the right side. It will instantly be enabled:
Voila! Now we're ready to
Let the pivoting begin
There are a couple of different crosstab options that you can read about on the tablefunc page in the PostgreSQL documentation and experiment with for your particular situation. We're going to focus on the one that uses source SQL since that fits our use case best:
crosstab(text source_sql, text category_sql)
Let's get to it!
Here's what our example "catalog" table data looks like:
id | product | category | product_line | price | number_in_stock --------------------------------------------------------------------------- 1 | leash | dog wear | Bowser | 15.99 | 48 2 | collar | dog wear | Bowser | 10.99 | 76 3 | name tag | dog wear | Bowser | 5.99 | 204 4 | jacket | dog wear | Bowser | 24.99 | 12 5 | ball | dog toys | Bowser | 6.99 | 27 6 | plushy | dog toys | Bowser | 8.99 | 30 7 | rubber bone | dog toys | Bowser | 4.99 | 52 8 | rubber bone | dog toys | Tippy | 4.99 | 38 9 | plushy | dog toys | Tippy | 6.99 | 16 10 | ball | dog toys | Tippy | 2.99 | 47 11 | leash | dog wear | Tippy | 12.99 | 34 12 | collar | dog wear | Tippy | 6.99 | 88 13 | name tag | dog wear | Tippy | 5.99 | 165 14 | jacket | dog wear | Tippy | 20.99 | 50
Pivoting for one aggregate value
The first thing we want to know from our data is the average price of the products in each category by product line. Typically we'd run a query that uses the
avg aggregate function and
group by to determine this:
select distinct product_line, category, round(avg(price),2) as avg_price from catalog group by product_line, category order by product_line ;
Here's what we'd get back:
product_line | category | avg_price ------------------------------------ Bowser | dog toys | 6.99 Bowser | dog wear | 14.49 Tippy | dog toys | 4.99 Tippy | dog wear | 11.74
Now let's use
crosstab to get a pivoted result set instead:
select * from crosstab ( 'select distinct product_line, category, round(avg(price),2) as avg_price from catalog group by product_line, category order by product_line', 'select distinct category from catalog order by 1' ) AS ( product_line character varying, dog_toys_avg_price numeric, dog_wear_avg_price numeric ) ;
Here we're wrapping our original query in a
select * from crosstab. Note the single quotes around the original query. You'll get a syntax error without them. You might also have noticed our usage of the
round function, which we covered in our previous article on how to make data pretty, in order to round the result to an appropriate number of decimal points - 2 in this case since we're dealing with currency.
We've then added another query (note the comma separating the two queries) to return the distinct categories in the order we're expecting. It's important to know exactly which values (and in which order the pivoted field will return them) so that we can name the new columns correctly. Ideally, the values are not changing often (if ever) since we're doing a bit of hard-coding here. And that brings us to specifying the output column names and data types in an
Our result set now looks like this with the data across the new columns:
product_line | dog_toys_avg_price | dog_wear_avg_price ------------------------------------------------------- Bowser | 6.99 | 14.49 Tippy | 4.99 | 11.74
In this format, we can now easily see that the Tippy product line is, on average, less expensive than the Bowser line. But what if we want to also know the count of items from each line that are still in stock? Let's add that.
Pivoting to get more than one aggregate value
So, you might think that you can just add
sum(number_in_stock) to the query with the corresponding 2 output columns for the total number in stock for dog toys and dog wear, but that will only produce an "invalid return type" error that states "Query-specified return tuple has 5 columns but crosstab returns 4." Huh? It means that there are more fields in the result set than the crosstab expects. That's because
sum(number_in_stock) will be treated as an "extra column" that won't get pivoted since that's how this particular flavor of
crosstab works - you can only pivot one field and one aggregate value at a time. Note that you'll also get this error if the second query that provides the distinct category names ends up having more or less values than you've defined columns for in the output.
While this might seem like a setback, it gives us a chance to be more explicit with our query by using CTEs (common table expressions) to do multiple
crosstab queries and join them together. If you're not familiar with CTEs, check out our article on series, random and with. What we'll do is create two CTEs, each with one of the crosstabbed aggregates we want to generate. Our new query looks like this:
with product_lines_avg_price as ( select * from crosstab ( 'select distinct product_line, category, round(avg(price),2) as avg_price from catalog group by product_line, category order by product_line', 'select distinct category from catalog order by 1' ) AS ( product_line character varying, dog_toys_avg_price numeric, dog_wear_avg_price numeric ) ), product_lines_total_in_stock as ( select * from crosstab ( 'select distinct product_line, category, sum(number_in_stock) as total_in_stock from catalog group by product_line, category order by product_line', 'select distinct category from catalog order by 1' ) AS ( product_line character varying, dog_toys_total_in_stock numeric, dog_wear_total_in_stock numeric ) ) select ap.product_line, ap.dog_toys_avg_price, ap.dog_wear_avg_price, tis.dog_toys_total_in_stock, tis.dog_wear_total_in_stock from product_lines_avg_price ap join product_lines_total_in_stock tis on tis.product_line = ap.product_line ;
We've created two CTEs, one called "product_lines_avg_price" and one called "product_lines_total_in_stock". Each of these is a temporary table that we can select from to create our final pivot table by joining on the unique product_line value from each of them.
Here's what our final table looks like:
product_line | dog_toys_avg_price | dog_wear_avg_price | dog_toys_total_in_stock | dog_wear_total_in_stock ----------------------------------------------------------------------------------------------------------- Bowser | 6.99 | 14.49 | 109 | 340 Tippy | 4.99 | 11.74 | 101 | 337
While the total amounts left in stock for each product line are not too different from each other at this point, the Tippy line has a bit less than the Bowser line. If we started to see the Tippy inventory decrease faster over time than the Bowser inventory, we might consider whether the lower prices for the Tippy line lead to more purchases from it. Imagine if we had dozens of product lines and other data points we wanted to consider besides average price and number in stock. Creating a pivot table like this makes it easy to compare one to another.
If you need to use the pivot often, you may want to consider creating a materialized view of the pivoted data. Users and other applications may find the format more simple to use than what they'd see with more standard query result sets. You may also want to experiment with the other
crosstab options to see how they work for your use case.
Join us next time when we'll get more cozy with mean, median and mode metrics.