In our Metrics Maven series, Compose's data scientist shares database features, tips, tricks, and code you can use to get the metrics you need from your data. In this article, we'll take another look at crosstab to help you pivot wisely.
In this article, we'll look again at the
crosstab function, focusing this time on the option that does not use category sql. We'll explain how and when (not) to use it. We'll also compare it to the option that does use category sql, which we covered in our previous article on pivot tables using crosstab. You can also find some discussion of both options in the official Postgres documentation for
crosstab with Compose PostgreSQL, refer to the previous article for how to enable
tablefunc for your deployment.
Pivoting your data
Pivoting your data can sometimes simplify how data is presented, making it more understandable. PostgreSQL provides the
crosstab function to help you do that.
The simplest option for
crosstab, which we'll focus on in this article, is referred to as
crosstab(text sql) in the documentation. We're going to call it the "basic option" in this article. It differs from the
crosstab(text source_sql, text category_sql) option in a couple of significant ways, which we'll cover a little later in this article. If you want to learn how the
crosstab(text source_sql, text category_sql) option works before diving into the basic option we're going to look at here, check out our article Creating Pivot Tables in PostgreSQL Using Crosstab.
As we did in the previous article on
crosstab, we'll use the product catalog from our hypothetical pet supply company.
id | product | category | product_line | price | number_in_stock --------------------------------------------------------------------------- 1 | leash | dog wear | Bowser | 15.99 | 48 2 | collar | dog wear | Bowser | 10.99 | 76 3 | name tag | dog wear | Bowser | 5.99 | 204 4 | jacket | dog wear | Bowser | 24.99 | 12 5 | ball | dog toys | Bowser | 6.99 | 27 6 | plushy | dog toys | Bowser | 8.99 | 30 7 | rubber bone | dog toys | Bowser | 4.99 | 52 8 | rubber bone | dog toys | Tippy | 4.99 | 38 9 | plushy | dog toys | Tippy | 6.99 | 16 10 | ball | dog toys | Tippy | 2.99 | 47 11 | leash | dog wear | Tippy | 12.99 | 34 12 | collar | dog wear | Tippy | 6.99 | 88 13 | name tag | dog wear | Tippy | 5.99 | 165 14 | jacket | dog wear | Tippy | 20.99 | 50 15 | rope chew | dog toys | Bowser | 7.99 | 27
We've got one additional item in the catalog than we had last time - a rope chew toy in the Bowser line.
As tends to be the case in a relational database, the data in our table extends downward, repeating values for product_line, category, and product in different combinations for each price and inventory value. We want to create a pivot table to get a simpler view of our catalog.
Let's get started.
Aggregating a value
Let's start by getting the average price of each product category for each of the product lines. This was the same example we used in our previous article, but this time we'll use the basic
crosstab option which does not use category sql. Here's what that looks like:
-- using the basic option SELECT * FROM crosstab( 'select distinct product_line, category, round(avg(price),2) as avg_price from catalog group by product_line, category order by 1,2') AS catalog(product_line character varying, dog_toys numeric, dog_wear numeric ) ;
Let's look at the sub-query first.
The first thing to notice is that the sub-query is encapsulated in single quotes. The query is passed to the crosstab function as a string that it will run. Next, we're using
round with the
avg function to get the average price rounded to two decimal places for each product line and category combination. If you need a refresher on either of these functions, we covered rounding in our Make Data Pretty article and
avg in our article on mean. To get the average aggregate value, we're using
group by with the other two columns: product line and category. Finally, we're ordering our results first by product line then by category. The ordering is important because in the outer query, we have to explicitly name the columns we want to see and need to know what order the data will be populated into them.
The outer query calls the
crosstab function on the results from the sub-query and then specifies the column names and data types for presenting that pivoted data. In effect, this creates a new table that is presented as the result of the query.
Here's what the result looks like:
product_line | dog_toys_avg_price | dog_wear_avg_price ------------------------------------------------------- Bowser | 7.24 | 14.49 Tippy | 4.99 | 11.74
If you compare this result to the result we got in the previous article, which used the category sql option for
crosstab, you'll find they are exactly the same. The only difference here is that the Bowser line of dog toys has increased slightly since then due to the addition of the new rope chew toy.
If that's the case, then you may be wondering what the difference is between the two
crosstab options... Let's look into that.
Before we look at the key differences between the two optons, let's cover a couple caveats that apply to both options.
As we mentioned above and in the previous article, both options require you to indicate an explicit order for the resulting columns. If you don't order the data, you will have a hodge-podge in your pivoted columns. PostgreSQL has no way of being "smart" here. It does not know how your pivoted columns map to the data you're querying on. You have to know that and, to do that, you need to order the data.
The next probably goes without saying, but let's just go ahead and be extra clear here. The resultant pivoted rows must have only one value for each row. If there can be multiple values, then PostgreSQL will return you one from the list. For example, if we did not average the price in the query above (which aggregates the price to a single value), but instead simply requested the price column, we could get any one of the prices associated with each product category and product line. The point of pivoting the data is to present a single value for each possible combination of attributes.
The pivoted columns' data types must match the data types expected from the source data. For example, we would get an error if we had our pivoted column "avg_price" specified as an
int instead of
numeric. The result of the
avg function on our price values will not produce an
int. If we wanted the pivoted column to be an
int, we'd need to cast the value accordingly in the sub-query.
Now the differences...
The reason our previous article used the category sql option of
crosstab is that it is more flexible than the basic option we covered here. We recommend using the category sql option over the basic option. Here's why:
The category sql option allows you to include "extra columns" in your pivot table result. The extra columns are not used for pivoting. The common use for these columns is to provide additional descriptors of the data in each row. You can have as many extra columns as you want; however, there can only be one extra column value for each. As mentioned above in the caveat section, multiple possible values will result in any one of the values being displayed. Here's an example to make this easier to understand:
-- using category sql option SELECT * FROM crosstab ( 'select distinct product_line, case when product_line = 'Bowser' then 'Fashion and fun for big dogs.' when product_line = 'Tippy' then 'Small dog fashion and fun.' else null end as description, category, round(avg(price),2) as avg_price from catalog group by product_line, category order by product_line', 'select distinct category from catalog order by 1' ) AS ( product_line character varying, description text, dog_toys_avg_price numeric, dog_wear_avg_price numeric ) ;
In this case, we've added an "extra column" called "description". For this example, we've provided the values manually in a case statement, but another column from the table could also be used if there was a column that contained the additional descriptive data. Note the escaped single quotes (leaving us with two single quotes around each text value) since the sub-query for
crosstab needs to be encapsulated in single quotes. We'll get a result like this:
product_line | description | dog_toys_avg_price | dog_wear_avg_price ------------------------------------------------------------------------------------------ Bowser | Fashion and fun for big dogs. | 7.24 | 14.49 Tippy | Small dog fashion and fun. | 4.99 | 11.74
If you try to add an extra column using the basic crosstab option, you'll get this error: "The provided SQL must return 3 columns: rowid, category, and values." No extra columns allowed.
The next difference is the more compelling one to use the category sql
crosstab option: it places data in the correct columns when one of the rows is missing a particular value for the specified attribute.
Remember our new dog toy, the rope chew? The Tippy line does not have that toy. If we wanted to pivot by toy products instead of by product categories, we would only be able to get an accurate result using the category sql option of
crosstab. Check it out:
-- using category sql option SELECT * FROM crosstab( 'select distinct product_line, category, product, price from catalog where category = ''dog toys'' order by 1,2', 'select distinct product from catalog where category = ''dog toys'' order by 1' ) AS ( product_line character varying, category character varying, ball_price numeric, plushy_price numeric, rope_chew_price numeric, rubber_bone_price numeric ) ;
We'll get the result we expect (a null value for the rope chew toy on the Tippy product line row):
product_line | category | ball_price | plushy_price | rope_chew_price | rubber_bone_price ------------------------------------------------------------------------------- Bowser | dog toys | 6.99 | 8.99 | 7.99 | 4.99 Tippy | dog toys | 2.99 | 6.99 | | 4.99
Notice in the query above that we did not need to use an aggregation for the price because there is one price per product per product line. We also added the product category as an "extra column" since our pivoted rows were limited to only the category for dog toys - just an additional example of using extra columns for you to "chew" on (pun intended).
If we use the basic option of
crosstab to present dog toy prices per product line, not only can we not use any extra columns as we learned above, but worse, we'll get a bad result... Here's the SQL:
-- using the basic option SELECT * FROM crosstab( 'select distinct product_line, product, price from catalog where category = ''dog toys'' order by 1,2') AS catalog(product_line text, ball_price numeric, plushy_price numeric, rope_chew_price numeric, rubber_bone_price numeric ) ;
And here's the result:
product_line | ball_price | plushy_price | rope_chew_price | rubber_bone_price ------------------------------------------------------------------------------- Bowser | 6.99 | 8.99 | 7.99 | 4.99 Tippy | 2.99 | 6.99 | 4.99 |
WHAT?! The rubber bone price for the Tippy line shifted over to populate the rope chew column! That's because, without the category sql, the basic option does not know how many columns to expect and simply populates the data top-to-bottom, left-to-right until there are no more values. So, you can only use the basic option if your data values have exactly the same number and type. That's a pretty big limiter in our book.
Hopefully you now have a much more thorough understanding of
crosstab in PostgreSQL, including the differences between the two options that are presented in the documentation. You are now armed with the knowledge that will help you pivot wisely.