Metrics Maven: Crosstab Revisited - Pivoting Wisely in PostgreSQL

Published

In our Metrics Maven series, Compose's data scientist shares database features, tips, tricks, and code you can use to get the metrics you need from your data. In this article, we'll take another look at crosstab to help you pivot wisely.

In this article, we'll look again at the crosstab function, focusing this time on the option that does not use category sql. We'll explain how and when (not) to use it. We'll also compare it to the option that does use category sql, which we covered in our previous article on pivot tables using crosstab. You can also find some discussion of both options in the official Postgres documentation for tablefunc.

To use crosstab with Compose PostgreSQL, refer to the previous article for how to enable tablefunc for your deployment.

Pivoting your data

Pivoting your data can sometimes simplify how data is presented, making it more understandable. PostgreSQL provides the crosstab function to help you do that.

The simplest option for crosstab, which we'll focus on in this article, is referred to as crosstab(text sql) in the documentation. We're going to call it the "basic option" in this article. It differs from the crosstab(text source_sql, text category_sql) option in a couple of significant ways, which we'll cover a little later in this article. If you want to learn how the crosstab(text source_sql, text category_sql) option works before diving into the basic option we're going to look at here, check out our article Creating Pivot Tables in PostgreSQL Using Crosstab.

Our data

As we did in the previous article on crosstab, we'll use the product catalog from our hypothetical pet supply company.

id  | product         | category | product_line   | price | number_in_stock  
---------------------------------------------------------------------------
1   | leash           | dog wear | Bowser         | 15.99 | 48  
2   | collar          | dog wear | Bowser         | 10.99 | 76  
3   | name tag        | dog wear | Bowser         | 5.99  | 204  
4   | jacket          | dog wear | Bowser         | 24.99 | 12  
5   | ball            | dog toys | Bowser         | 6.99  | 27  
6   | plushy          | dog toys | Bowser         | 8.99  | 30  
7   | rubber bone     | dog toys | Bowser         | 4.99  | 52  
8   | rubber bone     | dog toys | Tippy          | 4.99  | 38  
9   | plushy          | dog toys | Tippy          | 6.99  | 16  
10  | ball            | dog toys | Tippy          | 2.99  | 47  
11  | leash           | dog wear | Tippy          | 12.99 | 34  
12  | collar          | dog wear | Tippy          | 6.99  | 88  
13  | name tag        | dog wear | Tippy          | 5.99  | 165  
14  | jacket          | dog wear | Tippy          | 20.99 | 50  
15  | rope chew       | dog toys | Bowser         | 7.99  | 27  

We've got one additional item in the catalog than we had last time - a rope chew toy in the Bowser line.

As tends to be the case in a relational database, the data in our table extends downward, repeating values for product_line, category, and product in different combinations for each price and inventory value. We want to create a pivot table to get a simpler view of our catalog.

Let's get started.

Aggregating a value

Let's start by getting the average price of each product category for each of the product lines. This was the same example we used in our previous article, but this time we'll use the basic crosstab option which does not use category sql. Here's what that looks like:

 -- using the basic option
SELECT * FROM crosstab(  
  'select distinct
     product_line,
     category,
     round(avg(price),2) as avg_price
   from catalog
   group by product_line, category
   order by 1,2')
AS catalog(product_line character varying,  
    dog_toys numeric,
    dog_wear numeric
)
;

Let's look at the sub-query first.

The first thing to notice is that the sub-query is encapsulated in single quotes. The query is passed to the crosstab function as a string that it will run. Next, we're using round with the avg function to get the average price rounded to two decimal places for each product line and category combination. If you need a refresher on either of these functions, we covered rounding in our Make Data Pretty article and avg in our article on mean. To get the average aggregate value, we're using group by with the other two columns: product line and category. Finally, we're ordering our results first by product line then by category. The ordering is important because in the outer query, we have to explicitly name the columns we want to see and need to know what order the data will be populated into them.

The outer query calls the crosstab function on the results from the sub-query and then specifies the column names and data types for presenting that pivoted data. In effect, this creates a new table that is presented as the result of the query.

Here's what the result looks like:

product_line | dog_toys_avg_price | dog_wear_avg_price  
-------------------------------------------------------
Bowser       | 7.24               | 14.49  
Tippy        | 4.99               | 11.74  

If you compare this result to the result we got in the previous article, which used the category sql option for crosstab, you'll find they are exactly the same. The only difference here is that the Bowser line of dog toys has increased slightly since then due to the addition of the new rope chew toy.

If that's the case, then you may be wondering what the difference is between the two crosstab options... Let's look into that.

Comparing crosstab options

Before we look at the key differences between the two optons, let's cover a couple caveats that apply to both options.

Similar caveats

As we mentioned above and in the previous article, both options require you to indicate an explicit order for the resulting columns. If you don't order the data, you will have a hodge-podge in your pivoted columns. PostgreSQL has no way of being "smart" here. It does not know how your pivoted columns map to the data you're querying on. You have to know that and, to do that, you need to order the data.

The next probably goes without saying, but let's just go ahead and be extra clear here. The resultant pivoted rows must have only one value for each row. If there can be multiple values, then PostgreSQL will return you one from the list. For example, if we did not average the price in the query above (which aggregates the price to a single value), but instead simply requested the price column, we could get any one of the prices associated with each product category and product line. The point of pivoting the data is to present a single value for each possible combination of attributes.

The pivoted columns' data types must match the data types expected from the source data. For example, we would get an error if we had our pivoted column "avg_price" specified as an int instead of numeric. The result of the avg function on our price values will not produce an int. If we wanted the pivoted column to be an int, we'd need to cast the value accordingly in the sub-query.

Now the differences...

Big differences

The reason our previous article used the category sql option of crosstab is that it is more flexible than the basic option we covered here. We recommend using the category sql option over the basic option. Here's why:

The category sql option allows you to include "extra columns" in your pivot table result. The extra columns are not used for pivoting. The common use for these columns is to provide additional descriptors of the data in each row. You can have as many extra columns as you want; however, there can only be one extra column value for each. As mentioned above in the caveat section, multiple possible values will result in any one of the values being displayed. Here's an example to make this easier to understand:

 -- using category sql option
SELECT * FROM crosstab (  
  'select distinct
    product_line,
    case
        when product_line = 'Bowser' then 'Fashion and fun for big dogs.'
        when product_line = 'Tippy' then 'Small dog fashion and fun.'
        else null
    end as description,
    category,
    round(avg(price),2) as avg_price    
  from catalog
  group by
    product_line,
    category
  order by product_line',

  'select distinct category from catalog order by 1'
 )
 AS (
   product_line character varying,
   description text,
   dog_toys_avg_price numeric,
   dog_wear_avg_price numeric
 )
 ;

In this case, we've added an "extra column" called "description". For this example, we've provided the values manually in a case statement, but another column from the table could also be used if there was a column that contained the additional descriptive data. Note the escaped single quotes (leaving us with two single quotes around each text value) since the sub-query for crosstab needs to be encapsulated in single quotes. We'll get a result like this:

product_line | description                      | dog_toys_avg_price | dog_wear_avg_price  
------------------------------------------------------------------------------------------
Bowser       | Fashion and fun for big dogs.    | 7.24               | 14.49  
Tippy        | Small dog fashion and fun.       | 4.99               | 11.74  

If you try to add an extra column using the basic crosstab option, you'll get this error: "The provided SQL must return 3 columns: rowid, category, and values." No extra columns allowed.

The next difference is the more compelling one to use the category sql crosstab option: it places data in the correct columns when one of the rows is missing a particular value for the specified attribute.

Remember our new dog toy, the rope chew? The Tippy line does not have that toy. If we wanted to pivot by toy products instead of by product categories, we would only be able to get an accurate result using the category sql option of crosstab. Check it out:

 -- using category sql option
SELECT * FROM crosstab(  
  'select distinct
     product_line,
     category,
     product,
     price
   from catalog
   where category = ''dog toys''
   order by 1,2',

  'select distinct product from catalog where category = ''dog toys'' order by 1'
 )
 AS (
   product_line character varying,
   category character varying,
   ball_price numeric,
   plushy_price numeric,
   rope_chew_price numeric,
   rubber_bone_price numeric
)
;

We'll get the result we expect (a null value for the rope chew toy on the Tippy product line row):

product_line | category | ball_price | plushy_price | rope_chew_price | rubber_bone_price  
-------------------------------------------------------------------------------
Bowser       | dog toys | 6.99       | 8.99         | 7.99            | 4.99  
Tippy        | dog toys | 2.99       | 6.99         |                 | 4.99  

Notice in the query above that we did not need to use an aggregation for the price because there is one price per product per product line. We also added the product category as an "extra column" since our pivoted rows were limited to only the category for dog toys - just an additional example of using extra columns for you to "chew" on (pun intended).

If we use the basic option of crosstab to present dog toy prices per product line, not only can we not use any extra columns as we learned above, but worse, we'll get a bad result... Here's the SQL:

 -- using the basic option
SELECT * FROM crosstab(  
  'select distinct
     product_line,
     product,
     price
   from catalog
   where category = ''dog toys''
   order by 1,2')
AS catalog(product_line text,  
    ball_price numeric,
    plushy_price numeric,
    rope_chew_price numeric,
    rubber_bone_price numeric
)
;

And here's the result:

product_line | ball_price | plushy_price | rope_chew_price | rubber_bone_price  
-------------------------------------------------------------------------------
Bowser       | 6.99       | 8.99         | 7.99            | 4.99  
Tippy        | 2.99       | 6.99         | 4.99            |  

WHAT?! The rubber bone price for the Tippy line shifted over to populate the rope chew column! That's because, without the category sql, the basic option does not know how many columns to expect and simply populates the data top-to-bottom, left-to-right until there are no more values. So, you can only use the basic option if your data values have exactly the same number and type. That's a pretty big limiter in our book.

Wrapping up

Hopefully you now have a much more thorough understanding of crosstab in PostgreSQL, including the differences between the two options that are presented in the documentation. You are now armed with the knowledge that will help you pivot wisely.

Image by: herbert2512