Metrics Maven: Window Functions in PostgreSQL

In our Metrics Maven series, Compose's data scientist shares database features, tips, tricks, and code you can use to get the metrics you need from your data. In this first article, we'll look at how to use window functions in PostgreSQL.

PostgreSQL window functions

If you use PostgreSQL, you're probably already familiar with many of the common aggregate functions, such as COUNT(), SUM(), MIN(), MAX(), and AVG(). But you may not be familiar with window functions since they're touted as an advanced feature. Window functions aren't nearly as esoteric as they may seem, however.

As the name implies, window functions provide a "window" into your data, letting you perform aggregations against a set of data rows according to specified criteria that match the current row. While they are similar to standard aggregations, there are also additional functions that can only be used through window functions (such as the RANK() function we'll demonstrate below). In some situations window functions can minimize the complexity of your query or even speed up the performance.

Make note: window functions always use the OVER() clause so if you see OVER() you're looking at a window function. Once you get used to how the OVER() clause is formatted, where it fits in your queries, and the kind of results you can get, you'll soon start to see lots of ways to apply it. Let's dive in!

OVER( )

Depending on the purpose and complexity of the window function you want to run, you can use OVER() all by itself or with a handful of conditional clauses. Let's start by looking at using OVER() all by itself.

If the aggregation you want to run is to be performed across all the rows returned by the query and you don't need to specify any other conditions, then you can use the OVER() clause by itself. Here's an example of a simple window function querying a table in our Compose PostgreSQL database containing the United States Census data on estimated population:

SELECT name AS state_name,  
       popestimate2015 AS state_population,
       SUM(popestimate2015)
            OVER() AS national_population
FROM population  
WHERE state > 0 -- only state-level rows  
ORDER BY name  
;

Notice that we're using a window function to sum the state populations over all the result rows (that's the OVER() you see in our query... yep, just that one little addition to an otherwise standard query). Returned, we get result rows for each state and their populations with also the population sum for the nation - that's the aggregation we performed with our window function:

state_name           | state_population | national_population  
--------------------------------------------------------------
Alabama              | 4858979          | 324893002  
Alaska               | 738432           | 324893002  
Arizona              | 6828065          | 324893002  
Arkansas             | 2978204          | 324893002  
California           | 39144818         | 324893002  
Colorado             | 5456574          | 324893002  
Connecticut          | 3590886          | 324893002  
Delaware             | 945934           | 324893002  
District of Columbia | 672228           | 324893002  
Florida              | 20271272         | 324893002  
. . . .

Consider how this compares to standard aggregation functions. Without the window function, the simplest thing we could do is return the national population by itself, like this, by summing the state populations:

SELECT SUM(popestimate2015) AS national_population  
FROM population  
WHERE state > 0 -- only state-level rows  
;

The problem is, we don't get any of the state level information this way. To get the same results as our window function, we'd have to do a sub-select as a derived table:

SELECT name AS state_name,  
       popestimate2015 AS state_population,
       x.national_population
FROM population,  
(
  SELECT SUM(popestimate2015) AS national_population
  FROM population
  WHERE state > 0 -- only state-level rows
) x
WHERE state > 0 -- only state-level rows  
ORDER BY name  
;

Looks ugly in comparison, doesn't it? Using window functions, our query is much less complex and easier to understand.

Condition clauses

In the above example, we looked at a simple window function without any additional conditions, but in many cases, you'll want to apply some conditions in the form of additional clauses to your OVER() clause. One is PARTITION BY which acts as the grouping mechanism for aggregations. The other one is ORDER BY which orders the results in the window frame (the set of applicable rows).

So, besides the format of the returned rows as we reviewed above, the other obvious difference with window functions is how the syntax works in your queries: use the OVER() clause with an aggregate function (like SUM() or AVG()) and/or with a specialized window function (like RANK() or ROW_NUMBER()) in your SELECT list to indicate you're creating a window and apply additional conditions as necessary to the OVER() clause, such as using PARTITION BY (instead of the GROUP BY you may be used to for aggregation).

Let's look at some specific examples.

PARTITION BY

PARTITION BY allows us to group aggregations according to the values of the specified fields.

In our census data for estimated population, each state is categorized according to the division and region it belongs to. Let's partition first by region:

SELECT name AS state_name,  
       popestimate2015 AS state_population,
       region,
       SUM(popestimate2015)
            OVER(PARTITION BY region) AS regional_population
FROM population  
WHERE state > 0 -- only state-level rows  
ORDER BY name  
;

Now we can see the population sum by region but still get the state level data:

state_name           | state_population | region    | regional_population  
-------------------------------------------------------------------------
Alabama              | 4858979          | South     | 121182847  
Alaska               | 738432           | West      | 76044679  
Arizona              | 6828065          | West      | 76044679  
Arkansas             | 2978204          | South     | 121182847  
California           | 39144818         | West      | 76044679  
Colorado             | 5456574          | West      | 76044679  
Connecticut          | 3590886          | Northeast | 56283891  
Delaware             | 945934           | South     | 121182847  
District of Columbia | 672228           | South     | 121182847  
Florida              | 20271272         | South     | 121182847  
. . . .

Let's add division:

SELECT name AS state_name,  
       popestimate2015 AS state_population,
       region,
       division,
       SUM(popestimate2015)
            OVER(PARTITION BY division) AS divisional_population
FROM population  
WHERE state > 0 -- only state-level rows  
ORDER BY name  
;

Now we're looking at state-level data, broken out by region and division, with a population summary at the division level:

state_name           | state_population | region    | division           | divisional_population  
-------------------------------------------------------------------------------------------
Alabama              | 4858979          | South     | East South Central | 18876703  
Alaska               | 738432           | West      | Pacific            | 52514181  
Arizona              | 6828065          | West      | Mountain           | 23530498  
Arkansas             | 2978204          | South     | West South Central | 39029380  
California           | 39144818         | West      | Pacific            | 52514181  
Colorado             | 5456574          | West      | Mountain           | 23530498  
Connecticut          | 3590886          | Northeast | New England        | 14727584  
Delaware             | 945934           | South     | South Atlantic     | 63276764  
District of Columbia | 672228           | South     | South Atlantic     | 63276764  
Florida              | 20271272         | South     | South Atlantic     | 63276764  
. . . .

ORDER BY

As you've probably noticed in the previous queries, we're using ORDER BY in the usual way to order the results by the state name, but we can also use ORDER BY in our OVER() clause to impact the window function calculation. For example, we'd want to use ORDER BY as a condition for the RANK() window function since ranking requires an order to be established. Let's rank the states according to highest population:

SELECT name AS state_name,  
       popestimate2015 AS state_population,
       RANK()
            OVER(ORDER BY popestimate2015 desc) AS state_rank
FROM population  
WHERE state > 0 -- only state-level rows  
ORDER BY name  
;

In this case, we've added ORDER BY popestimate2015 desc as a condition of our OVER() clause in order to describe how the ranking should be performed. Because we still have our ORDER BY name clause for our result set, though, our results will continue to be in state name order, but we'll see the populations ranked accordingly with California as the number 1 ranked based on its population:

state_name           | state_population | state_rank  
-----------------------------------------------------
Alabama              | 4858979          | 24  
Alaska               | 738432           | 49  
Arizona              | 6828065          | 14  
Arkansas             | 2978204          | 34  
California           | 39144818         | 1  
Colorado             | 5456574          | 22  
Connecticut          | 3590886          | 29  
Delaware             | 945934           | 46  
District of Columbia | 672228           | 50  
Florida              | 20271272         | 3  
. . . .

Let's combine our PARTITION BY and our ORDER BY window function clauses now to see the ranking of the states by population within each region. For this, we'll change our result-level ORDER BY name clause at the end to order by region instead so that it'll be clear how our window function works:

SELECT name AS state_name,  
       popestimate2015 AS state_population,
       region,
       RANK()
           OVER(PARTITION BY region ORDER BY popestimate2015 desc) AS regional_state_rank
FROM population  
WHERE state > 0 -- only state-level rows  
ORDER BY region  
;

Our results:

state_name   | state_population | region        | regional_state_rank  
----------------------------------------------------------------------
Illinois     | 12859995         | Midwest       | 1  
Ohio         | 11613423         | Midwest       | 2  
Michigan     | 9922576          | Midwest       | 3  
Indiana      | 6619680          | Midwest       | 4  
Missouri     | 6083672          | Midwest       | 5  
Wisconsin    | 5771337          | Midwest       | 6  
Minnesota    | 5489594          | Midwest       | 7  
Iowa         | 3123899          | Midwest       | 8  
Kansas       | 2911641          | Midwest       | 9  
Nebraska     | 1896190          | Midwest       | 10  
South Dakota | 858469           | Midwest       | 11  
North Dakota | 756927           | Midwest       | 12  
New York     | 19795791         | Northeast     | 1  
Pennsylvania | 12802503         | Northeast     | 2  
New Jersey   | 8958013          | Northeast     | 3  
. . . .

Here we can see that Illinois is the number 1 ranking state by population in the Midwest region and New York is number 1 in the Northeast region.

So, we combined some conditions here, but what if we need more than one window function? Read on...

Named window functions

In queries where you are using the same window function logic for more than one returned field or where you need to use more than one window function definition, you can name them to make your query more readable.

Here's an example where we've defined two windows functions. One, named "rw", partitions by region and the other, named "dw", partitions by division. We're using each one twice - once to calculate the population sum and again to calculate the population average. Our windows functions are defined and named using the WINDOW clause which comes after the WHERE clause in our query:

SELECT name AS state_name,  
       popestimate2015 AS state_population,
       region,
       SUM(popestimate2015)
            OVER rw AS regional_population,
       AVG(popestimate2015)
            OVER rw AS avg_regional_state_population,
       division,
       SUM(popestimate2015)
            OVER dw AS divisional_population,
       AVG(popestimate2015)
            OVER dw AS avg_divisional_state_population
FROM population  
WHERE state > 0 -- only state-level rows  
WINDOW rw AS (PARTITION BY region),  
       dw AS (PARTITION BY division)
ORDER BY name  
;

Since we didn't do any manipulation on the averages values yet, the numbers look a little crazy, but that can be easily cleaned up using ROUND() and CAST() if need be. Our purpose here is to demonstrate how to use multiple window functions and the results you'll get. Check it out:

state_name           | state_population | region    | regional_population | avg_regional_state_population | division           | divisional_population | avg_divisional_state_population  
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Alabama              | 4858979          | South     | 121182847           | 7128402.764705882353          | East South Central | 18876703              | 4719175.750000000000  
Alaska               | 738432           | West      | 76044679            | 5849590.692307692308          | Pacific            | 52514181              | 10502836.200000000000  
Arizona              | 6828065          | West      | 76044679            | 5849590.692307692308          | Mountain           | 23530498              | 2941312.250000000000  
Arkansas             | 2978204          | South     | 121182847           | 7128402.764705882353          | West South Central | 39029380              | 9757345.000000000000  
California           | 39144818         | West      | 76044679            | 5849590.692307692308          | Pacific            | 52514181              | 10502836.200000000000  
Colorado             | 5456574          | West      | 76044679            | 5849590.692307692308          | Mountain           | 23530498              | 2941312.250000000000  
Connecticut          | 3590886          | Northeast | 56283891            | 6253765.666666666667          | New England        | 14727584              | 2454597.333333333333  
Delaware             | 945934           | South     | 121182847           | 7128402.764705882353          | South Atlantic     | 63276764              | 7030751.555555555556  
District of Columbia | 672228           | South     | 121182847           | 7128402.764705882353          | South Atlantic     | 63276764              | 7030751.555555555556  
Florida              | 20271272         | South     | 121182847           | 7128402.764705882353          | South Atlantic     | 63276764              | 7030751.555555555556  
. . . .

Now that's an informative report of population metrics... and window functions made it easy!

Wrapping up

This article has given you a glimpse of the power of PostgreSQL window functions. We touched on the benefits of using window functions, looked at how they are different (and similar) to standard aggregation functions, and learned how to use them with various conditional clauses, walking through examples along the way. Now that you can see how window functions work, start trying them out by replacing standard aggregations with window functions in your queries. Once you get the hang of them you'll be hooked.

In our next article we'll look at window framing options in PostgreSQL to give you even more control over how your window functions behave.