Metrics Maven: Window Frames in PostgreSQL

In our Metrics Maven series, Compose's data scientist shares database features, tips, tricks, and code you can use to get the metrics you need from your data. In this article, we'll take a closer look at window frames in PostgreSQL.

In our last article, we learned about using window functions in PostgreSQL. With that as our foundation, we'll dive a little deeper here to learn about window frames. Understanding window frames is important because the results you get from your window functions depend on how your windows are framed.

Window frames are used to indicate how many rows around the current row the window function should include. Think of them as a way to zoom-in with your window function to focus on specific rows. The basic syntax for specifying a window frame is to use either the RANGE or ROWS indicator (we cover both below) and a BETWEEN clause specifying the start of the frame and the end of the frame. There are a few different options to understand with window frames so let's start by exploring the default behavior.

UNBOUNDED PRECEDING

Window frames in window functions use UNBOUNDED PRECEDING by default, more accurately RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, when an ORDER BY is specified. What we'll see is that the window function will be inclusive of the current row being processed and all previous rows according to the order indicated by the ORDER BY clause. To understand this, let's perform a SUM() aggregation of state populations from the US Census, which you may remember from our last article's examples:

SELECT name AS state_name,  
       popestimate2015 AS state_population,
       SUM(popestimate2015)
            OVER() AS national_population
FROM population  
WHERE state > 0 -- only state-level rows  
ORDER BY name  
;

In the above query, our window function is performing the SUM of the state populations over all the rows since we did not indicate any conditions. Note that our ORDER BY clause is not part of the window function; it's being performed against the result set after the window function has been completed. Our results:

state_name           | state_population | national_population  
--------------------------------------------------------------
Alabama              | 4858979          | 324893002  
Alaska               | 738432           | 324893002  
Arizona              | 6828065          | 324893002  
Arkansas             | 2978204          | 324893002  
California           | 39144818         | 324893002  
Colorado             | 5456574          | 324893002  
Connecticut          | 3590886          | 324893002  
Delaware             | 945934           | 324893002  
District of Columbia | 672228           | 324893002  
Florida              | 20271272         | 324893002  
. . . .

Now, however, if we move the ORDER BY clause so that it becomes a condition applied to the window function, you'll see we get something quite different:

SELECT name AS state_name,  
       popestimate2015 AS state_population,
       SUM(popestimate2015)
            OVER(ORDER BY name) AS national_population
FROM population  
WHERE state > 0 -- only state-level rows  
;

In these results, we can see that the aggregation was performed against the current row and all of the previous rows to it (the default behavior for window frames) so that the "national_population" column is now a running subtotal as each new row gets added to the previous ones:

state_name           | state_population | national_population  
--------------------------------------------------------------
Alabama              | 4858979          | 4858979  
Alaska               | 738432           | 5597411  
Arizona              | 6828065          | 12425476  
Arkansas             | 2978204          | 15403680  
California           | 39144818         | 54548498  
Colorado             | 5456574          | 60005072  
Connecticut          | 3590886          | 63595958  
Delaware             | 945934           | 64541892  
District of Columbia | 672228           | 65214120  
Florida              | 20271272         | 85485392  
. . . .

Because we've added an ORDER BY clause as a condition of our window function, the query is actually using RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. If we were to be explicit about it in the query, it'd look like this:

SELECT name AS state_name,  
       popestimate2015 AS state_population,
       SUM(popestimate2015)
            OVER(ORDER BY name RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS national_population
FROM population  
WHERE state > 0 -- only state-level rows  
;

Understandably, it's a bit of a mind bend when you first encounter this functionality. At the same time, a whole new world of possibilities opens up for reporting and analytics.

Now that we know the default behavior, let's have a look at some other options.

PRECEDING

While the default behavior for window frames uses an UNBOUNDED PRECEDING, it is also possible to specify how many rows preceding from the current row we'd like to include. Let's say we only want to include the previous 2 rows with the current one. Our new query would look like this:

SELECT name AS state_name,  
       popestimate2015 AS state_population,
       SUM(popestimate2015)
            OVER(ORDER BY name ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS national_population
FROM population  
WHERE state > 0 -- only state-level rows  
;

Notice two changes we've made here: instead of RANGE we're using ROWS because RANGE can only be used with UNBOUNDED and also we've swapped out UNBOUNDED for the value of 2 to indicate only 2 rows preceding. Our results will now show the SUM of state populations including only the current row and the 2 previous rows:

state_name           | state_population | national_population  
--------------------------------------------------------------
Alabama              | 4858979          | 4858979  
Alaska               | 738432           | 5597411  
Arizona              | 6828065          | 12425476  
Arkansas             | 2978204          | 10544701  
California           | 39144818         | 48951087  
Colorado             | 5456574          | 47579596  
Connecticut          | 3590886          | 48192278  
Delaware             | 945934           | 9993394  
District of Columbia | 672228           | 5209048  
Florida              | 20271272         | 21889434  
. . . .

For example, if we look at the "national_population" value for Colorado, it's 47579596. That number is the sum of state populations for Colorado (5456574), California (39144818), and Arkansas (2978204). So, while this calculation probably doesn't make a whole lot of sense for determining the national population, it can be useful in data sets where adjacency or sequence is a key factor, such as one with geographic coordinates defining nearby areas, a social network, or time series event data. We'll keep our example consistent here, though, to make learning the concepts easy.

Now we've seen how changing the value of PRECEDING impacts our results. We can do the same thing using FOLLOWING.

FOLLOWING

In the examples above, our window function has been run against the current row and some amount of preceding rows. Let's flip that and start at the current row with some amount of following rows instead. Let's start with UNBOUNDED FOLLOWING:

SELECT name AS state_name,  
       popestimate2015 AS state_population,
       SUM(popestimate2015)
            OVER(ORDER BY name RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS national_population
FROM population  
WHERE state > 0 -- only state-level rows  
;

Now in our query, we're starting with the current row and performing our window function against it and all the following rows. This will give us a reverse tally of the population sums from what we saw with UNBOUNDED PRECEDING. Take a look:

state_name           | state_population | national_population  
--------------------------------------------------------------
Alabama              | 4858979          | 324893002  
Alaska               | 738432           | 320034023  
Arizona              | 6828065          | 319295591  
Arkansas             | 2978204          | 312467526  
California           | 39144818         | 309489322  
Colorado             | 5456574          | 270344504  
Connecticut          | 3590886          | 264887930  
Delaware             | 945934           | 261297044  
District of Columbia | 672228           | 260351110  
Florida              | 20271272         | 259678882  
. . . .

Here we can see that we actually start out with the full national population value of 324893002 because we're starting with the current row (Alabama in this case) and summing all the following rows with it. As we move to Alaska, Alabama's population isn't included anymore. Instead now we're summing the population for Alaska and all the following rows, and so on.

Just as we did with PRECEDING, we can also supply a value of following rows to include rather than do UNBOUNDED. Let's use 2 again to keep things simple:

SELECT name AS state_name,  
       popestimate2015 AS state_population,
       SUM(popestimate2015)
            OVER(ORDER BY name ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) AS national_population
FROM population  
WHERE state > 0 -- only state-level rows  
;

Again, we have to switch from the RANGE indicator to ROWS since RANGE can only be used with UNBOUNDED. We've also set 2 as the number of following rows instead of UNBOUNDED. Here's what we get:

state_name           | state_population | national_population  
--------------------------------------------------------------
Alabama              | 4858979          | 12425476  
Alaska               | 738432           | 10544701  
Arizona              | 6828065          | 48951087  
Arkansas             | 2978204          | 47579596  
California           | 39144818         | 48192278  
Colorado             | 5456574          | 9993394  
Connecticut          | 3590886          | 5209048  
Delaware             | 945934           | 21889434  
District of Columbia | 672228           | 31158360  
Florida              | 20271272         | 31917735  
. . . .

Let's look at Colorado again as our example to understand what's happening in these results. The "national_population" value for Colorado is 9993394. It includes the current row of Colorado's population (5456574) and the two following rows of Connecticut (3590886) and Delaware (945934). As we mentioned above, this kind of aggregation functionality makes more sense to use in a data set where adjacency or sequence matters, but it's easy enough for us to understand what's happening by using this population data set that we're familiar with.

BETWEEN x PRECEDING AND y FOLLOWING

Finally, we can combine PRECEDING and FOLLOWING to specify exactly how many rows before the current row and how many rows after the current row to include for the window function. Let's use 2 in either direction:

SELECT name AS state_name,  
       popestimate2015 AS state_population,
       SUM(popestimate2015)
            OVER(ORDER BY name ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING) AS national_population
FROM population  
WHERE state > 0 -- only state-level rows  
;

And our results:

state_name           | state_population | national_population  
--------------------------------------------------------------
Alabama              | 4858979          | 12425476  
Alaska               | 738432           | 15403680  
Arizona              | 6828065          | 54548498  
Arkansas             | 2978204          | 55146093  
California           | 39144818         | 57998547  
Colorado             | 5456574          | 52116416  
Connecticut          | 3590886          | 49810440  
Delaware             | 945934           | 30936894  
District of Columbia | 672228           | 35695180  
Florida              | 20271272         | 33535897  
. . . .

Looking at Colorado again, we see that the value for the "national_population" column is 52116416. That's the sum of Colorado (5456574) as the current row plus two the rows preceding, California (39144818) and Arkansas (2978204), and the two rows following, Connecticut (3590886) and Delaware (945934)

Remember at the beginning we noted that if we're not including an ORDER BY clause in our window function, then by default, the aggregation is performed over all the rows, that is RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING. It's the exact same thing as our original query, but because we now have an ORDER BY clause in our window function, let's state it explicitly to demonstrate the behavior:

SELECT name AS state_name,  
       popestimate2015 AS state_population,
       SUM(popestimate2015)
            OVER(ORDER BY name RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS national_population
FROM population  
WHERE state > 0 -- only state-level rows  
;

And the results we get are the sum of all the rows to give us the national population:

state_name           | state_population | national_population  
--------------------------------------------------------------
Alabama              | 4858979          | 324893002  
Alaska               | 738432           | 324893002  
Arizona              | 6828065          | 324893002  
Arkansas             | 2978204          | 324893002  
California           | 39144818         | 324893002  
Colorado             | 5456574          | 324893002  
Connecticut          | 3590886          | 324893002  
Delaware             | 945934           | 324893002  
District of Columbia | 672228           | 324893002  
Florida              | 20271272         | 324893002  
. . . .

Now, we're back where we started!

Partitioning

As we saw in our previous article, using the PARTITION BY condition in our window function effectively groups the results according to the values in the field we're partitioning by. In one example from that article, we partitioned by region for each state, so we saw states grouped to "West" or "South" or "Northeast" regions, for example. We can do the same thing here, but also include a window frame specification. Here's and example:

SELECT name AS state_name,  
       popestimate2015 AS state_population,
       region,
       SUM(popestimate2015)
            OVER(PARTITION BY region ORDER BY name RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS regional_population
FROM population  
WHERE state > 0 -- only state-level rows  
;

Notice, we've added the "region" column and we've also added the PARTITION BY condition for region to our window function. We're using RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW to be explicit about what we're doing, but because that is the default behavior for a window frame, we don't need to actually include it. Here are the results we get:

state_name    | state_population | region    | regional_population  
-------------------------------------------------------------------------
Illinois      | 12859995         | Midwest   | 12859995  
Indiana       | 6619680          | Midwest   | 19479675  
Iowa          | 3123899          | Midwest   | 22603574  
Kansas        | 2911641          | Midwest   | 25515215  
Michigan      | 9922576          | Midwest   | 35437791  
Minnesota     | 5489594          | Midwest   | 40927385  
Missouri      | 6083672          | Midwest   | 47011057  
Nebraska      | 1896190          | Midwest   | 48907247  
North Dakota  | 756927           | Midwest   | 49664174  
Ohio          | 11613423         | Midwest   | 61277597  
South Dakota  | 858469           | Midwest   | 62136066  
Wisconsin     | 5771337          | Midwest   | 67907403  
Connecticut   | 3590886          | Northeast | 3590886  
Maine         | 1329328          | Northeast | 4920214  
Massachusetts | 6794422          | Northeast | 11714636  
. . . .

You are probably asking yourself, "What just happened? Where did Alabama go?" In this case, because we are using PARTITION BY, the results are first grouped by the partition field values in order, then by the ORDER BY clause we specified which is the state name. So, with the Midwest region, our states are listed in alphabetical order. Also, because we used RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, we're getting the running tally of the sum of the state populations within just that region. So, for example, if we look at Kansas, the "regional_population" column has a value of 25515215. That includes the state population of Kansas (2911641) and all of the preceding rows, Iowa (3123899), Indiana (6619680), and Illinois (12859995). Once the Midwest is completed, we're on to the next partition for the Northeast. Massachusetts is showing the "regional_population" as 11714636, which includes the population for Massachusetts (6794422) and the preceding rows in the Northeast, Maine (1329328) and Connecticut (3590886).

Same functionality we saw in our examples above, but now just grouped according to partition.

A Final Note about Range vs Rows

As we mentioned a couple times, the RANGE indicator can only be used with UNBOUNDED, while ROWS can actually be used for all of the options we discussed here. Besides that, they actually treat aggregations differently in a very small but important way: if the field you use for ORDER BY does not contain unique values for each row, then RANGE will combine all the rows it comes across for non-unique values rather than processing them one at a time whereas ROWS will include all of the rows in the non-unique bunch but process each of them separately. Our examples have all ordered by state name, which are unique to each row in our data set, so it didn't matter for us. However, if you order by date and your data set has more than one row per date, for example, you'll want to make sure you choose between using ROWS and RANGE appropriately for your intended results.

To demonstrate this, let's make a small change to the PARTITION BY query we used above. In this case, all we're going to do is change our ORDER BY clause to order by region rather than by state name. Because region has non-unique values for each row, RANGE will combine them. Take a look:

SELECT name AS state_name,  
       popestimate2015 AS state_population,
       region,
       SUM(popestimate2015)
            OVER(PARTITION BY region ORDER BY region RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS regional_population
FROM population  
WHERE state > 0 -- only state-level rows  
;

We'll see that the state names are in random order, but that the regional sum is now the combination of all the rows of the region:

state_name    | state_population | region    | regional_population  
-------------------------------------------------------------------------
North Dakota  | 756927           | Midwest   | 67907403  
Indiana       | 6619680          | Midwest   | 67907403  
Michigan      | 9922576          | Midwest   | 67907403  
Ohio          | 11613423         | Midwest   | 67907403  
Wisconsin     | 5771337          | Midwest   | 67907403  
Iowa          | 3123899          | Midwest   | 67907403  
Kansas        | 2911641          | Midwest   | 67907403  
Minnesota     | 5489594          | Midwest   | 67907403  
Missouri      | 6083672          | Midwest   | 67907403  
Nebraska      | 1896190          | Midwest   | 67907403  
South Dakota  | 858469           | Midwest   | 67907403  
Illinois      | 12859995         | Midwest   | 67907403  
Connecticut   | 3590886          | Northeast | 56283891  
Maine         | 1329328          | Northeast | 56283891  
Massachusetts | 6794422          | Northeast | 56283891  
. . . .

In this particular example, it amounts to the same thing as summing across all the rows in the partition so we might as well not even have the conditions for ORDER BY or RANGE in there. In cases without a partition, the combination of the rows with non-unique values may not be what you're looking for so be aware of this behavior.

Let's see how using ROWS changes the results for that query:

SELECT name AS state_name,  
       popestimate2015 AS state_population,
       region,
       SUM(popestimate2015)
            OVER(PARTITION BY region ORDER BY region ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS regional_population
FROM population  
WHERE state > 0 -- only state-level rows  
;

Now we've got the running tally again as we saw above, with the total regional population of the Midwest eventually being calculated as 67907403. We just don't have our states in any particular order:

state_name    | state_population | region    | regional_population  
-------------------------------------------------------------------------
North Dakota  | 756927           | Midwest   | 756927  
Indiana       | 6619680          | Midwest   | 7376607  
Michigan      | 9922576          | Midwest   | 17299183  
Ohio          | 11613423         | Midwest   | 28912606  
Wisconsin     | 5771337          | Midwest   | 34683943  
Iowa          | 3123899          | Midwest   | 37807842  
Kansas        | 2911641          | Midwest   | 40719483  
Minnesota     | 5489594          | Midwest   | 46209077  
Missouri      | 6083672          | Midwest   | 52292749  
Nebraska      | 1896190          | Midwest   | 54188939  
South Dakota  | 858469           | Midwest   | 55047408  
Illinois      | 12859995         | Midwest   | 67907403  
Connecticut   | 3590886          | Northeast | 3590886  
Maine         | 1329328          | Northeast | 4920214  
Massachusetts | 6794422          | Northeast | 11714636  
. . . .

Wrapping Up

In this article, we built on top of what we'd learned previously about using window functions in PostgreSQL by learning how to use window frame options to get different results. For reference, the PostgreSQL documentation has a good little section on window frames in the chapter about SQL Syntax as well as the official writeup about window functions.

In our next Metrics Maven article, we'll apply what we've learned here to calculate a moving average.

Make sure you're getting the latest from Compose by subscribing to Compose Articles or following us on Twitter, Facebook or Google+ by using the links below.