Metrics Maven: Window Frames in PostgreSQL
PublishedIn our Metrics Maven series, Compose's data scientist shares database features, tips, tricks, and code you can use to get the metrics you need from your data. In this article, we'll take a closer look at window frames in PostgreSQL.
In our last article, we learned about using window functions in PostgreSQL. With that as our foundation, we'll dive a little deeper here to learn about window frames. Understanding window frames is important because the results you get from your window functions depend on how your windows are framed.
Window frames are used to indicate how many rows around the current row the window function should include. Think of them as a way to zoom-in with your window function to focus on specific rows. The basic syntax for specifying a window frame is to use either the RANGE
or ROWS
indicator (we cover both below) and a BETWEEN
clause specifying the start of the frame and the end of the frame. There are a few different options to understand with window frames so let's start by exploring the default behavior.
UNBOUNDED PRECEDING
Window frames in window functions use UNBOUNDED PRECEDING
by default, more accurately RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
, when an ORDER BY
is specified. What we'll see is that the window function will be inclusive of the current row being processed and all previous rows according to the order indicated by the ORDER BY
clause. To understand this, let's perform a SUM()
aggregation of state populations from the US Census, which you may remember from our last article's examples:
SELECT name AS state_name,
popestimate2015 AS state_population,
SUM(popestimate2015)
OVER() AS national_population
FROM population
WHERE state > 0 -- only state-level rows
ORDER BY name
;
In the above query, our window function is performing the SUM
of the state populations over all the rows since we did not indicate any conditions. Note that our ORDER BY
clause is not part of the window function; it's being performed against the result set after the window function has been completed. Our results:
state_name | state_population | national_population
--------------------------------------------------------------
Alabama | 4858979 | 324893002
Alaska | 738432 | 324893002
Arizona | 6828065 | 324893002
Arkansas | 2978204 | 324893002
California | 39144818 | 324893002
Colorado | 5456574 | 324893002
Connecticut | 3590886 | 324893002
Delaware | 945934 | 324893002
District of Columbia | 672228 | 324893002
Florida | 20271272 | 324893002
. . . .
Now, however, if we move the ORDER BY
clause so that it becomes a condition applied to the window function, you'll see we get something quite different:
SELECT name AS state_name,
popestimate2015 AS state_population,
SUM(popestimate2015)
OVER(ORDER BY name) AS national_population
FROM population
WHERE state > 0 -- only state-level rows
;
In these results, we can see that the aggregation was performed against the current row and all of the previous rows to it (the default behavior for window frames) so that the "national_population" column is now a running subtotal as each new row gets added to the previous ones:
state_name | state_population | national_population
--------------------------------------------------------------
Alabama | 4858979 | 4858979
Alaska | 738432 | 5597411
Arizona | 6828065 | 12425476
Arkansas | 2978204 | 15403680
California | 39144818 | 54548498
Colorado | 5456574 | 60005072
Connecticut | 3590886 | 63595958
Delaware | 945934 | 64541892
District of Columbia | 672228 | 65214120
Florida | 20271272 | 85485392
. . . .
Because we've added an ORDER BY
clause as a condition of our window function, the query is actually using RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
. If we were to be explicit about it in the query, it'd look like this:
SELECT name AS state_name,
popestimate2015 AS state_population,
SUM(popestimate2015)
OVER(ORDER BY name RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS national_population
FROM population
WHERE state > 0 -- only state-level rows
;
Understandably, it's a bit of a mind bend when you first encounter this functionality. At the same time, a whole new world of possibilities opens up for reporting and analytics.
Now that we know the default behavior, let's have a look at some other options.
PRECEDING
While the default behavior for window frames uses an UNBOUNDED PRECEDING
, it is also possible to specify how many rows preceding from the current row we'd like to include. Let's say we only want to include the previous 2 rows with the current one. Our new query would look like this:
SELECT name AS state_name,
popestimate2015 AS state_population,
SUM(popestimate2015)
OVER(ORDER BY name ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS national_population
FROM population
WHERE state > 0 -- only state-level rows
;
Notice two changes we've made here: instead of RANGE
we're using ROWS
because RANGE
can only be used with UNBOUNDED
and also we've swapped out UNBOUNDED
for the value of 2 to indicate only 2 rows preceding. Our results will now show the SUM
of state populations including only the current row and the 2 previous rows:
state_name | state_population | national_population
--------------------------------------------------------------
Alabama | 4858979 | 4858979
Alaska | 738432 | 5597411
Arizona | 6828065 | 12425476
Arkansas | 2978204 | 10544701
California | 39144818 | 48951087
Colorado | 5456574 | 47579596
Connecticut | 3590886 | 48192278
Delaware | 945934 | 9993394
District of Columbia | 672228 | 5209048
Florida | 20271272 | 21889434
. . . .
For example, if we look at the "national_population" value for Colorado, it's 47579596. That number is the sum of state populations for Colorado (5456574), California (39144818), and Arkansas (2978204). So, while this calculation probably doesn't make a whole lot of sense for determining the national population, it can be useful in data sets where adjacency or sequence is a key factor, such as one with geographic coordinates defining nearby areas, a social network, or time series event data. We'll keep our example consistent here, though, to make learning the concepts easy.
Now we've seen how changing the value of PRECEDING
impacts our results. We can do the same thing using FOLLOWING
.
FOLLOWING
In the examples above, our window function has been run against the current row and some amount of preceding rows. Let's flip that and start at the current row with some amount of following rows instead. Let's start with UNBOUNDED FOLLOWING
:
SELECT name AS state_name,
popestimate2015 AS state_population,
SUM(popestimate2015)
OVER(ORDER BY name RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS national_population
FROM population
WHERE state > 0 -- only state-level rows
;
Now in our query, we're starting with the current row and performing our window function against it and all the following rows. This will give us a reverse tally of the population sums from what we saw with UNBOUNDED PRECEDING
. Take a look:
state_name | state_population | national_population
--------------------------------------------------------------
Alabama | 4858979 | 324893002
Alaska | 738432 | 320034023
Arizona | 6828065 | 319295591
Arkansas | 2978204 | 312467526
California | 39144818 | 309489322
Colorado | 5456574 | 270344504
Connecticut | 3590886 | 264887930
Delaware | 945934 | 261297044
District of Columbia | 672228 | 260351110
Florida | 20271272 | 259678882
. . . .
Here we can see that we actually start out with the full national population value of 324893002 because we're starting with the current row (Alabama in this case) and summing all the following rows with it. As we move to Alaska, Alabama's population isn't included anymore. Instead now we're summing the population for Alaska and all the following rows, and so on.
Just as we did with PRECEDING
, we can also supply a value of following rows to include rather than do UNBOUNDED
. Let's use 2 again to keep things simple:
SELECT name AS state_name,
popestimate2015 AS state_population,
SUM(popestimate2015)
OVER(ORDER BY name ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) AS national_population
FROM population
WHERE state > 0 -- only state-level rows
;
Again, we have to switch from the RANGE
indicator to ROWS
since RANGE
can only be used with UNBOUNDED
. We've also set 2 as the number of following rows instead of UNBOUNDED
. Here's what we get:
state_name | state_population | national_population
--------------------------------------------------------------
Alabama | 4858979 | 12425476
Alaska | 738432 | 10544701
Arizona | 6828065 | 48951087
Arkansas | 2978204 | 47579596
California | 39144818 | 48192278
Colorado | 5456574 | 9993394
Connecticut | 3590886 | 5209048
Delaware | 945934 | 21889434
District of Columbia | 672228 | 31158360
Florida | 20271272 | 31917735
. . . .
Let's look at Colorado again as our example to understand what's happening in these results. The "national_population" value for Colorado is 9993394. It includes the current row of Colorado's population (5456574) and the two following rows of Connecticut (3590886) and Delaware (945934). As we mentioned above, this kind of aggregation functionality makes more sense to use in a data set where adjacency or sequence matters, but it's easy enough for us to understand what's happening by using this population data set that we're familiar with.
BETWEEN x PRECEDING AND y FOLLOWING
Finally, we can combine PRECEDING
and FOLLOWING
to specify exactly how many rows before the current row and how many rows after the current row to include for the window function. Let's use 2 in either direction:
SELECT name AS state_name,
popestimate2015 AS state_population,
SUM(popestimate2015)
OVER(ORDER BY name ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING) AS national_population
FROM population
WHERE state > 0 -- only state-level rows
;
And our results:
state_name | state_population | national_population
--------------------------------------------------------------
Alabama | 4858979 | 12425476
Alaska | 738432 | 15403680
Arizona | 6828065 | 54548498
Arkansas | 2978204 | 55146093
California | 39144818 | 57998547
Colorado | 5456574 | 52116416
Connecticut | 3590886 | 49810440
Delaware | 945934 | 30936894
District of Columbia | 672228 | 35695180
Florida | 20271272 | 33535897
. . . .
Looking at Colorado again, we see that the value for the "national_population" column is 52116416. That's the sum of Colorado (5456574) as the current row plus two the rows preceding, California (39144818) and Arkansas (2978204), and the two rows following, Connecticut (3590886) and Delaware (945934)
Remember at the beginning we noted that if we're not including an ORDER BY
clause in our window function, then by default, the aggregation is performed over all the rows, that is RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
. It's the exact same thing as our original query, but because we now have an ORDER BY
clause in our window function, let's state it explicitly to demonstrate the behavior:
SELECT name AS state_name,
popestimate2015 AS state_population,
SUM(popestimate2015)
OVER(ORDER BY name RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS national_population
FROM population
WHERE state > 0 -- only state-level rows
;
And the results we get are the sum of all the rows to give us the national population:
state_name | state_population | national_population
--------------------------------------------------------------
Alabama | 4858979 | 324893002
Alaska | 738432 | 324893002
Arizona | 6828065 | 324893002
Arkansas | 2978204 | 324893002
California | 39144818 | 324893002
Colorado | 5456574 | 324893002
Connecticut | 3590886 | 324893002
Delaware | 945934 | 324893002
District of Columbia | 672228 | 324893002
Florida | 20271272 | 324893002
. . . .
Now, we're back where we started!
Partitioning
As we saw in our previous article, using the PARTITION BY
condition in our window function effectively groups the results according to the values in the field we're partitioning by. In one example from that article, we partitioned by region for each state, so we saw states grouped to "West" or "South" or "Northeast" regions, for example. We can do the same thing here, but also include a window frame specification. Here's and example:
SELECT name AS state_name,
popestimate2015 AS state_population,
region,
SUM(popestimate2015)
OVER(PARTITION BY region ORDER BY name RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS regional_population
FROM population
WHERE state > 0 -- only state-level rows
;
Notice, we've added the "region" column and we've also added the PARTITION BY
condition for region to our window function. We're using RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
to be explicit about what we're doing, but because that is the default behavior for a window frame, we don't need to actually include it. Here are the results we get:
state_name | state_population | region | regional_population
-------------------------------------------------------------------------
Illinois | 12859995 | Midwest | 12859995
Indiana | 6619680 | Midwest | 19479675
Iowa | 3123899 | Midwest | 22603574
Kansas | 2911641 | Midwest | 25515215
Michigan | 9922576 | Midwest | 35437791
Minnesota | 5489594 | Midwest | 40927385
Missouri | 6083672 | Midwest | 47011057
Nebraska | 1896190 | Midwest | 48907247
North Dakota | 756927 | Midwest | 49664174
Ohio | 11613423 | Midwest | 61277597
South Dakota | 858469 | Midwest | 62136066
Wisconsin | 5771337 | Midwest | 67907403
Connecticut | 3590886 | Northeast | 3590886
Maine | 1329328 | Northeast | 4920214
Massachusetts | 6794422 | Northeast | 11714636
. . . .
You are probably asking yourself, "What just happened? Where did Alabama go?" In this case, because we are using PARTITION BY
, the results are first grouped by the partition field values in order, then by the ORDER BY
clause we specified which is the state name. So, with the Midwest region, our states are listed in alphabetical order. Also, because we used RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
, we're getting the running tally of the sum of the state populations within just that region. So, for example, if we look at Kansas, the "regional_population" column has a value of 25515215. That includes the state population of Kansas (2911641) and all of the preceding rows, Iowa (3123899), Indiana (6619680), and Illinois (12859995). Once the Midwest is completed, we're on to the next partition for the Northeast. Massachusetts is showing the "regional_population" as 11714636, which includes the population for Massachusetts (6794422) and the preceding rows in the Northeast, Maine (1329328) and Connecticut (3590886).
Same functionality we saw in our examples above, but now just grouped according to partition.
A Final Note about Range vs Rows
As we mentioned a couple times, the RANGE
indicator can only be used with UNBOUNDED
, while ROWS
can actually be used for all of the options we discussed here. Besides that, they actually treat aggregations differently in a very small but important way: if the field you use for ORDER BY
does not contain unique values for each row, then RANGE
will combine all the rows it comes across for non-unique values rather than processing them one at a time whereas ROWS
will include all of the rows in the non-unique bunch but process each of them separately. Our examples have all ordered by state name, which are unique to each row in our data set, so it didn't matter for us. However, if you order by date and your data set has more than one row per date, for example, you'll want to make sure you choose between using ROWS
and RANGE
appropriately for your intended results.
To demonstrate this, let's make a small change to the PARTITION BY
query we used above. In this case, all we're going to do is change our ORDER BY
clause to order by region rather than by state name. Because region has non-unique values for each row, RANGE
will combine them. Take a look:
SELECT name AS state_name,
popestimate2015 AS state_population,
region,
SUM(popestimate2015)
OVER(PARTITION BY region ORDER BY region RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS regional_population
FROM population
WHERE state > 0 -- only state-level rows
;
We'll see that the state names are in random order, but that the regional sum is now the combination of all the rows of the region:
state_name | state_population | region | regional_population
-------------------------------------------------------------------------
North Dakota | 756927 | Midwest | 67907403
Indiana | 6619680 | Midwest | 67907403
Michigan | 9922576 | Midwest | 67907403
Ohio | 11613423 | Midwest | 67907403
Wisconsin | 5771337 | Midwest | 67907403
Iowa | 3123899 | Midwest | 67907403
Kansas | 2911641 | Midwest | 67907403
Minnesota | 5489594 | Midwest | 67907403
Missouri | 6083672 | Midwest | 67907403
Nebraska | 1896190 | Midwest | 67907403
South Dakota | 858469 | Midwest | 67907403
Illinois | 12859995 | Midwest | 67907403
Connecticut | 3590886 | Northeast | 56283891
Maine | 1329328 | Northeast | 56283891
Massachusetts | 6794422 | Northeast | 56283891
. . . .
In this particular example, it amounts to the same thing as summing across all the rows in the partition so we might as well not even have the conditions for ORDER BY
or RANGE
in there. In cases without a partition, the combination of the rows with non-unique values may not be what you're looking for so be aware of this behavior.
Let's see how using ROWS
changes the results for that query:
SELECT name AS state_name,
popestimate2015 AS state_population,
region,
SUM(popestimate2015)
OVER(PARTITION BY region ORDER BY region ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS regional_population
FROM population
WHERE state > 0 -- only state-level rows
;
Now we've got the running tally again as we saw above, with the total regional population of the Midwest eventually being calculated as 67907403. We just don't have our states in any particular order:
state_name | state_population | region | regional_population
-------------------------------------------------------------------------
North Dakota | 756927 | Midwest | 756927
Indiana | 6619680 | Midwest | 7376607
Michigan | 9922576 | Midwest | 17299183
Ohio | 11613423 | Midwest | 28912606
Wisconsin | 5771337 | Midwest | 34683943
Iowa | 3123899 | Midwest | 37807842
Kansas | 2911641 | Midwest | 40719483
Minnesota | 5489594 | Midwest | 46209077
Missouri | 6083672 | Midwest | 52292749
Nebraska | 1896190 | Midwest | 54188939
South Dakota | 858469 | Midwest | 55047408
Illinois | 12859995 | Midwest | 67907403
Connecticut | 3590886 | Northeast | 3590886
Maine | 1329328 | Northeast | 4920214
Massachusetts | 6794422 | Northeast | 11714636
. . . .
Wrapping Up
In this article, we built on top of what we'd learned previously about using window functions in PostgreSQL by learning how to use window frame options to get different results. For reference, the PostgreSQL documentation has a good little section on window frames in the chapter about SQL Syntax as well as the official writeup about window functions.
In our next Metrics Maven article, we'll apply what we've learned here to calculate a moving average.
Make sure you're getting the latest from Compose by subscribing to Compose Articles or following us on Twitter, Facebook or Google+ by using the links below.