Deeper into Postgres 9.5 - New Group By Options for Aggregation

If you use PostgreSQL, you'll know that its relational model and aggregate functions lend themselves easily to the creation of business reports. In 9.5, aggregating data for reports is made more powerful by the addition of some new GROUP BY options: GROUPING SETS, ROLLUP, and CUBE.

Group By

You've probably used GROUP BY zillions of times to return counts, sums, averages, or other aggregations on data, grouped according to the fields you specify, such as by date, category, or geography. What you get back are aggregate metrics of the rows that fit into the combination of the grouping fields you've defined. The new options for GROUP BY in 9.5 let you manipulate this base structure to produce alternative reports of aggregated metrics by sets, hierarchy, and combinations (easily!). Seeing the sets side-by-side, how they fit together hierarchically, and the data combinations they create can provide a deeper understanding of your data.

We'll look at examples of these new options by querying the United States Census data on estimated population. For our examples, we're going to focus on the rows which contain state-level information - each row contains the data for one state. Before we look at the new options, though, let's remind ourselves what a simple aggregation query looks like which just uses the standard GROUP BY clause so we have a base for comparison against the output of the new options. So, when we run SELECT... COUNT(*) FROM... WHERE... GROUP BY...:

SELECT region, division, COUNT(*)  
 FROM population
 WHERE state > 0 -- states only
 GROUP BY region, division;

What we get back is the combination of the field values for the fields we specified in GROUP BY with the aggregated count of states:

region          |  division            | count  
------------------------------------------------
Northeast       |  Middle Atlantic     |  3  
South           |  East South Central  |  4  
Not Applicable  |  Not Applicable      |  1  
South           |  South Atlantic      |  9  
South           |  West South Central  |  4  
West            |  Mountain            |  8  
Midwest         |  East North Central  |  5  
West            |  Pacific             |  5  
Northeast       |  New England         |  6  
Midwest         |  West North Central  |  7  

Look familiar? Now, let's move on to see what kind of additional reports these new options can provide.

Grouping Sets

The GROUPING SETS option allows us to create sets for each field specified in the GROUP BY clause of aggregation queries. So, instead of the combination of the field values, we'll get subtotals for each of the field values and a grand total of all the matching rows. Let's look at the same simple COUNT(*) example but using GROUPING SETS in the GROUP BY clause instead:

SELECT region, division, COUNT(*)  
 FROM population
 WHERE state > 0 -- states only
 GROUP BY GROUPING SETS (region, division, ());

Our results show us the aggregates for each set:

region          |  division            | count  
------------------------------------------------
Midwest         |                      |  12  
Northeast       |                      |  9  
Not Applicable  |                      |  1  
South           |                      |  17  
West            |                      |  13  
                |                      |  52
                |  East North Central  |  5
                |  East South Central  |  4
                |  Middle Atlantic     |  3
                |  Mountain            |  8
                |  New England         |  6
                |  Not Applicable      |  1
                |  Pacific             |  5
                |  South Atlantic      |  9
                |  West North Central  |  7
                |  West South Central  |  4

Our sets include the count of states for each of the regions and each of the divisions by which states are categorized. We also have a grand total of 52 rows for each of the grouping sets, which may not have been obvious when we used the standard GROUP BY clause previously. "But wait!" you say. "Why are there 52 states? Doesn't the United States only have 50 states?" In this data set, Washington D.C. (District of Columbia) is considered a state, as is the territory of Puerto Rico. Puerto Rico is "Not Applicable" to either the regions or the divisions.

Now, let's look at something more interesting than COUNT(*) for the states. How about an estimated population summary for 2015 based on the population of each of the states grouped to region and division sets?

SELECT region, division, SUM(POPESTIMATE2015) AS est_pop  
 FROM population
 WHERE state > 0 -- states only
 GROUP BY GROUPING SETS (region, division, ());

Voila! Our results show us how population breaks out across each region and each division in the United States:

region          |  division            | est_pop  
---------------------------------------------------
Midwest         |                      |  67907403  
Northeast       |                      |  56283891  
Not Applicable  |                      |  3474182  
South           |                      |  121182847  
West            |                      |  76044679  
                |                      |  324893002
                |  East North Central  |  46787011
                |  East South Central  |  18876703
                |  Middle Atlantic     |  41556307
                |  Mountain            |  23530498
                |  New England         |  14727584
                |  Not Applicable      |  3474182
                |  Pacific             |  52514181
                |  South Atlantic      |  63276764
                |  West North Central  |  21120392
                |  West South Central  |  39029380

Here we can see that the sum total estimated population in 2015 for all the 52 states is 324,893,002 and we can see how it splits out across the regions and divisions each state belongs to.

How about average births and deaths? Let's see:

SELECT region,  
  division,
  CAST(AVG(BIRTHS2015) AS INTEGER) AS avg_births,
  CAST(AVG(DEATHS2015) AS INTEGER) AS avg_deaths
 FROM population
 WHERE state > 0 -- states only
 GROUP BY GROUPING SETS (region, division, ());

Our results:

region          |  division            | avg_births  |  avg_deaths  
------------------------------------------------------------------
Midwest         |                      |  69592      |  49112  
Northeast       |                      |  70610      |  53294  
Not Applicable  |                      |  34422      |  30039  
South           |                      |  90264      |  60212  
West            |                      |  75449      |  40957  
                |                      |  77314      |  51059
                |  East North Central  |  112300     |  82659
                |  East South Central  |  58451      |  46423
                |  Middle Atlantic     |  161264     |  118556
                |  Mountain            |  38915      |  21122
                |  New England         |  25282      |  20663
                |  Not Applicable      |  34422      |  30039
                |  Pacific             |  133904     |  72693
                |  South Atlantic      |  83034      |  60539
                |  West North Central  |  39086      |  25150
                |  West South Central  |  138347     |  73264

Clearly the average births are higher in all areas than deaths (population is on the rise!), but we do have some areas that are booming baby-wise compared to the average deaths in that area. The average across the nation for 2015, is 77,314 births per state and 51,059 deaths per state.

Now that we've explored GROUPING SETS and have a good idea what to expect from this kind of report, let's have a look at the other options.

Rollup

ROLLUP, as you probably guessed, lets us create a hierarchical rollup starting with the primary group, then the secondary group, and so on... for however many levels we have. The hierarchy is based on the order the fields are listed in the GROUP BY clause. We've only got two levels so ours will be pretty straightforward, but let's get our state count, sum of estimated population, and average births and deaths in one query:

SELECT region,  
  division, 
  COUNT(*),
  SUM(POPESTIMATE2015) AS est_pop,
  CAST(AVG(BIRTHS2015) AS INTEGER) AS avg_births,
  CAST(AVG(DEATHS2015) AS INTEGER) AS avg_deaths
 FROM population
 WHERE state > 0 -- states only
 GROUP BY ROLLUP (region, division);

Let's check our rollup report:

region          |  division            | count  |  est_pop    |  avg_births  |  avg_deaths  
------------------------------------------------------------------------------------------
Midwest         |  East North Central  |  5     |  46787011   |  112300      |  82659  
Midwest         |  West North Central  |  7     |  21120392   |  39086       |  25150  
Midwest         |                      |  12    |  67907403   |  69592       |  49112  
Northeast       |  Middle Atlantic     |  3     |  41556307   |  161264      |  118556  
Northeast       |  New England         |  6     |  14727584   |  25282       |  20663  
Northeast       |                      |  9     |  56283891   |  70610       |  53294  
Not Applicable  |  Not Applicable      |  1     |  3474182    |  34422       |  30039  
Not Applicable  |                      |  1     |  3474182    |  34422       |  30039  
South           |  East South Central  |  4     |  18876703   |  58451       |  46423  
South           |  South Atlantic      |  9     |  63276764   |  83034       |  60539  
South           |  West South Central  |  4     |  39029380   |  138347      |  73264  
South           |                      |  17    |  121182847  |  90264       |  60212  
West            |  Mountain            |  8     |  23530498   |  38915       |  21122  
West            |  Pacific             |  5     |  52514181   |  133904      |  72693  
West            |                      |  13    |  76044679   |  75449       |  40957  
                |                      |  52    |  324893002  |  77314       |  51059

Now we've got summaries for each division within each region, a subtotal at the region level, and the grand total line with the 52 states represented.

Cubes

Next, we can use CUBE in the GROUP BY clause to get all the possible combinations. Basically this will give us a combined report of sets and of the hierarchical rollup. Our query:

SELECT region,  
  division, 
  COUNT(*),
  SUM(POPESTIMATE2015) AS est_pop,
  CAST(AVG(BIRTHS2015) AS INTEGER) AS avg_births,
  CAST(AVG(DEATHS2015) AS INTEGER) AS avg_deaths
 FROM population
 WHERE state > 0 -- states only
 GROUP BY CUBE (region, division);

And we get back "all-the-things":

region          |  division            | count  |  est_pop    |  avg_births  |  avg_deaths  
------------------------------------------------------------------------------------------
Midwest         |  East North Central  |  5     |  46787011   |  112300      |  82659  
Midwest         |  West North Central  |  7     |  21120392   |  39086       |  25150  
Midwest         |                      |  12    |  67907403   |  69592       |  49112  
Northeast       |  Middle Atlantic     |  3     |  41556307   |  161264      |  118556  
Northeast       |  New England         |  6     |  14727584   |  25282       |  20663  
Northeast       |                      |  9     |  56283891   |  70610       |  53294  
Not Applicable  |  Not Applicable      |  1     |  3474182    |  34422       |  30039  
Not Applicable  |                      |  1     |  3474182    |  34422       |  30039  
South           |  East South Central  |  4     |  18876703   |  58451       |  46423  
South           |  South Atlantic      |  9     |  63276764   |  83034       |  60539  
South           |  West South Central  |  4     |  39029380   |  138347      |  73264  
South           |                      |  17    |  121182847  |  90264       |  60212  
West            |  Mountain            |  8     |  23530498   |  38915       |  21122  
West            |  Pacific             |  5     |  52514181   |  133904      |  72693  
West            |                      |  13    |  76044679   |  75449       |  40957  
                |                      |  52    |  324893002  |  77314       |  51059
                |  East North Central  |  5     |  46787011   |  112300      |  82659
                |  East South Central  |  4     |  18876703   |  58451       |  46423
                |  Middle Atlantic     |  3     |  41556307   |  161264      |  118556
                |  Mountain            |  8     |  23530498   |  38915       |  21122
                |  New England         |  6     |  14727584   |  25282       |  20663
                |  Not Applicable      |  1     |  3474182    |  34422       |  30039
                |  Pacific             |  5     |  52514181   |  133904      |  72693
                |  South Atlantic      |  9     |  63276764   |  83034       |  60539
                |  West North Central  |  7     |  21120392   |  39086       |  25150
                |  West South Central  |  4     |  39029380   |  138347      |  73264

Now we have the data fully broken out in all the possible combinations.

Wrapping Up

In this article, we've looked at some new GROUP BY options in PostgreSQL 9.5 that let us create alternative reports (easily!) of aggregated data. We've seen how we can create a report based on sets, how we can create a hierarchical rollup, and how we can report on all the possible combinations of data according to the fields we specify. All of this, plus subtotals and grand totals included! These new options for aggregation will quickly become a favorite for PostgreSQL report developers and analysts.