PostgreSQL - Series, Random and With

We get to talk to people about databases every day at Compose and often end up introducing them to some new facet of a database they already use which will make their lives easier. Now, we're going to bring those useful snippets of knowledge to you, in case you didn't already know them. Today, we're talking PostgreSQL.

Generate series

The generate_series function in PostgreSQL is one of those handy swiss army knife functions. What it does is take a minimum, a maximum and an interval as parameters and generate a series of values based on those values. So what can you do with a series like that? Well you can quickly create a big populated table:

> CREATE TABLE bigone AS SELECT generate_series(1,1000000) AS a;
> SELECT * FROM bigone;
    a    
---------
       1
       2
       3
       4
       5
       6
       7
       8
       9
      10
      11
...
  999999
 1000000
(1000000 rows)

Which is useful to generate up some big tables for testing. But generate_series is more useful that that. For example, we can use it to generate up sequences of dates...

> SELECT * FROM generate_series(now() - interval '1 month', now(), interval '1 day');
       generate_series        
------------------------------
 2015-08-24 14:19:39.54009+00
 2015-08-25 14:19:39.54009+00
 2015-08-26 14:19:39.54009+00
 2015-08-27 14:19:39.54009+00
 2015-08-28 14:19:39.54009+00
 2015-08-29 14:19:39.54009+00
 2015-08-30 14:19:39.54009+00
 2015-08-31 14:19:39.54009+00
...
 2015-09-23 14:19:39.54009+00
 2015-09-24 14:19:39.54009+00
(32 rows)

The first parameter is now() minus a month, the second now(), the series increments by a day. You can use this as part of a join to create a table that evenly maps over a series of days, even when the original data skips days. That's useful when you're feeding your data into analytic processes.

ORDER BY RANDOM()

Here's a little something you can do but be very careful with it. If you want to get a random sample of data from your table, then ORDER BY RANDOM() could help. You have to LIMIT it of course otherwise you won't get a sample. So a query like...

SELECT * FROM films ORDER BY RANDOM() LIMIT 10;  

... would get you ten random items from the films table. But there's a price to pay as it has to pull in most of the table to do that random shuffle and then take ten items out.

If you've got a big table, you're going to start, well, hammering your database. There's no fix for that in PostgreSQL 9.4 but in PostgreSQL 9.5, we have support for SQL:2003's TABLESAMPLE coming. That can be used in a SELECT command and be given a percentage of a table's rows to retrieve. Check out this DSHL blog post for a closer look at it and the two algorithms backing it.

WITH and Common Table Expressions

Coming from MySQL to PostgreSQL, you can find yourself not making best use of PostgreSQL's much richer SQL environment. I was reminded of that when looking at a recent Write Stuff article Simple Index Checking with PostgreSQL. Working with the various performance tracking tables of PostgreSQL is very interesting, but it was the query behind the article which set me off. It starts like this...

with table_stats as (  
select psut.relname,  
  psut.n_live_tup,
  1.0 * psut.idx_scan / greatest(1, psut.seq_scan + psut.idx_scan) as index_use_ratio
from pg_stat_user_tables psut  
order by psut.n_live_tup desc  
),
...

I showed it to a peer and their first reaction was "With?". WITH is a powerful SQL command that you'll find in many databases that support the SQL-99 standard. In MySQL though, you have to work around the lack of the WITH statement or, if you've only been exposed to MySQL, never know it existed.

Aside: We're using the DVD Rental example as our database. It has film and actor tables, staff and customer tables, and all the intermediate relational tables to join them together.

WITH is an expressive way of working with a query and can make much more readable SQL queries. It basically allows you to create multiple transient tables for use in a query, all scoped to that one statement. Lets show a trivial example of the WITH syntax. You start a query with a WITH and then name a new transient table for this query followed by AS (...

WITH actor_names AS (  

Now you can define a query to create that table. This is called a CTE, a Common Table Expression and it looks just like a SELECT statement:

    SELECT actor_id, first_name || ' ' || last_name AS name FROM actor
)

We can now use that table of full names in the subsequent query which immediately follows the WITH.

    SELECT film.title, actor_names.name 
    FROM film,film_actor,actor_names 
    WHERE film.film_id=film_actor.film_id AND film_actor.actor_id=actor_names.actor_id;

and out comes...

            title            |         name         
-----------------------------+----------------------
 Wizard Coldblooded          | Penelope Guiness
 Westward Seabiscuit         | Penelope Guiness
 Vertigo Northwest           | Penelope Guiness
 Splash Gump                 | Penelope Guiness
 Rules Human                 | Penelope Guiness
 Oklahoma Jumanji            | Penelope Guiness
...

There's no limitations on your queries too so we could start with a standalone query like this

SELECT  film.title, array_agg(actor.first_name || ' ' || actor.last_name) as actor_list  
FROM film, film_actor, actor  
WHERE film.film_id = film_actor.film_id and film_actor.actor_id = actor.actor_id  
GROUP BY film.title  

Where we are building a cast list for the films and wrap it with a WITH so we can query the resulting table...

WITH film_cast AS (  
  SELECT  film.title, array_agg(actor.first_name || ' ' || actor.last_name) as actor_list
  FROM film, film_actor, actor
  WHERE film.film_id = film_actor.film_id and film_actor.actor_id = actor.actor_id
  GROUP BY film.title
)
SELECT * FROM film_cast WHERE array_length(actor_list,1)>10  

And now, we get a list of films with over ten cast members. The thing with WITH is that you can have a chain of those CTEs, each one creating tables the later CTEs can use. So we could combine the first and the second examples here and create a table of actor names first and use that.

WITH actor_names AS (  
  SELECT actor_id, first_name || ' ' || last_name AS name FROM actor
),
film_cast AS (  
  SELECT  film.title, array_agg(actor_names.name) as actor_list
  FROM film, film_actor, actor_names
  WHERE film.film_id = film_actor.film_id and film_actor.actor_id = actor_names.actor_id
  GROUP BY film.title
)
SELECT * FROM film_cast WHERE array_length(actor_list,1)>10  

All we've done here is got our film cast query to refer to the actor_names CTE. Now, what you express with WITH can be expressed with joins and other SQL syntax, but the ability to clearly show the steps of evaluation. WITH also makes for more composable queries, as you can gather queries together. If you want to see a more extensive, and more sensible, use of WITH, check out the index checking article we started with here.