PostgreSQL, Backups and everything you need to know

In his latest Write Stuff article Robert Wysocki explains PostgreSQL backups, how to pick which ones to do and what the pros and cons are.

You know this good ol' IT saying: "There are two types of people: those who back up their data and those who are yet to start". Backing up databases is especially important since they usually contain information critical not only for day-to-day business operation, but also for business continuity. People leave, software fails, the world changes. What remains is your data and if you have it, you can always start from scratch - with new software, new people, in a new world. Let's have a look at various backups strategies available for PostgreSQL databases so that we can prepare for the worst.

Backup types: physical vs logical

When talking about database backups, the first distinction mentioned is usually about filesystem-level snapshots vs SQL dumps. A database is essentially a bunch of files sitting somewhere on a drive in some data center. Backing up files is easy, isn't it? But from another point of view, a database can be perceived as a sequence of SQL statements used to create and manipulate the data. How hard can it be to repeat all these statements?

Both backup types are useful, but each comes with its own set of advantages and disadvantages. And both have some caveats.

Physical backups

Physical backups, also known as filesystem-level backups, are basically snapshots of files constituting the database. But with constant writes going to the underlying files, it can be quite hard to make these snapshots internally consistent. PostgreSQL documentation describes two key notions: continuous archiving and point-in-time recovery. These complement each other and it's important to understand them.

In order for physical backups to be consistent, we have to have a way of guaranteeing transaction atomicity and durability; in other words we have to make sure that we either commit the entire transaction or leave the database unchanged. And if we do commit, it has to be forever. PostgreSQL uses write-ahead logging to provide the A and D letters of the ACID acronym. Write-ahead log segments are exactly the files that the continuous archiving process takes care of, well, archiving. And the information kept in WAL segments allows the database engine to keep the data consistent but also to to bring it back to a consistent state after a crash.

While it is normal for some database files to change during a filesystem backup creation, not all the changes are safe and some may lead to an inconsistent snapshot being created. That's why, apart from some nice command line tools, PostgreSQL provides also a low-level API for taking physical backups. Executing pg_start_backups() before and pg_stop_backup() after ensures that there will be no unsafe changes done to the data files during the process. But still it's important to remember that a physical backup is useless without all the WAL segments generated between executing the two functions mentioned above.

The filesystem-level snapshot along with all the WAL segments necessary for its restoration is called a "base backup".

Stand-alone physical backups vs PITR-enabled physical backups

While it was said that continuous archiving and point-in-time recovery (PITR) complement each other, they can also be decoupled.

It is very useful to have continuous archiving configured so that you are always sure you have access to necessary WAL segments, but in order to have a functional base backup you only need the ones generated while the backup was being created. If you have only those segments and do not archive the ones generated after your backup is ready, you have a "stand-alone physical backup".

However, you can always continue to archive write-ahead log segments (and thus actually have continuous archiving). What that will give you is the ability to restore your database from the base backup you have and then continue to replay WAL segments on it. This is called "point-in-time recovery", since you can always stop replaying the WAL and still have a consistent state; so essentially you can restore your database to any point in time starting with the time you created your base backup.

Logical backups

Logical backups are also known as SQL dumps. This is because they contain SQL statements used to create the database schema and fill it with data. An SQL dump always represents a consistent state of the database taken at a given moment since the dumping process follows exactly the same semantics as any other ordinary database session. This also means that it places locks and, on the other hand, can be forced to wait for other sessions to release their locks.

PostgreSQL documentation gives an overview of the process. The dumping software simply goes through all the tables discovering the various schema and fetching all their rows. The process isn't overly complicated, though it has to be clever enough to know about the order in which to go through all the relations so that, when being restored, all the constraints can be met.

It is perfectly normal for SQL dumps to have the various tables' data spread over time meaning while you can have the last row of one table with timestamp A, some other table might have rows with timestamp B. If you have rules outside your database about how rows like that correlate, you should account for that yourself.

Entire instance logical backups vs single database logical backups

In one PostgreSQL installation, we can have multiple databases. This is referred to as "cluster" in the official documentation, but this term is confusing enough to not to use it, so let's call it simply an "instance".

Although dumping a single database is enough to restore all of its schema and data, there are also some other extra objects that could be necessary to bring the restored database to full operation.

All privileges are dumped with the database, but database roles, memberships and attributes are global objects that exist on the instance level and therefore are not included in a single database dump.

PostgreSQL provides two tools for creating logical backups. While pg_dumpall can be used to dump all the objects in an instance, it can also serve as complementary tool to the other tool and dump only the global objects. That other tool is pg_dump which is used to dump a single database from an instance.

The good, the bad and the caveat

Physical backups

ProsCons
No locking - physical backups do not cause lock contention and do not depend on other connections releasing their locksHigh I/O pressure - copying the entire data directory may put significant pressure on your storage, but this doesn't have to necessarily apply if you're able to use some sort of filesystem-level snapshotting
Entire instance backed up - there's no need to restore single databases - one restore process is enough for an entire clusterEntire instance backed up - there's no way to restore single databases or single tables, one has to restore all or nothing
Almost instantaneous restore process - no need to execute SQL statements to get data back Very large base backups - physical backups include all the indices and bloat and may therefore be much, much larger than SQL dumps
PITR capabilities - the ability to go back in time Need to archive WAL segments - you need to provide additional storage for binary log files
Just works - you just run postgres with the backed up directory and it does the jobBinary format - having backup in binary files means you are limited to restoring it using exactly the same PostgreSQL version it was created with

Logical backups

ProsCons
Less data to copy - logical backups do not copy all data files, so they may impose a lower I/O pressure on the system than physical backupsLocking - logical backups are both prone to and cause lock waits
Single database objects can be backed up - it's easy to backup and restore even a single table
Smaller backup files - SQL dumps do not contain index data or bloat, so they are much smaller than physical backupsLong restoration process - SQL statements have to be executed to get schema and data back
No need for storing extra files - backups are self-contained No PITR capabilities - a SQL dump can serve only as a snapshot of the data at the time it was created
Cross-version compatibility - SQL dumps are not dependant on restoring with the same version as was used to create them Extra effort required - care has to be taken to back up global objects and to tune the server to allow for faster restoration etc.

Compose.io supported backups

Compose gives you standalone physical backups. This means you should take extra care of the PostgreSQL version you want to use with your backups.

Since the standard pg_dump tool uses a regular database session, there's nothing stopping you from creating SQL dumps of a Compose-hosted database. But, since you don't have access to the database superuser account, you cannot use pg_dumpall to dump global database instance objects.

When it comes to restoring from backups, all you really can do is to use the ones available to you in Compose's WebUI. You might be able to partially restore your SQL dumps, but you'll get errors if you'll try to restore them in full.

Robert M. Wysocki is a freelance PostgreSQL consultant, experienced GNU/Linux administrator and author of travellingsysop.net technical blog

This article is licensed with CC-BY-NC-SA 4.0 by Compose.

Image by Tony Webster