TL;DR: Core to our PostgreSQL High Availability system is the Governor application which uses Etcd as its repository of truth to discover which database instance is leader.
When we decided to bring PostgreSQL to the Compose platform, we already knew that we'd have to come up with a high availability (HA) solution. We run a lot of databases at Compose and most of them have come with some HA option out of the box. PostgreSQL isn't like that though and although there are HA tools out there, we looked at them and concluded they didn't fit our flexible, highly-configurable way of deploying databases on the Compose platform.
So, we had to build our own system for PostgreSQL HA and rather than reinvent the wheel we used some common tools and techniques to accomplish this, such as etcd, HAProxy and PostgreSQL's own WAL streaming. Let's first look at how these components are arranged over a typical set of four hosts.
Etcd is the repository of truth in our HA configuration when it comes to working out which instance of PostgreSQL is leader. If you've not come across it, Etcd is an open-source distributed key/value store which you can talk to via an HTTP API. We run it on each node and it automatically distributes and replicates key/value changes through the cluster using the Raft consensus algorithm to keep things coordinated and resistant to faults. Among Etcd's capabilities are setting a time-to-live on key/values, allowing information to expire, and doing "check and set" atomic operations on the values in its distributed store. These are both capabilities we exploit.
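To make those two capabilities concrete, here is a toy in-memory model of the semantics: keys that expire after a time-to-live, and writes that only succeed if a precondition holds. This is an illustration of the behaviour we rely on, not etcd's actual API or implementation.

```python
import time

# A toy model of the two etcd features described above: keys with a
# time-to-live, and atomic "check and set" writes. Class and key names
# here are illustrative only.
class ToyEtcd:
    def __init__(self):
        self._store = {}  # key -> (value, expires_at or None)

    def _alive(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and time.monotonic() >= expires_at:
            del self._store[key]  # TTL elapsed: the key vanishes
            return None
        return value

    def set(self, key, value, ttl=None, prev_exist=None, prev_value=None):
        """Return True if the write was accepted, False otherwise."""
        current = self._alive(key)
        if prev_exist is False and current is not None:
            return False  # someone already set the key
        if prev_value is not None and current != prev_value:
            return False  # the caller no longer owns the key
        expires_at = (time.monotonic() + ttl) if ttl else None
        self._store[key] = (value, expires_at)
        return True

    def get(self, key):
        return self._alive(key)

etcd = ToyEtcd()
print(etcd.set("/service/pg/init", "node-a", prev_exist=False))  # True: first writer wins
print(etcd.set("/service/pg/init", "node-b", prev_exist=False))  # False: key already exists
```

Both the expiring key and the conditional write come up again below, in the leader race and the leader heartbeat.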
Etcd runs on the nodes where PostgreSQL runs and on one dedicated etcd-only node. On the PostgreSQL nodes, the Governor application then takes care of running the PostgreSQL database itself. On all these nodes, etcd is set to run automatically and form a cluster. The only node not running etcd is the HAProxy node, which will handle incoming connections.
PostgreSQL's role in the HA process is to be a database and to use WAL (Write-Ahead Log) streaming to move data from the leader node to the follower nodes once it has been told what its role is. That role is determined by the Governor using etcd.
Governing the uninitialized database...
When a previously uninitialized node comes up, the Governor for that node will enter a leader race. It does that by setting an init key in etcd using the parameter prevExist=false. This parameter means that if there's a previously existing value set for the init key, the attempt to set it will fail. The first node to successfully set it gets to become leader, start up PostgreSQL, initialize the database and set another value in etcd, the leader key. Everyone else waits to see what the leader key is set to and then starts their PostgreSQL up so that it follows the owner of the leader key.
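The race above can be sketched in a few lines, using a Python dict's setdefault() as a stand-in for etcd's atomic prevExist=false write. Node and key names are illustrative, not Governor's real ones.

```python
# A sketch of the leader race: the first Governor to claim the init
# key becomes leader; everyone else follows the owner of the leader key.
store = {}

def race_for_init(node):
    # setdefault only writes if the key is absent and returns whatever
    # value ends up stored -- an atomic first-writer-wins step, like
    # etcd's prevExist=false.
    return store.setdefault("init", node) == node

def start_node(node):
    if race_for_init(node):
        # Winner: initialize PostgreSQL and publish itself as leader.
        store["leader"] = node
        return f"{node}: initdb, start as leader"
    # Everyone else waits for the leader key, then follows its owner.
    leader = store.get("leader")
    return f"{node}: start PostgreSQL following {leader}"

print(start_node("pg-node-1"))  # pg-node-1: initdb, start as leader
print(start_node("pg-node-2"))  # pg-node-2: start PostgreSQL following pg-node-1
```

In the real system each step is an HTTP request to etcd, so the atomicity comes from the store rather than from anything local to a node.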
When any Governor does set the leader key, it's set using a TTL (time-to-live) parameter of 30 seconds. That means it has to check back regularly and reset the leader key, if, that is, it still owns it. It uses another etcd parameter, prevValue, to check that ownership is being maintained. Other Governors keep checking the leader key in case it becomes empty – if it does, that means the leader has vanished...
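The heartbeat cycle might look something like the sketch below. The read_leader_key and update_leader_key callables are hypothetical stand-ins for the real etcd requests; the 30-second TTL is the one described above.

```python
# One heartbeat cycle for a Governor that believes it is leader: it
# refreshes the leader key with a fresh 30s TTL, but only if the key's
# current value still matches its own name (etcd's prevValue check).
def heartbeat(my_name, read_leader_key, update_leader_key):
    if read_leader_key() != my_name:
        return "demote"  # lost ownership: stop acting as leader
    update_leader_key(my_name, ttl=30)  # refresh before the TTL runs out
    return "still leader"

# Illustrative stand-ins for the etcd calls:
state = {"leader": "pg-node-1"}

def read_leader_key():
    return state.get("leader")

def update_leader_key(value, ttl):
    state["leader"] = value

print(heartbeat("pg-node-1", read_leader_key, update_leader_key))  # still leader
print(heartbeat("pg-node-2", read_leader_key, update_leader_key))  # demote
```

If the leader stops heartbeating, the TTL simply runs out and the key disappears, which is what the other Governors are watching for.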
Governing the initialized database...
If the Governor for a node detects data in the PostgreSQL data directory, it assumes the node has been initialized and takes a different approach to starting up. It connects to etcd and then starts PostgreSQL up as a secondary, leader-less database. Once it is up and running, it then goes to see who owns the leader key in etcd. If it owns it, then it immediately tells the PostgreSQL database that it is leader. But if another host owns the key, the Governor tells PostgreSQL to follow that host. If no one has the leader key it runs health checks and takes over as leader.
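That startup decision reduces to three cases, sketched below. The health checks are collapsed into a single boolean for brevity, and the names are illustrative.

```python
# The decision an already-initialized node's Governor makes after
# starting PostgreSQL as a leader-less secondary, as described above.
def on_startup(my_name, leader_key, passes_health_checks):
    if leader_key == my_name:
        return "promote"              # we already own the leader key
    if leader_key is not None:
        return f"follow {leader_key}"  # someone else is leader
    if passes_health_checks:
        return "take over as leader"   # no leader: try to claim the key
    return "wait"

print(on_startup("pg-node-1", "pg-node-1", True))  # promote
print(on_startup("pg-node-2", "pg-node-1", True))  # follow pg-node-1
print(on_startup("pg-node-2", None, True))         # take over as leader
```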
When a Governor is running the leader, it regularly posts what its position in the Write Ahead Log is into a leader WAL position key in etcd. Follower nodes do something similar, posting their positions in the WAL into a data node WAL positions value. That comes into play next when something goes astray...
Governing when the leader goes
Each Governor keeps an eye on that leader key. If it changes, the Governor tells PostgreSQL to follow the new leader. If it doesn't change, the Governor waits a cycle before checking again.
It's when that leader key disappears that the failure handling process kicks in. Each Governor will, on seeing a missing key, start working out whether it is a suitable candidate to become leader. It does this by looking at the leader WAL position data in etcd. If its own WAL position is within 1MB of the leader's last reported position, and it passes some other health checks, the Governor moves on to the next stage of accession.
That means checking if anyone's more suited to take the leader role and to do this the Governor looks at the data node WAL positions to see if anyone is closer than its local PostgreSQL to where the previous leader was. If there is, it steps back and waits for that node to become leader.
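The two checks above can be sketched as a single predicate. WAL positions are modelled here as plain byte offsets, and the 1MB threshold comes from the text; everything else is illustrative.

```python
# A sketch of the failover candidacy checks: a Governor only takes over
# if (1) it is within 1MB of the old leader's last reported WAL position
# and (2) no other follower is closer to that position than it is.
MAX_LAG_BYTES = 1024 * 1024  # the 1MB threshold described above

def should_take_over(my_pos, leader_pos, follower_positions):
    if leader_pos - my_pos > MAX_LAG_BYTES:
        return False  # too far behind the old leader
    # Step back if any other follower is ahead of us.
    return all(my_pos >= pos for pos in follower_positions)

leader_pos = 10_000_000
print(should_take_over(9_900_000, leader_pos, [9_800_000, 9_500_000]))  # True
print(should_take_over(9_800_000, leader_pos, [9_900_000]))             # False: another node is ahead
print(should_take_over(8_000_000, leader_pos, []))                      # False: over 1MB behind
```

The point of the second check is to avoid unnecessary data loss: the node that replicated furthest should win, so everyone else yields to it.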
The final step is to take over the leader key: the Governor will attempt to set it with TTL=30. If it succeeds, the leader role is taken and the Governor promotes the local PostgreSQL to leader. If it fails, it waits for another node to take the leader role.
Connecting to the database
With all this going on, there is one other task the Governor takes on. The HAProxy node is configured to ask a Governor, any Governor, who's currently leader. The HAProxy node then uses TCP routing to send its traffic to that node so you can have an uninterrupted interaction with your database.
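An illustrative haproxy.cfg fragment for that arrangement is below. It assumes each Governor exposes an HTTP health endpoint that returns 200 only on the current leader; the port number (8008), addresses and server names here are guesses for the sake of the example, not the production configuration.

```
# Route client connections on 5432 to whichever PostgreSQL node's
# Governor currently reports itself as leader.
listen postgresql
    bind *:5432
    option httpchk GET /
    http-check expect status 200
    default-server inter 3s fall 3
    server pg-node-1 10.0.0.1:5432 check port 8008
    server pg-node-2 10.0.0.2:5432 check port 8008
    server pg-node-3 10.0.0.3:5432 check port 8008
```

Because only the leader's health check passes, HAProxy marks the followers down and all traffic flows to the leader; after a failover, the checks flip and traffic follows the new leader automatically.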
Simply and Safely Available with the Governor
We came up with our PostgreSQL solution to ensure we had a simple, coherent, maintainable way of managing high availability for our clusters. The design was guided by our platform's optimized footprint and is easily deployed without complex configuration. Etcd provides the foundation and its HTTP key/value features let us build an effective Governor application for PostgreSQL.