Cross-Database Querying in Compose PostgreSQL

In Compose PostgreSQL you can now perform cross-database queries using some extensions we've recently made available: postgres_fdw and dblink. In this article we'll walk through how to set up your database to take advantage of them.

Both the postgres_fdw and dblink extensions permit you to query, update, insert, or delete data in one PostgreSQL database from a different one, but they work in different ways. We'll take a closer look at each one below.

postgres_fdw

The PostgreSQL foreign data wrapper, postgres_fdw, is now available for new deployments running PostgreSQL 9.5.3. It's the first foreign data wrapper the Compose development team has approved for customer use after passing our security evaluation; additional foreign data wrappers are expected to follow to support interaction with other Compose databases. The postgres_fdw extension is essentially a PostgreSQL-to-PostgreSQL connector for databases that may be on the same or different hosts. Let's set one up.

1) Install the extension

You'll first need to install the extension. Support for PostgreSQL extensions is one of the new features we introduced earlier this month.

In the Compose administrative console, select your PostgreSQL deployment and then Browser. Next, click on the database where you want to install the foreign data wrapper. From there, click on Extensions in the menu on the left and scroll down to postgres_fdw. On the far right you'll see the "install" button. Click it and wait a few seconds for the extension to be installed. Once it's installed, it will have the green indicator and the button on the far right will now say "remove" (if you ever want to remove the extension). You'll notice a plpgsql extension is already installed.


Note that if you instead try to run the CREATE EXTENSION SQL command for postgres_fdw, you will be able to create the extension in the database, but it won't have the underlying functions required for it to run on the Compose platform. If you've done this, you'll need to run DROP EXTENSION and then install the extension through the Compose console in order to use postgres_fdw in your Compose PostgreSQL database.
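If you've ended up in that state, dropping the half-configured extension is a one-liner run as the admin user; you can then reinstall it from the console as described above:

```sql
-- Remove the extension created via raw SQL so it can be
-- reinstalled properly through the Compose console.
DROP EXTENSION postgres_fdw;
```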

When you install postgres_fdw through the Compose console, it automatically grants usage on the foreign data wrapper to the admin user.

2) Create the server connection

Next, via your SQL interface of choice, you'll need to run the CREATE SERVER command as the admin user to set up the connection to the foreign database. You'll need to provide the host and port information even if the database where you installed the foreign data wrapper is on the same host as the database you're connecting to. Here's what ours looks like:

CREATE SERVER segment  
FOREIGN DATA WRAPPER postgres_fdw  
OPTIONS (host 'aws-us-east-1-portal.0.dblayer.com', dbname 'segment', port '10100');  

We've named our server connection "segment" since we're going to be connecting to our Segment warehouse database, which we implemented at the end of last year.

3) Create the user mapping

Now, we'll run CREATE USER MAPPING for our server connection. The user you're creating the mapping for is a user with permissions in the current database (we're just using our admin user), but the user and password you supply in the options of this command need to be valid credentials with the required permissions on the foreign database. Let's run it:

CREATE USER MAPPING FOR admin  
SERVER segment  
OPTIONS (user '<foreign_db_user>', password '<foreign_db_user_password>');  

4) Import the data

So, now we've got the connection set up and the user credentials we'll need to access the foreign database. The next step is importing the data. We can either import individual tables or, new in PostgreSQL 9.5, import a whole schema, bringing in all of its tables and views or limiting the import to just the ones we specify.

To import individual tables, use the CREATE FOREIGN TABLE command, defining the table columns you want to import (you can include all of them or only a few), like this:

CREATE FOREIGN TABLE aliases (  
  id character varying(254) NOT NULL,
  received_at timestamp with time zone,
  context_library_name text,
  context_library_version text,
  original_timestamp timestamp with time zone,
  previous_id text,
  sent_at timestamp with time zone,
  user_id text,
  "timestamp" timestamp with time zone
)
SERVER segment  
OPTIONS (schema_name 'production', table_name 'aliases');  

Note that the schema in our segment database is called "production". Use whatever schema name your database uses for the table you're importing; typically this will be the "public" schema.

If you don't want to import individual tables one-by-one, then import the schema instead using IMPORT FOREIGN SCHEMA:

IMPORT FOREIGN SCHEMA production  
LIMIT TO (tracks, pages)  
FROM SERVER segment INTO public;  

In this example, we're importing the "production" schema from our segment database, we're limiting the import to just the two tables we're most interested in (one is called "tracks" and one is called "pages"), and we're importing into our current schema, which is the "public" schema.

The LIMIT TO clause allows you to specify only the tables or views you want from the schema. If you omit it, all the tables and views from the schema will be imported. That can be a bonus if you don't want to run CREATE FOREIGN TABLE for each and every table or view you need. The other great thing about using IMPORT FOREIGN SCHEMA rather than CREATE FOREIGN TABLE is that the column definitions are read from the foreign database for you. Note, though, that the imported definitions are a snapshot: if the table structure later changes in the foreign database (such as a column being added or dropped), you'll need to re-import the schema or alter the foreign table to pick up the change. With CREATE FOREIGN TABLE, you're defining a static table structure by hand, so to accommodate changes from the foreign database you'd likewise have to alter the table or drop it and recreate it with the new definition.
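For instance, if the remote aliases table gained a new column, you could add the matching column to the local definition with ALTER FOREIGN TABLE rather than dropping and recreating it (the column name here is a hypothetical illustration):

```sql
-- Hypothetical: the remote "aliases" table gained a "context_ip" column.
-- Add the matching column to the local foreign table definition so
-- queries against the foreign table can see it.
ALTER FOREIGN TABLE aliases
  ADD COLUMN context_ip text;
```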

5) Query your data

Now we've got a couple tables from our segment database available to us in our current database, which in our situation is our internal data warehouse that contains our accounts information, among other things. We can now write queries to tie the data together from these two databases in order to get richer reports. Here's an example of a simple query that joins the tracks table from the segment database to the accounts table in our current database to give us a count of the number of times each account has signed in (we're tracking "Sign In" events via Segment):

SELECT a.id, a.name, COUNT(t.user_id) AS occurrences  
FROM accounts a  
JOIN tracks t ON t.user_id = a.id  
WHERE t.event_text = 'Sign In'  
GROUP BY a.id, a.name  
ORDER BY occurrences DESC;  

Trying to run a query like this without having the foreign data wrapper set up (or dblink, discussed below) would yield a "cross-database references are not implemented" error. Luckily we don't need to worry about that anymore!
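For reference, this is the kind of naive attempt that triggers the error; PostgreSQL doesn't allow a query to name another database directly:

```sql
-- Without postgres_fdw or dblink, qualifying a table with another
-- database's name fails:
-- ERROR:  cross-database references are not implemented: "segment.production.tracks"
SELECT * FROM segment.production.tracks;
```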

How to check your setup

Keep in mind that the postgres_fdw extension only works for the admin user, and only on new deployments running version 9.5.3. If you try to perform the above steps on earlier deployments, you'll get "permission denied" errors.

Also, you won't see the foreign tables (or any database reference) in the Compose data browser for your foreign data so it can be difficult to remember what server connections you've created and what tables are available. To remind yourself, you can look in a couple of different system catalogs: pg_foreign_server for the servers and pg_foreign_table for the tables.

Let's check the server connections we have implemented:

SELECT * FROM pg_foreign_server;  

Here's our segment database connection:

srvname | srvowner | srvfdw | srvtype | srvversion | srvacl | srvoptions  
--------------------------------------------------------------------------------------------------------------------------------------
segment | 16384    | 16399  |         |            |        | {host=aws-us-east-1-portal.0.dblayer.com,dbname=segment,port=10100}  

The owner and fdw columns for the server contain the internal object ID reference for the admin user as owner and the postgres_fdw extension as the fdw. The type, version, and acl columns are NULL because we did not specify those parameters when we created the server connection, though you can specify them if you need to.

Let's look at the tables available:

SELECT * FROM pg_foreign_table;  

We get the following:

ftrelid | ftserver | ftoptions  
----------------------------------------------------------------
16469   | 16464    | {schema_name=production,table_name=aliases}  
16472   | 16464    | {schema_name=production,table_name=pages}  
16475   | 16464    | {schema_name=production,table_name=tracks}  

Even though we added the aliases table using CREATE FOREIGN TABLE and the pages and tracks tables using IMPORT FOREIGN SCHEMA, all three show up the same way in this query. We can see that the tables in this case all come from the same foreign server connection (16464, the internal object ID reference for the connection to the segment database) and that each table has its own relation ID (again, an internal object ID reference). As we add other server connections and more tables, these two queries help us keep track of what we've done.

Read, but don't touch

So far we've only looked at querying the data in our foreign tables using SELECT, but we could also perform INSERT, UPDATE, and DELETE statements on the data. Depending on your situation, this may be one of the benefits of the foreign data wrapper, but if you'd prefer "read only" access for the foreign data, you can change that behavior with the updatable option.

The updatable option can be used for the server connection as a whole to prevent any table from being updated (by default updatable is "true" when the server connection is created). We'd set it to "false" with the following command:

ALTER SERVER segment  
OPTIONS (ADD updatable 'false');  

We can override the server-level configuration by setting updatable on any individual table. To set a specific foreign table as "read only", we'll alter the table by adding the updatable option set to "false" (if you'd already set the server option as "false", you could use this to set a specific table to "true" to allow updates for only that table):

ALTER FOREIGN TABLE aliases  
OPTIONS (ADD updatable 'false');  

We're all set now to query across a couple different PostgreSQL databases using the postgres_fdw extension. Next, we'll look at how to do the same thing, but using the dblink extension instead.

dblink

Though dblink predates foreign data wrappers in PostgreSQL for Postgres-to-Postgres connections, it's also newly available on the Compose platform for your Postgres deployments. Let's look at how to use it.

1) Install the extension

Just like with postgres_fdw, you'll need to install the module from the Compose console Extensions page in order for it to work properly in your Compose PostgreSQL deployment. In the Compose administrative console, navigate to the database where you want to install dblink, then click the "install" button on the far right:


Just like with postgres_fdw, this will install a set of functions behind the scenes (a whole bunch for this extension!) and grant the appropriate permissions that allow it to work on the Compose platform.

2) Connect to the server

One key difference between foreign data wrappers and dblink is that, with the foreign data wrapper, a server connection only needs to be created once and then it remains available across sessions until you run a DROP SERVER command. With dblink, the server connection is good for the life of the current session only, or until you disconnect it during the current session. Once you disconnect from your database server, the dblink server connection is gone, so you have to create it anew in each session where you want to use it.

To create a connection, use the dblink_connect() function, first specifying a name for the connection (ours is "segment"), followed by a connection string of space-separated keyword=value pairs for the required server connection parameters:

SELECT dblink_connect('segment', 'host=aws-us-east-1-portal.0.dblayer.com port=10100 dbname=segment user=<foreign_db_user> password=<foreign_db_user_password>');  

If the command is successful, we'll get an "OK" response.

3) Query your data

There are a handful of options for how to query data using dblink. We'll have a look at each of them and you can decide which one works best for your situation.

Direct query

A direct query is the most straightforward option. It lets us run a SQL SELECT statement from within the dblink command; the statement executes on the foreign server and the complete result set comes back once it finishes.

SELECT *  
FROM dblink('segment', 'SELECT event_text, COUNT(user_id) AS user_count  
                        FROM production.tracks
                        WHERE event_text = ''Sign In''
                        GROUP BY event_text;')
AS event_count(event_text text, user_count integer);  

In this example, we're passing in the name of our server connection "segment", then the query we want to run. We then have to specify field names and data types for the output table, which we're calling "event_count". The table acts as a derived table which can be joined to other tables in your current database. From this query example, though, we don't have anything we'd join on. What we'll get returned is the total number of times the "Sign In" event has occurred.

Note that because there was not a schema parameter for our server connection in the previous step, we need to include the schema name with the table name, "production.tracks" in this case. Also, note the duplicated single quotation marks around the text value "Sign In". These are needed to escape the quotation marks since we're running a query within a query.

Query by cursor

If our situation calls for cycling through the results in a more methodical, row-by-row way rather than pulling back the whole result set at once, we can open a cursor.

Open the cursor using the dblink_open command, passing the connection name, a name for the cursor, and a SELECT statement with the fields we'll want to retrieve:

SELECT dblink_open('segment', 'events', 'SELECT id, event_text, user_id  
                                         FROM production.tracks
                                         WHERE event_text = ''Sign In'';');

We've named our open cursor "events". Next, we'll retrieve one row at a time using dblink_fetch:

SELECT id, event_text, user_id FROM dblink_fetch('segment', 'events', 1)  
AS (id character varying(254), event_text text, user_id text);  

Note that we have also specified the field names and data types for the resulting row.

Once we've performed whatever row-by-row processing we wanted to do with our open cursor, we need to close it using the dblink_close command:

SELECT dblink_close('segment', 'events');  

Perform an async query

The final option for querying with dblink is the async (asynchronous) query. Here the query is sent to the foreign server without waiting for it to finish; we retrieve the result set with a separate call once it's ready.

There are two parts to an async query: sending the query and then getting the result set.

Use dblink_send_query for sending the request:

SELECT dblink_send_query('segment', 'SELECT id, event_text, user_id  
                                     FROM production.tracks
                                     WHERE event_text = ''Sign In'';');

dblink_send_query returns immediately, giving back 1 if the query was dispatched successfully. When the results are ready, we use dblink_get_result to retrieve them:

SELECT *  
FROM dblink_get_result('segment')  
AS events(id character varying(254), event_text text, user_id text);  

If we're waiting on the results, we can poll dblink_is_busy with our connection name to see whether the foreign server is still working on our request; if we've waited too long, dblink_cancel_query lets us cancel the query altogether.
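A quick sketch of those status calls, both of which take the connection name we created earlier:

```sql
-- Returns 1 while the connection is still busy with the async query, 0 when done.
SELECT dblink_is_busy('segment');

-- Asks the foreign server to cancel the in-progress query; returns 'OK' on success.
SELECT dblink_cancel_query('segment');
```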

4) Disconnect

Finally, when we're done with the server connection, we need to disconnect:

SELECT dblink_disconnect('segment');  

Again, we'll get an "OK" response for a successful disconnect.

As mentioned, the Postgres foreign data wrapper is newer to PostgreSQL so it tends to be the recommended method. While the functionality in the dblink extension is similar to that in the foreign data wrapper, the Postgres foreign data wrapper is more SQL standard compliant and can provide improved performance over dblink connections.

Also, unlike the foreign data wrapper, dblink does not have a way to make the data "read only". The data in the foreign tables can be modified by building and executing an INSERT, UPDATE, or DELETE statement using the dblink commands defined for these operations. If your situation calls for the foreign data to not be modifiable, then you'll want to stick with the foreign data wrapper and make sure to set the updatable option appropriately.
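For example, a write to the foreign data can be pushed through dblink_exec, which runs a statement on the foreign server over an open connection (the UPDATE here is a hypothetical illustration):

```sql
-- Hypothetical write: executes the statement on the foreign server over the
-- 'segment' connection and returns the command status (e.g. 'UPDATE 1').
SELECT dblink_exec('segment',
  'UPDATE production.tracks
   SET event_text = ''Signed In''
   WHERE event_text = ''Sign In'';');
```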

Then there's connection persistence. dblink connections are only good for a given session and need to be recreated each time, while a foreign data wrapper's server definition persists until you drop it. Either behavior could be advantageous or disruptive, according to your needs.

On the plus side for dblink, it has a slew of specialized commands and lends itself easily to programmatic processing.
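As a sketch of that programmatic flavor, the cursor commands from earlier can drive a row-by-row loop inside a PL/pgSQL block (this assumes the "segment" connection from above is open; the processing step is just a placeholder NOTICE):

```sql
-- Sketch: cursor-driven row-by-row processing over an open dblink connection.
DO $$
DECLARE
  rec record;
BEGIN
  PERFORM dblink_open('segment', 'events',
    'SELECT id, event_text FROM production.tracks WHERE event_text = ''Sign In'';');
  LOOP
    SELECT * INTO rec
    FROM dblink_fetch('segment', 'events', 1)
      AS t(id character varying(254), event_text text);
    EXIT WHEN NOT FOUND;  -- cursor exhausted
    RAISE NOTICE 'processing event %', rec.id;  -- placeholder for real work
  END LOOP;
  PERFORM dblink_close('segment', 'events');
END $$;
```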

Another current benefit of dblink over postgres_fdw is that it will work on earlier Compose PostgreSQL deployment versions (we ran ours on a deployment sporting the 9.4 version). This can be an option for those who don't want to spin up a new PostgreSQL deployment in order to use postgres_fdw.

Regardless of which method will work best for you at this time, we're happy to be able to provide them both to you. Now you can query across your different PostgreSQL databases with ease. Stay tuned for more cross-database query options coming in the near future.