Connecting and loading data to PostgreSQL with Perl


Let's walk through a handy little Perl script that connects to PostgreSQL and loads a CSV data file. We'll also see how to run a simple query on the data after it's loaded. We assume you have some familiarity with PostgreSQL and a bit more with Perl, but we think it's also easy to follow along if you're just getting started.

Getting Ready

If you want to try this at home, you'll need to make sure you have the Perl DBI package and the DBD::Pg driver for PostgreSQL installed in your Perl distribution's library. You can get both of these from CPAN or MetaCPAN.

For our examples here, we are running version 1.634 of DBI and version 3.5.3 of DBD::Pg on the Strawberry Perl distribution on Windows. We installed the modules using the cpan command-line interface in Strawberry Perl by running install DBI and, once that was finished, running install DBD::Pg. These modules didn't give us any hassle during the install, which always makes us happy (less mature or multi-dependency modules can sometimes be cranky).

Now that we have DBI installed in our Perl library, we'll reference it at the beginning of our script:

use DBI;  

DBI (which stands for "Database Interface") is a really rich package with a lot of database-related functionality built-in. DBD::Pg (DBD stands for "Database Driver") augments DBI with Postgres-specific attributes and methods.

Getting the Data

For this article, we've downloaded the US Census population estimates CSV file from the American FactFinder application to a local directory because we'll be using the data for a statistical report we'll be creating in the future. As you'll see, though, with some minor changes you could use this script for loading any delimited data file to PostgreSQL.

A quick review of the CSV file shows us that it has 58 rows, including the header row, with 91 fields. Almost all of the data is numeric and there are no null values. There is only one text field - "NAME" - and the values are not quoted. The last 30 fields are decimal.

If you work with data, you'll know it's never perfectly clean. We did find two values for Puerto Rico that were populated as an "X" instead of an integer in an otherwise numeric field so we manually changed those to a "0" in the file to suit our purposes.
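If you'd rather script that kind of cleanup than edit the file by hand, a small helper along these lines works (this is our own sketch, separate from the loading script, and the field rule is deliberately simple):

```perl
use strict;
use warnings;

# Replace any field that is exactly "X" with "0"; values like "TX"
# are left alone. Sketch only -- adapt the rule to your own data.
sub clean_line {
    my ($line) = @_;
    chomp $line;
    my @fields = split /,/, $line, -1;   # -1 keeps trailing empty fields
    return join ',', map { $_ eq 'X' ? '0' : $_ } @fields;
}

print clean_line("72,0,0,0,Puerto Rico,X,X\n"), "\n";
# prints: 72,0,0,0,Puerto Rico,0,0
```

You'd read the CSV line by line, pass each line through clean_line, and write the results out to a new file before loading.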

Alright, now we know what we've got to work with!


Now that we've got the data we'll be loading, let's connect to our database. In preparation for loading this data, we created a database called "census" via the Compose data browser. Note that the "compose" database is created automatically for every PostgreSQL deployment, but we wanted to create one to hold census data with the expectation we'll be adding more than just population estimates over time.


We'll start our script by putting all the connection information we'll need into variables, which we can get from the connection string in the Compose overview page for our deployment.


The only difference is that we'll be accessing the new "census" database instead of the "compose" database.

# Connection config
my $dbname = 'census';  
my $host = '';  
my $port = 10221;  
my $username = 'user';  
my $password = 'password';  

Next, we'll create our DB handle object by connecting:

# Create DB handle object by connecting
my $dbh = DBI -> connect("dbi:Pg:dbname=$dbname;host=$host;port=$port",  
                            $username, $password,
                            {AutoCommit => 0, RaiseError => 1}
                         ) or die $DBI::errstr;

Here, you can see we're calling DBI with the connect method and telling it to use the DBD::Pg driver to connect to our database at the configured host and port. We're also passing it our login credentials and setting some attributes for the connection. In this case, we're turning off auto-commit so that changes are only saved when we explicitly commit them, and we've asked it to raise errors so we're aware of them. Note that RaiseError will cause the script to die if errors are encountered, leaving any uncommitted changes to be rolled back. Alternatively, you could use PrintError, which only gives warnings and does not cause the script to die. RaiseError is typically the better way to go so that you don't introduce issues into your database. Finally, if the connection itself is not successful, we want the script to die and give us the error.
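To make that rollback behavior concrete, here's a minimal sketch of the pattern (reusing the $dbh handle from above; the statement is just an example): with RaiseError set, a failed statement throws an exception that eval can catch, letting us roll back explicitly rather than relying on the rollback at disconnect:

```perl
# Sketch only: with AutoCommit => 0 and RaiseError => 1, any failed
# statement throws an exception; eval catches it so we can roll back.
eval {
    $dbh->do('DROP TABLE IF EXISTS population');
    $dbh->commit;
};
if ($@) {
    warn "Transaction aborted, rolling back: $@";
    $dbh->rollback;
}
```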

We're also going to create a trace log so that all database interactions via our connection are being logged for debugging purposes:

# Trace to a file
$dbh -> trace(1, 'tracelog.txt');

Note that trace (which we're using here) and the related trace_msg should not be confused with pg_server_trace (and the corresponding pg_server_untrace). These PostgreSQL-specific tracers write very verbose and fairly cryptic server debugging information from PostgreSQL. Unless you have some expertise with this information and a use case where tracing it makes sense, don't bother with these PostgreSQL utilities.

Create the Table

Now we'll create our data table. We're going to name it "population". Knowing there are 91 fields, we might actually create several different tables from this data, but to keep things simple for the purposes of this article, we'll put the whole data file into a single table. Also, because we scoped out the data types for each of the fields when we reviewed the file, it's a straightforward matter to write a CREATE statement for the whole thing using the header row for the field names.

First, however, we'll check and see if the table already exists and drop it if it does:

# Drop table if it already exists
my $SQL = "DROP TABLE IF EXISTS population";  
my $sth = $dbh->do($SQL);  
$dbh->commit or die $DBI::errstr;

Here we've written a DROP TABLE IF EXISTS SQL statement and stored it to a variable called $SQL that we'll re-use for each of our SQL statements. We then call the do method on it using our database connection handle object $dbh and creating a statement handle $sth for the database. Finally, we're telling our connection to commit or die with an error.

Now for our CREATE TABLE statement, which we've truncated here at the ". . . ." so as not to show you all 91 fields!

# Create a table
my $SQL = "CREATE TABLE population (  
                  ID SERIAL PRIMARY KEY,
                  SUMLEV INTEGER,
                  REGION INTEGER,
                  DIVISION INTEGER,
                  STATE INTEGER,
                  NAME VARCHAR(100),
                  CENSUS2010POP INTEGER,
                  . . . .
                  RNETMIG2014 DOUBLE PRECISION)";
my $sth = $dbh->do($SQL);  
$dbh->commit or die $DBI::errstr;

We're using four different data types: SERIAL for our primary key, INTEGER for the numeric fields, VARCHAR for the one text field, and DOUBLE PRECISION for the decimal fields. Our primary key field "ID" is something we're adding on load to PostgreSQL which will auto-increment for each row inserted - it's not something that exists in the CSV file. Again, we're calling the do method on the SQL statement and then committing.

From our trace log, we can see that our table was created successfully and "1 items" was committed:

    <- do('DROP TABLE IF EXISTS population')= ( '0E0' ) [1 items] at line 22
    <- commit= ( 1 ) [1 items] at line 23
    <- do('CREATE TABLE population (
                  ID SERIAL PRIMARY KEY,
                  SUMLEV INTEGER,
                  REGION INTEGER,
                  DIVISION INTEGER,
                  STATE INTEGER,
                  NAME VARCHAR(100),
                  CENSUS2010POP INTEGER,
                  ESTIMATESBASE2010 INTEGER,
                  POPESTIMATE2010 INTEGER,
                  POPESTIMATE2011 INTEGER,
                  POPESTIMATE2012 INTEGER,
                  POPESTIMATE2013 INTEGER,
                  POPESTIMATE2014 INTEGER,
                  POPESTIMATE2015 INTEGER,
                  NPOPCHG_2010 INTEGER,
                  ...')= ( '0E0' ) [1 items] at line 120
    <- commit= ( 1 ) [1 items] at line 121

Loading the Data

So now we've got our table ready... Let's load our file! One way we could do this is to parse each row and create an INSERT statement for it. That's a bit tedious, however. Luckily, PostgreSQL has a psql utility called \copy that we can use for importing (aka "copying") a delimited file into a table. We briefly touched on this in part 3 of our CSV series. Note that if we try to use the more direct COPY function in PostgreSQL, we won't be allowed since that requires superuser privileges and access to the server file system. It doesn't matter, though, because \copy will do just what we need. In the DBD::Pg driver module, this functionality is made available to us through the COPY ... FROM STDIN statement. Here's what that looks like (again, we've truncated the fields we're showing you here with a ". . . ."):

# Copy from STDIN into the table
my $SQL = "COPY population (  
                . . . .
                ) FROM STDIN WITH DELIMITER ',' CSV HEADER";
my $sth = $dbh->do($SQL);  

We've written our SQL statement and named each of the fields to be found in the file to make it clear which fields exist and in which order (i.e., the "ID" field for our primary key does not exist in the file). We've also specified that the data is delimited with a comma, that it's a CSV file, and that it has a header row. We then call the do method on it. At this point, though, we're not committing anything because we haven't actually loaded any data yet. That part comes next as we read the file line-by-line:

# Read in the file contents one line at a time
open(my $infile, '<', 'NST-EST2015-alldata.csv') or die $!;  
while (my $line = <$infile>) {  
    $dbh->pg_putcopydata($line);
}
$dbh->pg_putcopyend();
$dbh->commit or die $DBI::errstr;
close($infile);

Even though we're pulling in each line of the file one-by-one, remember that the reason we're going this route is that it requires far less processing than parsing each row in the file and creating an INSERT statement for it and each of its field values.

In the above code, using standard Perl commands, we've opened the CSV file and are then reading it in one line at a time. The PostgreSQL-specific method we're using to load each line into the table is pg_putcopydata. The newline character at the end of each line tells COPY where each row ends and, because our COPY ... FROM STDIN statement specified the delimiter as a comma, it knows where each field begins and ends.

When we've finished reading in all the lines in the file, we use the pg_putcopyend method to tell it that we're done. We then commit our changes in the database and close the file.

Our trace log will show us each line from the file that was read and loaded (again, we've truncated here with ". . . ."):

    <- pg_putcopydata('10,0,0,0,United States,308745538,. . . .3.567445092,3.593568285
')= ( 1 ) [1 items] at line 222  

Voila! We've got data in our table as we can see from the Compose data browser:


Query the Data and Disconnect

Before we disconnect, let's run a simple SELECT statement on the table and print our results to STDOUT:

# Query from the table and display to STDOUT
my $SQL = "SELECT id, name, popestimate2015 FROM population WHERE name = ?";  
my $sth = $dbh->prepare($SQL);  
$sth -> execute('United States');
print "id\tname\tpopestimate2015\n";  
while (my @row = $sth->fetchrow_array) {  
   print "$row[0]\t$row[1]\t$row[2]\n";
}
$sth->finish;

As you'll notice, we've got some new methods here. First of all, though, we've written a SQL statement with a "?" as a placeholder for the value in our WHERE clause. There are a couple different placeholder options for a SELECT statement as well as binding parameters that will allow you to manipulate the data type being passed. We've chosen the simplest option here.
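As a hedged sketch of one of those alternatives (reusing the $dbh handle from above), bind_param lets you attach each value to its placeholder explicitly, optionally forcing one of the PostgreSQL types exported by DBD::Pg:

```perl
use DBD::Pg qw(:pg_types);

# Sketch only: bind the placeholder explicitly with a Pg type hint.
my $sth = $dbh->prepare(
    'SELECT id, name, popestimate2015 FROM population WHERE name = ?');
$sth->bind_param(1, 'United States', { pg_type => PG_VARCHAR });
$sth->execute;
```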

We're then using the prepare method on our statement followed by an execute on the statement handle $sth supplying "United States" as the value. Because SELECT statements return results (whereas the statements we ran above did not), we have to do this two-step process of preparing the SQL and then executing it using the statement handle. We then use the statement handle to return our results to us.

So that it's easy to see which values we're going to display from our query, we're printing out a header row to STDOUT. Next we're using a while loop to process each row that gets returned to the @row array from the statement handle fetchrow_array method. Each result will be printed to STDOUT. Finally, we're calling the finish method on the statement handle to let the database know we're done with it.

Our results to STDOUT look like this:

id      name    popestimate2015  
1       United States   321418820  

So, we've connected to PostgreSQL, created a table, loaded a data file in CSV format to the table, and successfully queried the data. Now, let's disconnect:

# Disconnect
$dbh -> disconnect;

Next Steps

For this article, we wanted to keep it simple and not get too much into the Perl weeds for less experienced readers, but if you're clever with Perl and you want to spend a little time with arrays and regular expressions, you could certainly automate the creation of the table from the header row and discern basic data types based on the row values. You'd then have a generalized script for any basic delimited file you wanted to load into PostgreSQL.
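To give a flavor of what that automation might look like, here's a hedged sketch (our own illustration, with deliberately simplified type rules) that builds a CREATE TABLE statement from a header row and a single sample data row:

```perl
use strict;
use warnings;

# Guess a PostgreSQL column type from one sample value. Real data
# would need more samples and more careful rules -- sketch only.
sub guess_type {
    my ($value) = @_;
    return 'INTEGER'          if $value =~ /\A-?\d+\z/;
    return 'DOUBLE PRECISION' if $value =~ /\A-?\d+\.\d+\z/;
    return 'VARCHAR(100)';
}

# Build a CREATE TABLE statement from the header row and a data row.
sub create_statement {
    my ($table, $header_line, $data_line) = @_;
    chomp($header_line, $data_line);
    my @names  = split /,/, $header_line;
    my @values = split /,/, $data_line, -1;
    my @cols   = ('ID SERIAL PRIMARY KEY');
    push @cols, "$names[$_] " . guess_type($values[$_]) for 0 .. $#names;
    return "CREATE TABLE $table (" . join(', ', @cols) . ')';
}

print create_statement('population', 'SUMLEV,NAME,RNETMIG2014',
                       '10,United States,3.59'), "\n";
# prints: CREATE TABLE population (ID SERIAL PRIMARY KEY, SUMLEV INTEGER, NAME VARCHAR(100), RNETMIG2014 DOUBLE PRECISION)
```

You could feed the generated statement straight to $dbh->do, with the \copy field list built from the same header row.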

For us, hardcoding some of the statements in our script got us the immediate results we wanted - the census population estimates data loaded into a table that we can query and run statistics on.

We hope that this article has demonstrated how easy it is to connect to PostgreSQL with Perl, load data from a file into a table, and then query that data. With these basic building blocks, you can develop a host of useful applications. You may even want to venture into using PL/Perl, which allows you to develop and run stored procedures on PostgreSQL with some built-in Perl functionality. Go to it!
