Loading Google Analytics data to PostgreSQL using Python
Do you have Google Analytics data you'd like to load to a database for better reporting and analysis? We do. Here's a proof-of-concept script for accessing Google Analytics data and loading it to PostgreSQL using Python. Join us on the journey.
The Situation
At Compose we're evaluating a BI (business intelligence) dashboarding tool so that we can aggregate data from different sources for reporting and analysis. The particular one we're looking at connects to SQL databases but does not have API connectors to common analytics tools like Google Analytics. We happen to use Google Analytics for our website data so their recommendation is to use Google's Big Query for our Google Analytics data. A quick look nixed that option from the list - using Big Query for Google Analytics data requires you to be a Premium customer, which comes with a hefty price tag.
Instead, we're going to do this the tried-and-true way that people used before Big Query came along. We're going to connect to the Google Analytics Core Reporting API to pull data into our PostgreSQL database, where we already have our account and financial data. Early last year we wrote about how we replicate our account and finance data from MongoDB to PostgreSQL for more efficient querying and reporting. Having this data in a relational database now coupled with the data we will add from Google Analytics will give us a solid foundation for a data warehouse that any BI dashboarding tool can utilize.
Google Analytics Core Reporting API
The Google Analytics Core Reporting API is now on version 3 and has extensive functionality. The main limitation with the Core Reporting API is that it allows at most 7 dimensions and 10 metrics per query. If you instead use Big Query, you pretty much get the raw data with as many dimensions as apply to any metric and as many metrics as suit the query. We don't expect to have too much difficulty working around this limitation, but we wanted to point it out.
Let's also take into consideration the current quotas imposed by Google for API usage so we know what to expect:
- 50,000 requests per project per day
- 10 queries per second (QPS) per IP
- 10,000 requests per view (profile) per day - specific to the analytics APIs
- 10 concurrent requests per view (profile) - specific to the analytics APIs
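If a script bumps into the per-IP QPS quota, the client library raises an HttpError. Here's a minimal retry-with-backoff sketch; the retry_on_rate_limit helper is our own illustration (not part of the client library), and the 403/429 statuses it checks are the codes we'd expect for quota errors in your project, so adjust as needed:

import random
import time

from googleapiclient.errors import HttpError


def retry_on_rate_limit(request, max_retries=5):
    """Execute an API request, backing off when we appear to hit a quota error."""
    for attempt in range(max_retries):
        try:
            return request.execute()
        except HttpError as error:
            if error.resp.status in (403, 429) and attempt < max_retries - 1:
                # Exponential backoff with a little jitter before retrying
                time.sleep((2 ** attempt) + random.random())
            else:
                raise

With a helper like this, a query built later in this article could be run as retry_on_rate_limit(get_api_traffic_query(service)) instead of calling execute() directly.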
What we discovered by embarking down this path is that the documentation for the Google Client API and the client libraries that have been developed to access it is frustratingly out-of-date. So, we'll share with you what we learned through trial-and-error and much perusing of online forums. There are several client libraries available for the Core Reporting API (as well as other Google apps that can be accessed via API) that have stood the test of time. There are also a handful of other client libraries that are in alpha (or beta) phase. We opted to use Python since that client was under General Availability and it is a fairly simple yet flexible language.
Authorizing with Google
The way this works is that your client library (Python for us) will actually connect and authorize via the Google Client API, which is the gateway to interacting with all of the various Google app APIs. Then through the Google Client API, you access the Core Reporting API, specifically version 3, which is called the "analytics" API in shorthand as you'll see in the client code.
According to the API Client Library documentation for Python, there are two methods for authorizing your connection:
- Simple API Key
- OAuth 2 using JSON or P12 Key
However, what we learned is that the Google Client API requires access tokens nowadays, so the simple API key method just won't work - don't bother with it. OK then... OAuth 2 it is! There are several options for using OAuth 2 in the Google Client API:
- Web server - used for accessing a specific user's data offline
- Client-side - used for browser-based access of a specific user's data
- Installed apps - used for app packages where a user will authenticate online
- Service accounts - used for offline access of data server-to-server
The above options all make a lot of sense when you think about authorization for accessing a user's calendar, file drive, or Google+ contacts. For Google Analytics, they make somewhat less sense, although a specific user can have custom segments or dashboards they've created. For our purposes, we want the data in Google Analytics that all users with "Read & Analyze" permissions would have access to. We decided a service account was the best option for our situation.
To set up a service account, you need to create one in the Google Developers Console:
- Initiate a "project", which will give you a project ID and project number
- Go to the APIs section and click on "credentials" in the left nav
- Click "Create credentials" and choose "Service account key"
- Choose to create a "New service account" and give it a name... You'll see an email address called "Service account ID" that you'll need for the client authorization so make a note of it
- Choose JSON if you're using Python (other languages might work better with P12) and save the key file to a location where your script can access it
- You'll have the option of indicating that the account will be used for "domain-wide delegation"... the existing Python client sample expects this, but it's not necessary and, again, doesn't make a lot of sense in Google Analytics; if you choose it, you'll need to download a separate JSON client secrets file - we didn't do this
- You'll now have a service account set up, but before you leave the Developers Console, go to the "OAuth consent screen" and give your client application a name... Otherwise you'll get an application name error when you try to authorize your connection
- Finally, in your Google Analytics account, go to the "admin" section and add a user with "Read & Analyze" permissions using the email generated by the service account... Otherwise you'll get a permissions error
Now we're ready to connect to the Google API.
Connecting to Google
For some background, we're using Python 3.5.1 on Windows for this proof-of-concept. To get started we'll need to install the Google Client API module for Python:
> python -m pip install google-api-python-client
The Core Reporting API documentation references the "HelloAnalytics.py" example, but we went straight to GitHub for the core_reporting_v3_reference.py sample. This sample has a plethora of variable output examples, besides the connection and query pieces that we'll need. Note, though, that some of the output examples use the iteritems() method, which was removed in Python 3. If you're using Python 3 or higher, simply change iteritems() to items() - iteration through a dictionary's key-value pairs is now handled by items().
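For example, the change looks like this (the totals dictionary here is made up for illustration, not a variable from the sample):

totals = {'ga:users': '1250', 'ga:sessions': '1500'}

# Python 2 style used in the sample:
# for metric, total in totals.iteritems():
#     print(metric, total)

# Python 3 style:
for metric, total in totals.items():
    print(metric, total)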
Be aware, also, as we mentioned above, this sample is expecting the "domain-wide delegation" with Client ID key and secret. Since we didn't do that, we made some adjustments to the connection and authorization sections of the code. We're importing and using some different modules for connecting and authorizing than what you'll see in the sample client code:
from googleapiclient.errors import HttpError
from googleapiclient import sample_tools
from oauth2client.service_account import ServiceAccountCredentials
from httplib2 import Http
from apiclient.discovery import build
# Authenticate and create the service for the Core Reporting API
credentials = ServiceAccountCredentials.from_json_keyfile_name(
'Compose-GA-xxxxxxx.json', ['https://www.googleapis.com/auth/analytics.readonly'])
http_auth = credentials.authorize(Http())
service = build('analytics', 'v3', http=http_auth)
Here we're using the from_json_keyfile_name method of ServiceAccountCredentials with the JSON key file we downloaded, named "Compose-GA-xxxxxxx.json" and saved to the same directory where we're running our script. We're also specifying the "https://www.googleapis.com/auth/analytics.readonly" scope in our scopes list for the service. We're then authorizing via HTTP and creating our service using build for the "analytics" API (the shorthand name for the Core Reporting API mentioned previously) for "v3".
Now that we've got our service connection created, let's build our query for Google Analytics.
Querying for Google Analytics Data
Building a query for the API is a little bit different than what you may be familiar with if you've only used the Google Analytics UI. We suggest getting familiar with the API parameters by using the Google Query Explorer to test your queries and results before you try to run them programmatically. The query parameter field values you add in the Query Explorer can help you construct the query in your script.
Once you've retrieved results you like from the Query Explorer, you'll see the URL with encoded query parameters at the bottom of that page. That's what the API call being made looks like. If you follow that link, you'll see what the returned results look like from the API.
We created a function to return aggregated traffic data for a month (we recommend you try queries that retrieve only a single row until you are satisfied the data is being processed to the database the way you expect). Here's our query function:
def get_api_traffic_query(service):
return service.data().ga().get(
ids='ga:xxxxxxx',
start_date='2014-01-01',
end_date='2014-01-31',
metrics='ga:users,ga:sessions',
dimensions='ga:yearMonth',
# sort='-ga:yearMonth',
# filters='ga:pagePath=~signup',
segment='sessions::condition::ga:hostname!~mongo|app|help|docs|staging|googleweblight',
start_index='1',
max_results='25')
For our query, we're calling get() for the Google Analytics data via the service connection we created above. We're referencing the profile ID (now called the "view ID") from the Google Analytics view we want to access. You can find it on the View properties page in the "admin" section of your Google Analytics account.
We then have the start and end date for the period we are interested in, the metrics we want to retrieve (users and sessions), the dimensions to apply (in this case only yearMonth to return a single row based on our dates), and then also a dynamic segment that excludes some subdomains and alternative hostnames that we are not interested in for this query. We don't need to use sort or filters for this one, but we've provided some examples of how each can be used.
Finally, we want to start with the first returned row on page 1 of our results (that's the start_index) and we're capping the results at a max of 25. The Core Reporting API will provide up to 10,000 results per page, so you can set max_results to 10,000, but if you have more than that, you'll need to construct a loop in your script to cycle through each page using the next page's start_index. For example, if you set max_results at 10,000 and you have more than 10,000 results, your start_index for page 2 will be 10,001. A rough sketch of such a loop follows.
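Here's what that paging loop could look like. It assumes a hypothetical variant of our query function, get_api_traffic_query_page, that accepts start_index and max_results as arguments, and it relies on the totalResults field that the v3 API includes in each response:

def get_all_rows(service, page_size=10000):
    """Page through the Core Reporting API results, page_size rows at a time."""
    all_rows = []
    start_index = 1
    while True:
        # get_api_traffic_query_page is a hypothetical variant of our query
        # function that takes start_index and max_results as parameters
        results = get_api_traffic_query_page(
            service, start_index=start_index, max_results=page_size).execute()
        rows = results.get('rows', [])
        all_rows.extend(rows)
        # Stop once we've collected everything the API says is available
        if not rows or len(all_rows) >= results.get('totalResults', 0):
            break
        start_index += page_size
    return all_rows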
Now we've got our service connection and our query. Let's move on to getting PostgreSQL set up.
Getting PostgreSQL Ready
We assume you already have a PostgreSQL deployment on Compose. If you don't, you can add a deployment from the Deployments page in the Compose management console. If you don't yet have an account, sign up to get started using Compose.
Based on our query above, we're going to retrieve three fields: yearMonth, users, and sessions. We'll need a table to store these, but first let's create a database using the "Create database" link from the Compose management console data browser for your PostgreSQL deployment. We created a database called "ga_data":
Then, using whatever client you prefer, create a table for the fields you expect to be returned. You can create the table from within your Python script if that makes sense for your situation (a sketch of that approach follows the SQL statement below). We're going to just use the pgAdmin GUI and run the CREATE TABLE statement in our new database so our Google Analytics data has a place to go:
CREATE TABLE traffic (
yearMonth INT PRIMARY KEY,
users INT,
sessions INT
)
;
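If you'd rather create the table from the script itself, a minimal psycopg2 sketch might look like the following - it assumes the same connection details we cover in the next section:

import psycopg2

conn = psycopg2.connect("host='aws-us-east-1-portal.8.dblayer.com' port=10221 "
                        "dbname='ga_data' user='admin' password='password'")
cursor = conn.cursor()

# Create the table only if it doesn't already exist
cursor.execute("""
    CREATE TABLE IF NOT EXISTS traffic (
        yearMonth INT PRIMARY KEY,
        users INT,
        sessions INT
    )
""")

conn.commit()
cursor.close()
conn.close()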
Now that our data has a place to go, how do we get it there?
Psycopg2
Psycopg2 is the current PostgreSQL driver for Python and we'll need it for connecting to our Postgres database and loading the data from Google Analytics. Install the module:
> python -m pip install psycopg2
If you're on Windows and running Python 3 or above, you may have some difficulty installing the module. In that case, get the Windows binaries and install the .whl file using pip install instead.
Once the module is installed, we'll add a couple import statements to the top of our script:
import psycopg2
import sys
Now let's connect to Postgres.
Connecting to PostgreSQL
First, we'll need to get our connection information from the Overview page in the Compose management console (click "show/change" to reveal the admin password):
Now, let's plug that in to our script:
# Define the connection string and connect
conn_string = "host='aws-us-east-1-portal.8.dblayer.com' port=10221 dbname='ga_data' user='admin' password='password'"
conn = psycopg2.connect(conn_string)
# Open a cursor
cursor = conn.cursor()
# Some other stuff will happen here
# Close the cursor and the connection
cursor.close()
conn.close()
Above, we're creating a connection object using psycopg2 and passing in the connection string which contains the connection information. We're then opening a cursor which we can use to execute SQL statements. That's where we'll load our data from Google Analytics (which we'll get into in the next section). After we're done, we'll close the cursor and close the connection.
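As a side note, psycopg2 2.5 and up also let you use the connection and cursor as context managers, which handles the commit and the cursor cleanup for you. A minimal sketch of that pattern, using the same conn_string as above:

conn = psycopg2.connect(conn_string)

# The connection context manager commits on success and rolls back on error;
# the cursor context manager closes the cursor when the block exits
with conn:
    with conn.cursor() as cursor:
        cursor.execute("SELECT 1")
        print(cursor.fetchone())

# The connection itself still needs to be closed explicitly
conn.close()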
Inserting Google Analytics Data
Alright, this is the moment of truth - pulling data from one application and loading it into another!
In the code snippet from the section above, there was a comment stating "Some other stuff will happen here"... Well, here's the "other stuff":
# Run the query function using the API service
traffic_results = get_api_traffic_query(service).execute()
# Insert each row of the result set
if traffic_results.get('rows', []):
for row in traffic_results.get('rows'):
cursor.execute("""INSERT INTO traffic (yearMonth, users, sessions)
VALUES(%s, %s, %s)""", [row[0], row[1], row[2]])
else:
print('No Rows Found')
# Commit changes
conn.commit()
What we're doing here is executing our query function and saving the results to "traffic_results". We're then looping through each returned row in "traffic_results" and using our cursor to insert the values of the row. Since we only have one row because of the way we constructed our query, it's pretty straightforward. For inserting a lot of rows, you'll want to explore bulk insert options.
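For example, executemany() will batch the parameter lists for you, and if your psycopg2 is 2.7 or newer, psycopg2.extras.execute_values is usually faster for multi-row inserts. A sketch of both (the sample row values in the comment are made up):

# rows is the list returned by the Core Reporting API, e.g.
# [['201401', '1250', '1500'], ['201402', '1320', '1610']]
rows = traffic_results.get('rows', [])

# Simple batching with executemany
cursor.executemany(
    "INSERT INTO traffic (yearMonth, users, sessions) VALUES (%s, %s, %s)",
    rows)

# With psycopg2 2.7+, execute_values builds one multi-row INSERT instead:
# from psycopg2.extras import execute_values
# execute_values(cursor,
#                "INSERT INTO traffic (yearMonth, users, sessions) VALUES %s",
#                rows)

conn.commit()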
Note that if you have more than one dimension you may want to test for the order in which the Core Reporting API returns the fields to you. We only have one dimension (yearMonth) and it is returned as the first field and then our metrics in the order we listed them in our query. If you have more than one dimension, though, you'll need to test which order they'll come in.
Finally, we're committing our changes to the database.
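On the column-ordering note above, one way to avoid guessing is to read the columnHeaders list that the v3 API returns alongside the rows - each header carries the ga: name of its column. A sketch, assuming that response structure:

# Map each column name to its position in the returned rows
headers = traffic_results.get('columnHeaders', [])
column_index = {header['name']: i for i, header in enumerate(headers)}

for row in traffic_results.get('rows', []):
    cursor.execute("""INSERT INTO traffic (yearMonth, users, sessions)
                      VALUES (%s, %s, %s)""",
                   [row[column_index['ga:yearMonth']],
                    row[column_index['ga:users']],
                    row[column_index['ga:sessions']]])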
Putting It All Together
Now that we've walked through all the pieces, here's what the whole script looks like:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import psycopg2
import sys
from googleapiclient.errors import HttpError
from googleapiclient import sample_tools
from oauth2client.service_account import ServiceAccountCredentials
from httplib2 import Http
from apiclient.discovery import build
#Main
def main():
# Authenticate and create the service for the Core Reporting API
credentials = ServiceAccountCredentials.from_json_keyfile_name(
'Compose-GA-xxxxxx.json', ['https://www.googleapis.com/auth/analytics.readonly'])
http_auth = credentials.authorize(Http())
service = build('analytics', 'v3', http=http_auth)
# Define the connection string and connect
conn_string = "host='aws-us-east-1-portal.8.dblayer.com' port=10221 dbname='ga_data' user='admin' password='password'"
conn = psycopg2.connect(conn_string)
# Open a cursor
cursor = conn.cursor()
# Run the query function using the API service
traffic_results = get_api_traffic_query(service).execute()
# Insert each row of the result set
if traffic_results.get('rows', []):
for row in traffic_results.get('rows'):
#print(row)
cursor.execute("""INSERT INTO traffic (yearMonth, users, sessions)
VALUES(%s, %s, %s)""", [row[0], row[1], row[2]])
else:
print('No Rows Found')
# Commit changes
conn.commit()
# Select and retrieve results
#cursor.execute("SELECT * FROM traffic")
#records = cursor.fetchall()
#print(records)
# Close the cursor and the connection
cursor.close()
conn.close()
# Query function
def get_api_traffic_query(service):
return service.data().ga().get(
ids='ga:xxxxxx',
start_date='2014-01-01',
end_date='2014-01-31',
metrics='ga:users,ga:sessions',
dimensions='ga:yearMonth',
# sort='-ga:yearMonth',
# filters='ga:pagePath=~signup',
segment='sessions::condition::ga:hostname!~mongo|app|help|docs|staging|googleweblight',
start_index='1',
max_results='25')
if __name__ == '__main__':
main()
Note that we did not break up pieces of the code into additional functions (as you normally would do in Python) in order to keep it simple for following end-to-end in this article.
To run the script from the command line (we named our script "googledata_test.py"):
> python googledata_test.py
No errors! Let's check our data!
Checking the Data
After we run the script, we can look in the Compose management console at the Table page for the "traffic" table and we see our results have been loaded for the month of January 2014:
You could also uncomment the following bit in your script after the conn.commit() to return the row in the command line interface:
# Select and retrieve results
cursor.execute("SELECT * FROM traffic")
records = cursor.fetchall()
print(records)
Note that fetchall() will return all the rows in a list - each row is one list element if you have more than one row.
Wrapping Up
We hope this proof-of-concept demonstrates how you can use Python to pull data from Google Analytics using the Core Reporting API and then load that data into PostgreSQL to better facilitate reporting and analytics. For real-world data, you'll need to adapt the script for pagination (if you have more than 10,000 results for one query), for generating queries over incremental time ranges (to pull the data each day or month), for optimizing bulk inserts, and for scheduling the script to run automatically as needed for your situation.
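As a starting point for the incremental time ranges, here's a small sketch that generates (start_date, end_date) pairs for each month so the query function could be called in a loop - the monthly_ranges helper name is ours:

import calendar
from datetime import date


def monthly_ranges(start_year, start_month, end_year, end_month):
    """Yield (start_date, end_date) strings for each month in the range."""
    year, month = start_year, start_month
    while (year, month) <= (end_year, end_month):
        last_day = calendar.monthrange(year, month)[1]
        yield (date(year, month, 1).isoformat(),
               date(year, month, last_day).isoformat())
        if month == 12:
            year, month = year + 1, 1
        else:
            month += 1


# Example: every month of 2014
for start_date, end_date in monthly_ranges(2014, 1, 2014, 12):
    print(start_date, end_date)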
Now, go get your Google Analytics data!