Accessing your Compose PostgreSQL data in a Jupyter Notebook has never been so easy. You'll have the chance to take a deep dive into a Titanic data set.
Notebooks are becoming an essential tool for data scientists, allowing them to write and edit dynamic documents that contain code, visualizations, and explanatory text on the web. Jupyter Notebook is a web application that enables all of that, along with sharing those documents between users. Jupyter is a language-agnostic version of the open-source IPython Notebook project, which developed the notebook concept for Python users.
In this article, we'll look at how to get started setting up a Jupyter Notebook and then show you how you can connect to Compose PostgreSQL within the notebook to start querying your database. If you want to take a look at the entire notebook, it's viewable in our Titanic Github repository.
The data set we'll be using is a partial list of passengers on the Titanic hosted on the Stanford University Computer Science department's CS109 website. Click titanic.csv to download the passenger list.
Log in to your PostgreSQL deployment using psql to create a table for the passengers. Substitute your own deployment's host and port in the connection string:
psql "sslmode=require host=aws-us-west-2-portal.2.dblayer.com port=16394 dbname=compose user=admin"
Before we can import the data, we'll create a table like the following and call it "passengers". We've added an id field to give the table a Primary Key:
CREATE TABLE passengers (
    id SERIAL NOT NULL PRIMARY KEY,
    survived INT,
    pclass INT,
    name VARCHAR(255),
    sex TEXT,
    age FLOAT8,
    siblings_spouses INT,
    parents_children INT,
    fare FLOAT8
);
Once the table is created, we'll import the CSV data into PostgreSQL using psql's \copy meta-command. Substitute /path/to/titanic.csv with the path of your CSV file.
\copy passengers (survived, pclass, name, sex, age, siblings_spouses, parents_children, fare) from '/path/to/titanic.csv' CSV HEADER;
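If you'd rather load the data from Python instead of psql, the standard library's csv module can parse the file into tuples ready for insertion with psycopg2's executemany. Here's a minimal sketch run on a small inline sample (the header and rows are illustrative, in the same shape as titanic.csv; the actual insert is left as a comment since it needs a live connection):

```python
import csv
import io

# A two-row inline sample in the same shape as titanic.csv (header + data rows).
sample = io.StringIO(
    "Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare\n"
    '0,3,"Mr. Owen Harris Braund",male,22,1,0,7.25\n'
    '1,1,"Mrs. John Bradley Cumings",female,38,1,0,71.2833\n'
)

reader = csv.reader(sample)
next(reader)  # skip the header row, as CSV HEADER does for \copy

# Convert each field to the type of its column in the passengers table.
rows = [
    (int(s), int(p), name, sex, float(age), int(ss), int(pc), float(fare))
    for s, p, name, sex, age, ss, pc, fare in reader
]

# With a psycopg2 connection `conn`, the insert would look like:
# cursor = conn.cursor()
# cursor.executemany(
#     "INSERT INTO passengers (survived, pclass, name, sex, age, "
#     "siblings_spouses, parents_children, fare) "
#     "VALUES (%s, %s, %s, %s, %s, %s, %s, %s)", rows)
# conn.commit()

print(rows[0][2])  # → Mr. Owen Harris Braund
```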
That will insert 887 passengers into your database that we can now work with. So, let's move on to downloading Jupyter and creating our first notebook.
There are three easy ways you can set up a Jupyter notebook:
- You can download the Anaconda data science tools package, which contains Jupyter and Python.
- Existing Python 3 users can install it with pip3 install jupyter.
- If you use Docker, then the easiest and cleanest setup is to run it from a Docker image.
If you have Docker already installed, you're good to go. Otherwise, install the Docker Community Edition and head off to the orientation guide to help you get started with the basics.
Jupyter's Docker hub has a number of versions available according to your needs. For example, there's the "jupyter/minimal-notebook", "jupyter/all-spark-notebook", "jupyter/datascience-notebook", and others. Each repository comes with commonly used libraries that are pre-installed inside a Docker container. For instance, the data science notebook comes with Python 3 and R as well as scikit libraries and the tidyverse R packages that include ggplot2, dplyr, purrr, and others.
The Docker image that we'll use is the "jupyter/minimal-notebook". This comes only with Python 3 installed; therefore, all other libraries that we'll use must be installed in the notebook.
Running Jupyter with Docker
In order to set up a Jupyter Notebook, it's advised to first create a folder that will store your notebooks. We've created a folder called "Notebooks". It doesn't matter where you place this folder, but you'll need to make note of its location because you'll use it when telling Docker and Jupyter where to save your notebooks.
Once you've set up a folder to store notebooks and have Docker set up, run the following command in your terminal:
docker run -it --rm -p 8888:8888 -v /Users/ajalger/Notebooks:/home/jovyan jupyter/minimal-notebook
What this command does is tell Docker to set up a new container for "jupyter/minimal-notebook", downloading the Docker image if needed. It publishes the container's port 8888 to port 8888 on your machine, uses the -it flag to run the container interactively, and uses --rm to remove the container once the Jupyter notebook has been stopped. The -v flag lets you mount a directory on your machine as a directory inside the Docker container. Here, we're mounting our "Notebooks" directory as /home/jovyan, the container's default user's home directory, which is where the Jupyter image looks for the notebooks we've created.
Once that command is executed in the terminal, it will run continuously until you stop it. That's because the notebook runs like a web server and it will automatically save your notebook each time you run a command or make a change. Let it run until you're finished making changes to your notebook.
Next, scroll down to where you see a URL like:
Copy/paste this URL into your browser when you connect for the first time, to login with a token: http://localhost:8888/?token=aca5a46914df21521a08d33f6e6eed8ecab19ef6bf9f6dc9
To secure your container, when you connect for the first time, you need to present a token that's embedded in that URL. Use that URL in your browser to access your Jupyter Notebook and you'll be greeted with this user interface:
Files is the tab we're currently on; it shows a list of notebooks in our "Notebooks" folder. To create a new notebook, select New in the right-hand corner and select Python 3 from the menu.
Once you've done that, a new tab in your browser should open and you'll have an untitled notebook like:
Below the menubar and toolbar is the main view of the Notebook. The Notebook is made up of cells which can contain code or Markdown and the first cell is displayed ready for you to enter some Python code into it. You can change the type of a cell from the cell menu, move the currently selected cell up and down, step through and execute cells and, as it's an editor, cut, paste and copy cells.
If we want our notebook to have a heading, we change the cell type to Markdown, enter "# The Titanic Report", and click the plus button to move on to a new cell.
Adding Compose PostgreSQL to Jupyter
Now that you know the basics of working with a Jupyter Notebook, let's add some code to interact with our Compose PostgreSQL deployment using the Titanic data we imported previously.
We'll be using the Python library Psycopg2 to connect to PostgreSQL. We'll also use the library pprint, a printing library that will make the results of our query to PostgreSQL easier to read. You can install these libraries into the notebook by using the !pip install command in a code cell. Create one and add:
!pip install psycopg2
!pip install pprint
Click the play button in the toolbar to run the cell. You should get something like this:
Each library that you want to add has to be installed with the !pip prefix, rather than the plain pip you'd use in your terminal. Once the libraries have been installed, we'll import them into the notebook in a code cell. Create another one and add:
import psycopg2
import pprint
Press the play button again to run the code. Now, we'll set up a connection to our Compose PostgreSQL database. For that, use the URI that comes with your PostgreSQL deployment to create a connection to the database using the Psycopg2 library. Create a new code cell then add:
conn = psycopg2.connect("postgres://[admin]:[password]@aws-us-west-2-portal.2.dblayer.com:16394/compose")
Run this again with the play button.
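The bracketed [admin] and [password] are placeholders for your deployment's credentials. The connection string follows the standard postgres://user:password@host:port/dbname layout; if you ever need its individual pieces, Python's built-in urllib.parse can split them out. A quick sketch, using an example URI with made-up credentials (psycopg2 itself accepts the full URI directly, so this step is optional):

```python
from urllib.parse import urlparse

# An example Compose-style URI with placeholder credentials.
uri = "postgres://admin:secret@aws-us-west-2-portal.2.dblayer.com:16394/compose"

parts = urlparse(uri)
print(parts.username)           # → admin
print(parts.hostname)           # → aws-us-west-2-portal.2.dblayer.com
print(parts.port)               # → 16394
print(parts.path.lstrip("/"))   # → compose (the database name)
```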
Querying PostgreSQL in Jupyter
To set up a query, we'll create a cursor with the connection's cursor method, then execute the SQL query using that cursor. We'll write a query that gives us the average age of passengers who survived the Titanic disaster, grouped by ticket class. We'll then use the cursor's fetchall method to gather the results and store them in a variable called rows. After that, remember to close the connection to the database. So, create a new code cell and add the following:
cursor = conn.cursor()
cursor.execute('SELECT avg(age), pclass FROM passengers WHERE survived = 1 GROUP BY pclass')
rows = cursor.fetchall()
conn.close()
Once you run that code with the play button, we can print the results with pprint. Create another code cell and add:
pprint.pprint(rows)
Running that with play will give us an array of results like:
[(35.9626470588235, 1), (26.1704597701149, 2), (21.4110924369748, 3)]
The other option is to loop through the rows and print each one using Python's print() function, indexing into each tuple for the average age and the class. Creating another code cell, we'd add:
for row in rows:
    print("avg age:", row[0], "class:", row[1])
And running that by pressing play again would give us:
avg age: 35.9626470588235 class: 1
avg age: 26.1704597701149 class: 2
avg age: 21.4110924369748 class: 3
These are just two ways you can get the results of the query. You could also import data science libraries like Pandas and Matplotlib to do more sophisticated data analysis and generate rich visualizations that are stored in the notebook, too. Notebooks open up countless possibilities for you to explore, and they're easy to share with others.
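As one small taste of going further, the same query pattern is easy to prototype anywhere using Python's built-in sqlite3 module as a stand-in for the live deployment. This sketch uses a tiny made-up passengers table, but the cursor/execute/fetchall code is near-identical to the psycopg2 version above:

```python
import sqlite3

# In-memory stand-in for the Compose deployment's passengers table,
# with a few made-up rows for illustration.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE passengers (survived INT, pclass INT, age REAL)")
db.executemany(
    "INSERT INTO passengers VALUES (?, ?, ?)",
    [(1, 1, 38.0), (1, 1, 35.0), (0, 3, 22.0), (1, 3, 26.0)],
)

# Same query shape as the psycopg2 example.
cursor = db.cursor()
cursor.execute(
    "SELECT avg(age), pclass FROM passengers "
    "WHERE survived = 1 GROUP BY pclass"
)
rows = cursor.fetchall()
db.close()

print(sorted(rows))  # → [(26.0, 3), (36.5, 1)]
```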
This was just an introduction to using Compose PostgreSQL with Jupyter Notebooks. And it's never been so simple. All we had to do is import the appropriate connection library and use our Compose PostgreSQL connection string to start querying right away. In another article on notebooks, we'll explore how to use some data science libraries within Jupyter Notebooks with some of the other Compose databases.
If you have any feedback about this or any other Compose article, drop the Compose Articles team a line at firstname.lastname@example.org. We're happy to hear from you.