Visualizing Data with Jupyter Notebooks, PixieDust, and Compose MongoDB

Connecting to Compose MongoDB and creating rich presentations for your data inside a Jupyter notebook is made easier with PixieDust. We'll show you how to get started with PixieDust without much code involved to give you more insights into the Titanic data set.

Notebooks aren't just for Python coders and data scientists. In this article, we'll introduce you to PixieDust and use it to open up Jupyter notebooks to Node.js, MongoDB, and rich visualizations.

PixieDust is an extension to the Jupyter Notebook which adds a wide range of functionality to easily create customized visualizations from your data sets with little code involved. For this example, we're going to look at two elements of that: PixieDust-Node and PixieDust's display call, with data from the Titanic.

Let's start first with importing the data into MongoDB ...

Importing Titanic into MongoDB

The same Titanic data set that we covered in our previous article Getting Started with Compose PostgreSQL and Jupyter Notebooks will be used for our example. It's a partial list of passengers who survived or perished on the Titanic and is hosted on the Stanford University Computer Science department's CS109 website. To download the passenger list, click titanic.csv on that website.

We'll assume you have a Compose MongoDB deployment set up. Make sure you also have MongoDB installed locally, since we'll be using the mongoimport command that comes with the MongoDB distribution. If you're running macOS, you can install MongoDB using Homebrew from your terminal with brew install mongodb. If you're not running macOS, select your operating system and follow the installation instructions provided by MongoDB.

From the terminal, use the mongoimport command to import the Titanic CSV file into your Compose MongoDB deployment.

mongoimport --host aws-us-west-1-portal.26.dblayer.com --port 33333 -u admin -p mypass --ssl --sslAllowInvalidCertificates --db titanic -c passengers --type csv --file titanic.csv --headerline

Replace the host, port, username, password, and path to your CSV file in the above example with your own. The --db flag creates a new database called "titanic", and the -c flag names the collection "passengers". The --type csv and --headerline flags tell mongoimport to read the file as CSV and to use its first line as the field names. Once you've executed this command, you should have about 891 documents inserted into the database, which should look like:

{
    "_id" : ObjectId("59c01027cbbbfab7d828f2f1"),
    "PassengerId" : 1,
    "Survived" : 0,
    "Pclass" : 3,
    "Name" : "Braund, Mr. Owen Harris",
    "Sex" : "male",
    "Age" : 22,
    "SibSp" : 1,
    "Parch" : 0,
    "Ticket" : "A/5 21171",
    "Fare" : 7.25,
    "Cabin" : "",
    "Embarked" : "S"
}
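
To make the effect of --headerline more concrete, here is a small plain-JavaScript sketch that treats the first CSV line as field names and turns each following line into a document. It is only an illustration and not how mongoimport is implemented: the helpers parseCsvLine and csvToDocuments are hypothetical names, the sample line is reconstructed from the document shown above, and the numeric conversion is a simplification of mongoimport's type handling.

```javascript
// Split one CSV line into fields, honoring double-quoted fields that
// may contain commas (such as "Braund, Mr. Owen Harris").
function parseCsvLine(line) {
  const fields = [];
  let current = "";
  let inQuotes = false;
  for (let i = 0; i < line.length; i++) {
    const ch = line[i];
    if (inQuotes) {
      if (ch === '"' && line[i + 1] === '"') {
        current += '"'; // escaped quote inside a quoted field
        i++;
      } else if (ch === '"') {
        inQuotes = false;
      } else {
        current += ch;
      }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === ",") {
      fields.push(current);
      current = "";
    } else {
      current += ch;
    }
  }
  fields.push(current);
  return fields;
}

// Use the first line as field names and build one object per data line.
// Values that look numeric become numbers; everything else stays a string.
function csvToDocuments(csvText) {
  const lines = csvText.split(/\r?\n/).filter((l) => l.length > 0);
  const header = parseCsvLine(lines[0]);
  return lines.slice(1).map((line) => {
    const values = parseCsvLine(line);
    const doc = {};
    header.forEach((name, i) => {
      const raw = values[i] === undefined ? "" : values[i];
      doc[name] = raw !== "" && !isNaN(Number(raw)) ? Number(raw) : raw;
    });
    return doc;
  });
}

// Sample input: the header plus one line matching the document above.
const sampleCsv = [
  "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked",
  '1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S',
].join("\n");

const importedDocs = csvToDocuments(sampleCsv);
console.log(importedDocs[0]);
```

Note that the empty Cabin column stays an empty string, which matches the "Cabin" : "" field in the imported document.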

Getting started with Jupyter and PixieDust

In the previous article, we installed and used a Jupyter notebook using a Docker image. Now, we'll show you how to run Jupyter through the Anaconda data science platform. Jupyter, as well as popular Python data science packages like Matplotlib, NumPy, Pandas, and others, comes pre-installed with Anaconda, which makes it easy to set up without having to install individual packages.

To download Anaconda, navigate to the Anaconda distribution installer and select the Python 2.7 version. We'll need to run our Jupyter notebook with Python 2.7 because PixieDust-Node requires it. It might take a while to download the installer, but once it's finished, just click through the installation process to get Anaconda set up. After Anaconda is set up, click on the Anaconda Navigator icon in the folder where your applications are held (in macOS it will be in the Applications folder). That will open up Anaconda which will look like this:

There, you will see several applications. Jupyter is installed for us, so go ahead and click Launch under Jupyter. Once you've clicked that, the browser will automatically open a directory listing served at http://localhost:8888/. To create a new notebook, click on the New button in the upper-right-hand corner, which will give you several notebook types to choose from:

Select Python 2 and a new browser tab will open up with a fresh notebook running Python 2. To install PixieDust, write and run the following in a notebook cell:

!pip install pixiedust

This will install PixieDust along with all of its dependencies. Now, in another cell, import it by running:

import pixiedust  

And now we're ready to set up PixieDust-Node ...

Installing and Importing Node and PixieDust-Node

With a Python 2 notebook opened up in the browser, we can install and import PixieDust-Node. A prerequisite to running PixieDust-Node is to have Node.js installed since PixieDust-Node uses the Node.js runtime. Since we're using Anaconda, we'll have to install Node.js within our Anaconda environment. To do that, go back to the Anaconda Navigator and select Environments from the left-hand menu. A new window will appear where you can select Not installed from the drop-down menu to search for packages that aren't installed yet, like:

Then in the search box, type in "node". This will show a Node.js package. Click on the package, then click the Apply button at the bottom of the navigator. Once that's clicked, a pop-up window will appear confirming that you want to install Node.js. Click the Apply button in this window, too.

When that's done, relaunch Jupyter and then go back to your notebook. Let's now install PixieDust-Node. Add another cell in your notebook and write:

!pip install pixiedust-node

Once it's installed, we just have to import it with:

import pixiedust_node

Now we're all set up. With PixieDust-Node installed and imported into the notebook, we now have access to the Node.js runtime and its package ecosystem, npm.

Connecting to Compose MongoDB

To install any Node.js package from npm, use the following npm.install command with a package name:

npm.install("<library_name>")  

We'll install the MongoDB Node.js driver to connect to Compose MongoDB. In a new cell, we'll install the package with:

npm.install("mongodb")  

Once the driver has been installed, create a new cell so that we can write some code to connect to MongoDB. For an example of how to set up a connection to Compose MongoDB using Node.js, take a look at the Compose Grand Tour database connection guide for Node and MongoDB. That connection guide provides us with a basic example of how to connect to Compose MongoDB, which we'll then run in the cell:

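%%node
// Load the MongoClient class from the mongodb driver installed above
const MongoClient = require('mongodb').MongoClient;

// Replace the credentials and hosts with those of your own deployment
let connectionString = "mongodb://<admin>:<mypass>@aws-us-west-1-portal.26.dblayer.com:33333,aws-us-west-1-portal.27.dblayer.com:33333/admin";

// Connect over SSL without validating the server certificate
let options = {
  ssl: true,
  sslValidate: false,
};

// Holds the "titanic" database handle so later cells can reuse it
let mongodb;

MongoClient.connect(connectionString, options, (err, db) => {
    if (err) throw err;
    mongodb = db.db("titanic");
});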

First, we'll require the MongoClient class from the mongodb npm package. Next, substitute the connectionString with your Compose MongoDB deployment connection string and user credentials. Keep the values in the options variable as they are. Create a variable called mongodb, which will hold the database handle so that we can reuse the connection throughout the notebook. Then, we'll set up a connection using the MongoClient.connect method and pass in the connection string and SSL options. Within the callback function, we'll also assign the connection to the "titanic" database to the mongodb variable we just created. That way we have access to the "titanic" database and the "passengers" collection that contains our data set.
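
One detail worth keeping in mind when substituting your own credentials: characters such as "@", ":" or "/" in a username or password have to be percent-encoded inside a MongoDB connection string. The following is a minimal sketch of one way to assemble the string from its parts. The helper name buildConnectionString and its parameter names are hypothetical, and the hosts are the placeholder values used in this article.

```javascript
// Hypothetical helper: assemble a mongodb:// connection string from parts,
// percent-encoding the credentials so special characters are safe.
function buildConnectionString({ user, password, hosts, defaultDb }) {
  const credentials =
    encodeURIComponent(user) + ":" + encodeURIComponent(password);
  return "mongodb://" + credentials + "@" + hosts.join(",") + "/" + defaultDb;
}

const exampleUri = buildConnectionString({
  user: "admin",
  password: "my@pass", // an unencoded "@" would be read as the end of the credentials
  hosts: [
    "aws-us-west-1-portal.26.dblayer.com:33333",
    "aws-us-west-1-portal.27.dblayer.com:33333",
  ],
  defaultDb: "admin",
});

console.log(exampleUri);
```

In the notebook, a helper like this would go in a %%node cell before the call to MongoClient.connect, and its result would be used as the connection string.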

Once that's set up, run the cell and we'll have a connection to the database.

Let's now connect to the "passengers" collection. To do that, we'll create another variable called col inside a notebook cell and run that so we can query the "passengers" collection.

%%node
let col = mongodb.collection("passengers");  

Querying MongoDB with PixieDust

With access to the "passengers" collection, we can start using the MongoDB driver's collection methods to query the collection and use PixieDust to create the visualizations.

Let's write a basic query and run it in a new cell to get the passengers who survived and were at least 50 years old:

%%node
col.find({"Survived": 1, "Age":{ $gte: 50}}).toArray((err, results) => {  
    if (err) throw err;
    display(results)
});

We'll send the results of the query to PixieDust using its display method. This method gives us access to PixieDust's display API and will construct a table from the results. We can then use that table to create a customized chart from the table dashboard.
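
To see which documents the filter in the query above matches, here is a plain-JavaScript sketch that applies the same two conditions to a handful of in-memory objects. The sample passengers are made up for illustration, and the function name survivedAtLeast50 is hypothetical; the real filtering happens inside MongoDB.

```javascript
// Sample documents shaped like the "passengers" collection (values are made up).
const samplePassengers = [
  { Name: "Passenger A", Sex: "female", Age: 58, Survived: 1 },
  { Name: "Passenger B", Sex: "male", Age: 54, Survived: 0 },
  { Name: "Passenger C", Sex: "male", Age: 50, Survived: 1 },
  { Name: "Passenger D", Sex: "female", Age: 35, Survived: 1 },
  { Name: "Passenger E", Sex: "male", Age: "", Survived: 1 },
];

// Plain-JavaScript counterpart of {"Survived": 1, "Age": {$gte: 50}}:
// both conditions must hold, and $gte includes the boundary value 50.
// The typeof check reflects that a numeric $gte does not match a string
// such as "" (an age that was blank in the CSV file).
function survivedAtLeast50(doc) {
  return doc.Survived === 1 && typeof doc.Age === "number" && doc.Age >= 50;
}

const matches = samplePassengers.filter(survivedAtLeast50);
console.log(matches.map((doc) => doc.Name));
```

Only passengers A and C pass both conditions: B did not survive, D is under 50, and E has no recorded age.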

From the PixieDust dashboard, we have the option to select the type of chart we want to display in the cell using a drop-down menu. The buttons to the left and the right of that let you refresh your table's data and download a CSV file of the results.

When selecting the bar chart option from the drop-down menu, a pop-up window will appear. On the left, we're presented with a fields window containing data fields and their types gathered by PixieDust from the results. On the right, we have keys and values windows into which we can drag and drop those fields. To create a chart from the query, drag the "Sex" field to the keys window and the "Survived" field to the values window.

Click OK at the bottom of the window. This will replace the table that was produced in the cell with a bar chart that looks similar to this:

Using the bar chart options on the left side, you can adjust the orientation and size of the chart, and show the various age differences in more detail. We changed bar type to "stacked", the orientation to "horizontal", and clustered the values by "Age":

For a more complex example, we can use MongoDB's aggregation pipeline. Let's create a query that will group the number of males and females who survived or perished by class. Your query might look like:

%%node
col.aggregate([  
    {$project: 
        {
        _id: 0, Pclass: 1, Sex: 1, Survived: 1,
        first_class_alive: {$cond: [{$and:[{$eq: ["$Survived", 1]}, {$eq: ["$Pclass", 1]} ] }, 1, 0]},
        first_class_perished: {$cond: [{$and:[{$eq: ["$Survived", 0]}, {$eq: ["$Pclass", 1]} ] }, 1, 0]},
        second_class_alive: {$cond: [{$and:[{$eq: ["$Survived", 1]}, {$eq: ["$Pclass", 2]} ] }, 1, 0]},
        second_class_perished: {$cond: [{$and:[{$eq: ["$Survived", 0]}, {$eq: ["$Pclass", 2]} ] }, 1, 0]},
        third_class_alive: {$cond: [{$and:[{$eq: ["$Survived", 1]}, {$eq: ["$Pclass", 3]} ] }, 1, 0]},
        third_class_perished: {$cond: [{$and:[{$eq: ["$Survived", 0]}, {$eq: ["$Pclass", 3]} ] }, 1, 0]},
        }
    },
    {$group: 
        {
        _id: "$Sex",
        count: {$sum: 1},
        first_class_alive: {$sum: "$first_class_alive"},
        first_class_perished: {$sum: "$first_class_perished"},
        second_class_alive: {$sum: "$second_class_alive"},
        second_class_perished: {$sum: "$second_class_perished"},
        third_class_alive: {$sum: "$third_class_alive"},
        third_class_perished: {$sum: "$third_class_perished"},
        total_survived: {$sum: {$cond: [{$eq: ["$Survived", 1]}, 1, 0]}},
        total_perished: {$sum: {$cond: [{$eq: ["$Survived", 0]}, 1, 0]}},
        }
    }
], (err, results) =>{
    if (err) throw err;
    display(results)
});

Again, we'll feed the results to PixieDust using display. The same results in the mongo shell would look like:

{ "_id" : "female", "count" : 314, "first_class_alive" : 91, "first_class_perished" : 3, "second_class_alive" : 70, "second_class_perished" : 6, "third_class_alive" : 72, "third_class_perished" : 72, "total_survived" : 233, "total_perished" : 81 }
{ "_id" : "male", "count" : 577, "first_class_alive" : 45, "first_class_perished" : 77, "second_class_alive" : 17, "second_class_perished" : 91, "third_class_alive" : 47, "third_class_perished" : 300, "total_survived" : 109, "total_perished" : 468 }
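
If the pipeline looks dense, it may help to see the same bookkeeping in plain JavaScript. The sketch below mirrors what the $project flags and the $group sums compute, using a few made-up documents rather than the real collection. The names classNames, emptyGroup, and summarizeBySex are hypothetical, and this is only an in-memory illustration, not a replacement for the aggregation.

```javascript
// Made-up sample documents with only the fields the pipeline reads.
const samplePassengerDocs = [
  { Sex: "female", Pclass: 1, Survived: 1 },
  { Sex: "female", Pclass: 3, Survived: 0 },
  { Sex: "male", Pclass: 1, Survived: 0 },
  { Sex: "male", Pclass: 3, Survived: 1 },
  { Sex: "male", Pclass: 3, Survived: 0 },
];

const classNames = { 1: "first", 2: "second", 3: "third" };

// Start every group with all counters at zero, in the pipeline's field order.
function emptyGroup(sex) {
  const group = { _id: sex, count: 0 };
  for (const name of Object.values(classNames)) {
    group[name + "_class_alive"] = 0;
    group[name + "_class_perished"] = 0;
  }
  group.total_survived = 0;
  group.total_perished = 0;
  return group;
}

// Group by Sex and add 1 to the counter each document qualifies for,
// which is what summing the 1/0 flags from $cond amounts to.
function summarizeBySex(docs) {
  const groups = new Map();
  for (const doc of docs) {
    if (!groups.has(doc.Sex)) groups.set(doc.Sex, emptyGroup(doc.Sex));
    const group = groups.get(doc.Sex);
    group.count += 1; // count: {$sum: 1}

    let outcome = null;
    if (doc.Survived === 1) outcome = "alive";
    else if (doc.Survived === 0) outcome = "perished";
    const className = classNames[doc.Pclass];

    if (outcome !== null && className !== undefined) {
      group[className + "_class_" + outcome] += 1;
    }
    if (outcome === "alive") group.total_survived += 1;
    if (outcome === "perished") group.total_perished += 1;
  }
  return Array.from(groups.values());
}

const summary = summarizeBySex(samplePassengerDocs);
console.log(summary);
```

Each document increments exactly one of the six class counters and one of the two totals, which is why the class counters in a group add up to its count.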

With display, we'll get a table like the following:

We can select a chart type again so that we can present the data. Select the bar chart and the pop-up window will appear. For this chart, we've selected the "_id" field as the key, corresponding to a passenger's sex, and then for the values we've selected all of the fields related to the passenger classes.

Once OK is clicked, we'll have a chart that looks similar to this:

We can modify the chart again, for example, by changing the orientation to "horizontal" and the type to "stacked", giving us:

And that's all it takes to create some interesting visualizations in a Jupyter notebook.

For further practice, we suggest playing with the various options that you have available to create your own interactive charts with the Titanic data set. Or, if you're comfortable with using PixieDust, MongoDB, and PixieDust-Node, try connecting to MongoDB and start producing your own customized visualizations for your data.

Summary

In this article, we showed you how to connect your Compose MongoDB deployment to a Jupyter notebook and easily create charts using PixieDust. With a single method, display(), we could create charts without the hassle of writing copious amounts of code to get a visually pleasing result. We also connected to MongoDB and created queries using Node.js through a PixieDust extension called PixieDust-Node. This extension allowed us to use JavaScript and import Node.js libraries right within the Jupyter environment. Without writing any Python, except for a few install and import commands, we connected to MongoDB and queried the database using JavaScript within the notebook cells. For users who are new to Python but have a background in JavaScript, PixieDust-Node eases the learning curve of Jupyter notebooks and opens up the Jupyter environment to programmers who might not be fluent in Python, Scala, or R.


If you have any feedback about this or any other Compose article, drop the Compose Articles team a line at articles@compose.com. We're happy to hear from you.

attribution Yeshi Kangrang