Better Decision Making with Watson Machine Learning and Compose


Using data and machine learning can profoundly affect the quality of decisions that you make in business and life. You no longer need to be a stellar data scientist or machine learning guru, Watson Machine Learning and Compose have you covered.

Machine learning, "Big Data", and data science are hot topics. However, machine learning has been something that was only accessible by a small few who knew how to implement complex learning algorithms into code. With Watson Machine Learning (ML), now you can create, train, and deploy self-learning models with the click of a button. Even better, you can use the data from your existing Compose database to gain better data insights and make better decisions, too.

In this article, we'll explore how to set up Watson Machine Learning to predict which passengers survived the Titanic shipwreck using the Titanic: Machine Learning from Disaster dataset on Kaggle. Once we've trained Watson ML on some training data to create a model, we'll then use the test dataset to find out how well our ML model performs to predict which passengers survived the Titanic shipwreck. As a bonus, we'll upload that data into the Kaggle Titanic competition to see how well we do.

Let's start learning ...

The Preliminaries: Downloading Data and Setting up Compose PostgreSQL

You'll need to create a Kaggle account to download the three CSV files needed for our example:

  1. train.csv - the data we'll use to train Watson ML in order to build a customized machine learning model for the Titanic passenger list.
  2. test.csv - the test data we'll use to determine how well the Watson ML model performs on unseen data.
  3. gender_submission.csv - the sample answers to the test data that we'll use to compare to Watson's.

After downloading the data, you'll need a Compose PostgreSQL database. We'll assume that you created one. Once you're logged into your deployment, use your PostgreSQL credentials to log into the psql shell from the terminal. From here, we'll create two tables: one for the data in test.csv and the other from gender_submission.csv.

We'll call the first table passenger_list which has all the columns from the test.csv file. A description of each field can be found here. We've also added a survived column to add in the results we'll get from Watson ML later. The other table we'll call survived_test which holds the gender_submission.csv data.

create table passenger_list (  
  passenger_id int not null primary key,                                                                                                                                                                                     
  pclass int,                                                                                                                                                                                                                
  name text,                                                                                                                                                                                                                 
  sex text,                                                                                                                                                                                                                  
  age float,                                                                                                                                                                                                                 
  sibsp int, 
  parch int,                                                                                                                                                                                                               
  ticket varchar,                                                                                                                                                                                                            
  fare float,                                                                                                                                                                                                                
  cabin varchar,                                                                                                                                                                                                             
  embarked text,
  survived int

create table survived_test (  
  passenger_id int not null primary key,
  survived int 

Copy the data from the CSV files over to the tables with:

\copy passenger_list  (passenger_id, pclass, name, sex, age, sibsp, parch, ticket, fare, cabin, embarked) from '/path/to/test.csv' CSV HEADER

\copy survived_list from '/path/to/Downloads/gender_submission.csv' CSV HEADER 

Now that the tables have been created and the data copied over, let's set up Watson ML with IBM's Data Science Experience ...

Setting up Watson Machine Learning using IBM's Data Science Experience

To set up Watson ML, you need to have an IBM Bluemix account which you'll automatically get when you sign up to IBM's Data Science Experience (DSX) on You'll get a 30-day trial when you set it up and get access to various Watson APIs for the trial period. The reason we're setting up Watson ML on DSX and not Bluemix is that DSX supports the creation of customized Watson ML models.

If you don't have an account, click Sign up on the DSX homepage. If you already have one, let's continue on. Once you've signed up or logged in, you'll see your DSX projects page where you can view your projects or set up new ones. Let's create one now by clicking the Create Project button under Recently Updated Projects, or you can click Create new and select Project from the menu in the upper-right-hand corner of the page.

Once inside, we'll see the following where we have to create a "name" for the project. In this case, we're calling it "Titanic". We'll need to create an Apache Spark instance because Watson ML cooperates with Spark for different functions. Also, we'll need an Object Storage instance since we need to store the training data from a CSV file into our project. Watson ML currently only accepts CSV files to train on. Don't worry, we can do everything from this window.

Once you've filled in the title (the description is optional) click the link under "Spark Service" that says "Create a new IBM Analytics for Apache Spark instance". That will take you to the following page that looks like:

We'll select the "Lite" plan since it's free. Then click Buy Apache Spark - remember, it's free! Once you've clicked on it, it will ask for a "Service Name". You can name it whatever you like; we've called ours "Titanic Spark" - then click Confirm which will take you back to your project page.

If you don't see your new Spark instance on the page, don't worry. It might take a minute or two to show up. Just click on the reload button under "Spark Service" and it will show up once it has provisioned.

Now, we'll have to set up an Object Storage instance for our project. The default is "Object Storage (Swift API)". Use that one for now. Under it click Create a new instance, which will take you to another page to set it up like we did with Spark:

Again, select the "Lite" plan because free is good, and then click Buy IBM Object Storage. Once again, it will ask you to create a name for this instance; we've called ours "Titanic object storage" - then click Confirm which will take you back to your project page.

You also might have to click Reload before the storage instance shows up. You'll see the name of your project within "Target Container" under your instance. This will always be the name of your project and just shows you what the Object Storage is attached to. Once you're done, click on Create at the bottom of the screen.

When your project has been created, you get a project page that looks like:

The page comes with a navigation bar where you can choose to attach data sets, add Jupyter notebooks, add analytics assets like Watson Machine Learning, add collaborators who use DSX, and even connect your project to Github or publish Gists. Click on Analytics Assets where you'll see:

Next to "Models" click add models to the right. Here, we're going to add in Watson ML to create a Watson ML model.

You must create a "Name" for the model; we called ours "Titanic Model". Also, since we don't have a Machine Learning Service set up, we'll have to do that. Click the link Associate a Machine Learning service instance to set one up. That will take you to another screen where you can select the "Free" plan again:

Then click Buy IBM Watson Machine Learning which will then ask for you to create a name for the service; we've called ours "Titanic Machine Learning".

Back on the model page, you'll have to choose whether you want Watson ML to select the best algorithm that works for your data or do it manually. With manual you can select one or more algorithms, run them, and see which one has better performance.

We'll select "Manual" and then click Create at the bottom of the window. Now, we are taken to another screen to "Select data asset" to train Watson ML with.

Click on the button with 1s and 0s on the top right corner of the window. That will open up a side window where you can drag and drop the train.csv file into. The file will be stored in our Object Storage instance. We could create a database connection from here, but currently, we cannot ingest data from a database to create ML training data. Database connections, however, can be used in notebooks. Once the CSV file is uploaded, you'll see the file appear in the window like above. Select the file and click Next.

You should be in the training window that asks you to "Select a technique". This is where you will select the column that needs to be evaluated to predict outcomes, and select feature columns to help Watson ML learn how to make them.

In the first drop-down "Select Label Col", click and find "Survived". Under "Feature columns" you'll want to select "Pclass", "Sex", "SibSp", "Parch", and "Fare". We left out "PassengerId" because it's just a constant for each individual, and "Cabin" since most of the values are missing. "Embarked" has also been left out because it just refers to the port of embarkation, and "Age" since it's a value that doesn't seem to determine whether you survived. As for the "Name" column, we could do some feature engineering to see what other data relationships we could come up with, such as separating titles from the names because prestige may have impacted the likelihood of survival. However, for this article, we'll focus on just generating a basic model for now.

Once that's done, select Binary Classification. The binary classification model will give us prediction scores, stemming from the decision whether the "survival" value (here 0 or 1) is positive or negative, and then we can compare the score against the survival values from the survived_test table.

For the "Validation Split", we can keep it as it is. If you'd like to play with the settings to see how that affects how good the model is, try it out. After that, we need to click on Add Estimators in the top-right corner.

These are the ML algorithms we can use with binary classification. From the pop-up window, select all of them then click Add.

You should be back into your training window now and see the estimators that you selected now listed. Click Next and you can watch them train in real-time:

Once they've trained, you'll see the estimator type that has the best performance at the top of the list based on the area under ROC curve.

Select the best estimator and then click Save. When your model is saved, you'll be taken to the model's "Overview" page showing you the model "Summary" and the "Input Schema".

The "Input Schema" is important, especially when you analyze data that might have null values. Values that are not the correct data type will error out when running data against the model.

In the navigation menu at the top, we have "Evaluation", which gives you the model's results which are the results we got when Watson ML was training. Also, we have "Deployments", which is what we'll go to when we want to deploy the model and use it in an application - you can only deploy one model but save several. Finally, "Test" is where we test our data before deploying it to see if we're getting the right results. Go ahead and click "Test". You'll see the first record from the training data set populating the "Prediction input data" fields. You can click on Predict to run the test data or change the data in the fields to whatever you like. We've randomly selected passenger 915 from test.csv to test the result.

Once the model is tested, let's deploy it to use with applications. Click "Deployments" on the top menu and once you're in that window, select Add Deployment on the right.

It will ask you for the deployment type, which is "online", and for a name; we've called ours Titanic Deployment - then click Deploy. Now you'll see the Titanic Deployment appear.

Click on that. Here's where your deployment's API endpoint is located.

Clicking on the link View API specification will give the details on how to connect to the endpoint using your Watson ML credentials. After that, click the link at the top of the specification that takes you to the IBM Bluemix documentation. There you will find detailed information on the necessary steps to get your credentials.

We'll need the service credentials to set up an application that can interact with Watson ML. In the next steps, we'll create a command-line Python application to do just that.

Building a Python Application with Watson Machine Learning

A template for building a Python application using Watson ML is found here and we'll use that continuing forward. You'll have to be logged into DSX or Bluemix to access it.

We'll need to modify the template slightly for our use case, specifically the payload_scoring variable. To set up the Python application, open up a Python file in your favorite editor. Then import the following libraries along with your Watson ML credentials which should look similar to:

import urllib3, requests, json

  "url": "",
  "access_key": "***",
  "username": "***",
  "password": "***",
  "instance_id": "***"

From the template, copy the following as well:

headers = urllib3.util.make_headers(basic_auth="{username}:{password}".format(username=wml_credentials["username"], password=wml_credentials["password"]))  
url = "{}/v3/identity/token".format(wml_credentials["url"])  
response = requests.get(url, headers=headers)  
mltoken = json.loads(response.text).get("token")

header = {"Content-Type": "application/json", "Authorization": "Bearer " + mltoken}  

These lines will build the appropriate request and store the access token we'll need to use for Watson ML. Now we'll add another variable we'll call scoring_url which will hold the Titanic Deployment endpoint that was given to us when we deployed the Titanic Model online like:

scoring_url = ""  

Those preliminaries are set up, but we'll have to get our test data that was stored in our Compose PostgreSQL database. Here, we'll import the psycopg2 library and store the deployment's credentials in a dictionary.

import psycopg2

compose = {  

We'll then use those credentials as inputs in a connection string and connect to our database then fetch all of the results like:

conn_string = "host={host} port={port} dbname={database} user={user} password={password}".format(host=compose["host"], port=compose["port"], database=compose["database"], user=compose["user"], password=compose["password"])  
conn = pg.connect(conn_string)

cursor = conn.cursor()  
cursor.execute("SELECT * FROM passenger_list")  
results = cursor.fetchall()  

With the results stored, we'll have to modify the data a little before we can pass it to Watson ML. To do that, we've come up with a couple of functions and we'll walk you through those. The first function get_passenger_payload gets the data from PostgreSQL and modifies it into a JSON document. That document will be used as the payload when sending a request to Watson ML.

def get_passenger_payload(results):  
    passengers = list()

    for result in results:

    for passenger in passengers:
        # convert all names to empty string due to formatting issues
        if passenger[2]:
            passenger[2] = ""
        # convert all None values to 0 for the age
        if passenger[4] is None:
            passenger[4] = 0
        # convert fares that have no value to 0    
        if passenger[8] is None:
            passenger[8] = 0
        # convert classes with a None value to an empty string    
        if passenger[9] is None:
            passenger[9] = ""

    return {"fields": ["PassengerId", "Pclass", "Name", "Sex", "Age", "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked"], "values": passengers}

full_payload = get_passenger_payload(results)  

Since fetchall returns a tuple, we'll convert passenger values to a list then append each passenger to the passengers list. We have multiple passengers who need to be scored. Watson ML requires those values to be passed into a JSON document with a "fields" key representing the input schema and a "values" key set to a 2D list of passenger data.

In order to avoid formatting errors, we've taken the liberty to change the "Name" of each passenger to an empty string. Also, we've populated "Age" with a 0 if none is given, and changed "Class" to an empty string if a null value is found there, too. "Fare" is the only value the Watson ML model looks that we've also converted (passenger[8]) since one passenger has a null value. This is an insignificant change for this exercise because we change only one value in the entire dataset.

The next function sends the payload to Watson ML and returns the results:

def run_ml_titanic(payload):  
    result_scores =, json=payload, headers=header)
    return result_scores

passenger_scores = run_ml_titanic(full_payload)  

Viewing the results of that with something like result_scores.test will give you back a JSON document that looks like:

  "fields": ["PassengerId", "Pclass", "Name", "Sex", "Age", "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked", "features", "rawPrediction", "probability", "prediction", "nodeADP_class", "nodeADP_classes"],
  "values": [[1195, 3, "", "male", 24.000000, 0, 0, "315092", 8.662500, "", "S", [3.624166185858382, 0.0, 0.0, 0.0, 0.1852364586246106], [17.233219279112507, 2.7667807208874935], [0.8616609639556254, 0.13833903604437467], 0.0, 0.0, [0.0, 1.0]], ...

With this, we need to get the PassengerId and probability values so we can assign a 0 or 1 to each passenger, depending on whether they perished or survived, respectively. The last function we've written will do just that and return a list of tuples with the PassengerId and the survival value.

def parse_results(scores):  
    # we need to load in the JSON document we got back from Watson ML and look only at the "values" key 
    ml_results = json.loads(scores.text)
    passenger_survival = list()

    for value in ml_results['values']:
        survived = 0
        passenger_id = value[0]

        # value[13] is the probability 
        if value[13][0] < value[13][1]:
            survived = 1
        passenger_survival.append((passenger_id, survived))

    return passenger_survival

survival = parse_results(passenger_scores)  

To see the performance of our model, we'll send the data back to our PostgreSQL database and update the values of the survived column. To do that, we'll execute another query using psycopg2, run commit on the database connection, then close the database connection cursor. Remember to use commit, the data will not be saved to PostgreSQL.

    UPDATE passenger_list 
        SET survived = s.survived 
    FROM unnest(%s) s(p_id int, survived int) 
        WHERE passenger_list.passenger_id = s.p_id
, (survival,))


Accessing your database, you should see your survived column populated with the survival values. Joining this table with the survived_test table should give us an indication of how well the Watson ML model performed. It turns out, Watson didn't do too bad; out of 419 passengers only 39 were incorrectly classed and most of them were women who were 3rd class passengers. This might mean that Watson ML determined that most 3rd class passengers perished, and therefore these women probably perished as well.

For an added bonus, if you want to see how our Watson ML model does in the Kaggle Titanic competition, save the results in a CSV file with the following code added into your Python file then submit it. We got a score of 0.77511 without doing any real data manipulation, which is not that bad.

with open("titanic_results.csv", "w") as csvfile:  
    write_file = csv.writer(csvfile)
    write_file.writerow(['PassengerId', 'Survived'])
    for passenger in survival:
        p_id = passenger[0]
        prediction = passenger[1]
        write_file.writerow([p_id, prediction])


Watson ML offers a plethora of possibilities for gaining insights into your data to make smarter decisions. As you can see, without knowing a lot about data science or machine learning, we were able to come up with a fairly reliable machine learning model to make valid predictions on a small dataset. With larger datasets to train on, it's reasonable to assume that Watson ML will give us better results. With that said, you should give Watson ML a try because it makes machine learning and data science even more accessible to the layperson, decision maker, and novice data scientist.

Read more articles about Compose databases - use our Curated Collections Guide for articles on each database type. If you have any feedback about this or any other Compose article, drop the Compose Articles team a line at We're happy to hear from you.

attribution William Bout

Abdullah Alger
Abdullah Alger is a former University lecturer who likes to dig into code, show people how to use and abuse technology, talk about GIS, and fish when the conditions are right. Coffee is in his DNA. Love this article? Head over to Abdullah Alger’s author page to keep reading.

Conquer the Data Layer

Spend your time developing apps, not managing databases.