Graph 101: Getting Started with Graphs

The Graph 101 series is an introduction to graph databases for developers. Whether you're a seasoned expert looking to expand your knowledge or a newcomer dipping a toe into the data layer for the first time, this series will help you find your bearings.

Graph databases are an excellent way to model complex relationships in dynamic systems, but it can be tough to figure out whether a graph database is the right answer for you. It can be even tougher to understand how to make use of existing graph data sets in your application.

In this article, we'll take a look at some existing data sets you can use to evaluate whether a graph database fits your needs. We'll also look at the formats in which graph data sets are commonly distributed, and learn how to import these data sets into Compose JanusGraph.

Let's get started.

How Do You Know A Graph Is Your Answer?

A lot of data lends itself to graphs, even data we usually force into hierarchies. We like to believe that the world is neatly structured, but most real-world relationships are messy and irregular. Graph databases are designed to model that loosely structured nature of life directly.

Graphs model relationships between different entities, but so do traditional relational databases, and it can be challenging to decide whether your data is suitable for storage in a graph database. So how can you tell the difference?

While there isn't necessarily a right or wrong answer here, there are a few indicators that your data may be well suited to a graph database.

While relational databases are good at establishing simple relationships between entities, they are poor at describing those relationships: a foreign key says that two rows are related, but not how, and attaching data to the relationship itself is awkward. If the relationships between your entities are important or complex, that's one indicator that a graph database is a good choice. The connections between users in a social network are a good example, because people are connected to each other in varying and complex ways.

Another good use case for graph databases is creating dynamic relationships between entities, such as a product recommendation engine that surfaces products frequently bought together by other users of an e-commerce store; the sketch below shows the general shape of such a query.
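
To make that concrete, here's a minimal sketch of what a co-purchase recommendation might look like as a Gremlin traversal. The bought edge label and productId property are hypothetical names chosen for illustration, not from any particular data set:

// Hypothetical schema: user vertices connect to product vertices
// via 'bought' edges. Starting from product 42, find what else its
// buyers bought, excluding product 42 itself, and tally the results
// so the most frequently co-purchased products rank highest.
g.V().has('productId', 42).
  in('bought').
  out('bought').
  has('productId', neq(42)).
  groupCount().by('productId')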

Example Graph Data Sets

There are many prominent open-source data sets that we can explore. The Stanford Large Network Dataset Collection (SNAP) can give us an idea of the types of data that can be stored in a graph database. The data sets range from networks of Amazon co-purchased products to a Twitter meme tracker.

The International Consortium of Investigative Journalists (ICIJ) hosts a downloadable graph database of the infamous Panama Papers, which maps the connections between prominent individuals and offshore companies and accounts.

A departure from the first two is the data hosted at LinkedData.org, which establishes a graph data set across many different websites using semantic web tags. LinkedData contains a list of dozens of websites with related information and establishes relationships between those sites and data points.

Analyzing Graph Data Sets

Graph data sets come in many different formats, so we'll have to make sure we understand how those work.

The most common format for distributing graph data sets is a set of CSV files: one representing the nodes in the graph, a second representing the edges, and often a third containing metadata. This is how most of the files in the SNAP collection are stored.

The edge and node files can be simple or highly complex. Let's take a look at two contrasting data sets: a list of related products on Amazon, and the Panama Papers from the ICIJ.

The Amazon data set has simple edges that directly relate one node to another with little other information. A simple tab-separated values (TSV) file of edges from the Amazon data set might look like this:

# Edges
# Directed graph (each unordered pair of nodes is saved once): Amazon0302.txt 
# Amazon product co-purchasing network from March 02 2003
# Nodes: 262111 Edges: 1234877
# FromNodeId    ToNodeId
0    1  
0    2  
0    3  
0    4  
0    5  
1    0  
1    2  
1    4  
1    5  
1    15  
2    0

Meanwhile, the nodes are highly complex data records that might be structured like the following:

Id:   15  
ASIN: 1559362022  
  title: Wake Up and Smell the Coffee
  group: Book
  salesrank: 518927
  similar: 5  1559360968  1559361247  1559360828  1559361018  0743214552
  categories: 3
   |Books[283155]|Subjects[1000]|Literature & Fiction[17]|Drama[2159]|United States[2160]
   |Books[283155]|Subjects[1000]|Arts & Photography[1]|Performing Arts[521000]|Theater[2154]|General[2218]
   |Books[283155]|Subjects[1000]|Literature & Fiction[17]|Authors, A-Z[70021]|( B )[70023]|Bogosian, Eric[70116]
  reviews: total: 8  downloaded: 8  avg rating: 4
    2002-5-13  cutomer: A2IGOA66Y6O8TQ  rating: 5  votes:   3  helpful:   2
    2002-6-17  cutomer: A2OIN4AUH84KNE  rating: 5  votes:   2  helpful:   1
    2003-1-2  cutomer: A2HN382JNT1CIU  rating: 1  votes:   6  helpful:   1
===

In the Amazon file, the edges represent a simple relationship between nodes: a from field and a to field are sufficient to represent an edge.

The Panama Papers edges are more complex. Each one carries data of its own, including the source of the connection and one of a rich set of relationship types:

node_1,rel_type,node_2,sourceID,valid_until,start_date,end_date  
56283,Nominee Director of,122638,Offshore Leaks,The Offshore Leaks data is current through 2010,,  
55254,Nominee Shareholder of,52675,Offshore Leaks,The Offshore Leaks data is current through 2010,,  
55560,Nominee Shareholder of,99767,Offshore Leaks,The Offshore Leaks data is current through 2010,,  
55737,Nominee Protector of,122532,Offshore Leaks,The Offshore Leaks data is current through 2010,,  
55737,Nominee Protector of,122653,Offshore Leaks,The Offshore Leaks data is current through 2010,,  
55737,Nominee Protector of,68508,Offshore Leaks,The Offshore Leaks data is current through 2010,,  
55737,Nominee Protector of,108327,Offshore Leaks,The Offshore Leaks data is current through 2010,,  
56917,Nominee Shareholder of,122728,Offshore Leaks,The Offshore Leaks data is current through 2010,,  
56917,Nominee Shareholder of,117719,Offshore Leaks,The Offshore Leaks data is current through 2010,,  
56917,Nominee Shareholder of,87629,Offshore Leaks,The Offshore Leaks data is current through 2010,,  
56917,Nominee Shareholder of,118997,Offshore Leaks,The  

Meanwhile, the nodes representing officers of the companies in the Panama Papers data set are less complex than the Amazon nodes:

name,icij_id,valid_until,country_codes,countries,node_id,sourceID,note  
KIM SOO IN,E72326DEA50F1A9C2876E112AAEB42BC,The Panama Papers data is current through 2015,KOR,South Korea,12000001,Panama Papers,  
Tian Yuan,58287E0FD37852000D9D5AB8B27A2581,The Panama Papers data is current through 2015,CHN,China,12000002,Panama Papers,  
GREGORY JOHN SOLOMON,F476011509FD5C2EF98E9B1D74913CCE,The Panama Papers data is current through 2015,AUS,Australia,12000003,Panama Papers,  
MATSUDA MASUMI,974F420B2324A23EAF46F20E178AF52C,The Panama Papers data is current through 2015,JPN,Japan,12000004,Panama Papers,  
HO THUY NGA,06A0FC92656D09F63D966FE7BD076A45,The Panama Papers data is current through 2015,VNM,Viet Nam,12000005,Panama Papers,  
RACHMAT ARIFIN,14BCB3A8F783A319511E6C5EF5F4BB30,The Panama Papers data is current through 2015,AUS,Australia,12000006,Panama Papers,  
TAN SUN-HUA,C3912EA62746F395A64FB216BE464F61,The Panama Papers data is current through 2015,PHL,Philippines,12000007,Panama Papers,  
Ou Yang Yet-Sing and Chang Ko,DB896EE47F60BB1B2E9EA9C10ACBFCD7,The Panama Papers data is current through 2015,TWN,Taiwan,12000008,Panama Papers,  

Looking at these data sets, it's clear that a "one-size-fits-all" approach doesn't make sense in graph database settings. Let's take a look at how we can import and manipulate data sets.

Importing Data Sets

Since data sets vary so widely, there's no standard way of importing graph data sets into a graph database. Every graph database has its own method for importing data, and there are different strategies depending on the size of the graph. For this article, we'll use the Amazon related products data set from Stanford SNAP.

Compose JanusGraph doesn't let you upload files to Compose servers, so the file import plugin is disabled on the remote server. To import files, we'll use a technique that parses them in our local JanusGraph console and builds a script that reconstructs the graph from their contents. The script can then be run on our remote Compose JanusGraph instance.

Compose JanusGraph uses the Gremlin console, which allows developers to write import scripts using the Groovy programming language. These import scripts are parsers that loop through the lines of each file and create edges and vertices in JanusGraph programmatically based on the data in that file. Since each file is different, we'll have to create a new parser for every data set we want to import.

Writing the Parser

First, if you haven't already, create a JanusGraph deployment in Compose, then download and install the Gremlin console. If you haven't used Compose JanusGraph before, check out our previous article on creating Markov chains in JanusGraph for a good introduction. We'll be using the Gremlin console to create our script.

Next, download the edges file amazon0302.tar.gz and extract it so that we have a plain text file. This is the file we'll read to create our initial database.

We'll want to create an index on the product ID and let JanusGraph know that it applies to vertices. Open up the Gremlin console and connect to your Compose JanusGraph deployment using the :remote command:

gremlin> :remote connect tinkerpop.server conf/compose.yaml session  

Then, either create or open the amazon graph using the following commands:

// Use these if you don't already have a graph called "amazon"
gremlin> :> def graph = ConfiguredGraphFactory.create("amazon")  
gremlin> :> graph.tx().commit()  
// If you have already created the above and are reconnecting, use the following instead
gremlin> :> def graph = ConfiguredGraphFactory.open("amazon")  

Now, we can create our index and define a graph traversal source g on the server, which our generated script will use to add products as vertices to this graph:

gremlin> :> graph.createIndex('productId', Vertex.class)  
gremlin> :> def g = graph.traversal()  

Let's create a variable to store our final script. This variable is local to your Gremlin console, so we won't send this command to Compose yet:

gremlin> script = ""  

Next, we'll create a local function called getOrCreate that runs for each product ID on a line of the file. It appends a command to the script that either retrieves the vertex for the product with the given ID or creates it, assigns the vertex to a server-side variable, and returns that variable's name so we can refer to the vertex later in the script.

gremlin> getOrCreate = { id ->  
  script += "v" + id + " = g.V().has('productId', " + id + ").tryNext().orElseGet{ g.addV('productId', " + id + ").next() }\n"
  return "v" + id
}

Now that we have the script-building logic in place, we can open the text file and execute that logic on each line of the file.

new File('amazon0302.txt').eachLine {  

This will iterate through the file line by line; for each line, Groovy's implicit it variable is automatically filled with the content of the current line.

We'll want to split each line on the tab character \t, since this is a tab-separated list, and run the getOrCreate function on both productIds. That either retrieves or creates the corresponding vertices and hands back the server-side variable names we'll use in a moment. We'll also want to ignore comment lines, which start with a # sign. We can do both like this:

  if (!it.startsWith("#")) { // ignore comment lines
    (fromVertex, toVertex) = it.split('\t').collect(getOrCreate)
  }

Now that we have the server-side names for the two vertices (also called nodes), let's add a line to the script that creates an edge between them. Since this data represents two products that were often purchased by the same people, we'll call the relationship usersAlsoPurchased. We'll also close out the if block and the eachLine loop:

    script += "g.V(" + fromVertex + ").addEdge('usersAlsoPurchased', g.V(" + toVertex + ")"

Once the loop is done running, our script should contain all of the commands needed to create our graph in Compose JanusGraph. To run the script on our remote JanusGraph, do the following:

:> @script

Putting it all together in the Gremlin console, we get the following:

gremlin> :remote connect tinkerpop.server conf/compose.yaml session  
// Use these if you don't already have a graph called "amazon"
gremlin> :> def graph = ConfiguredGraphFactory.create("amazon")  
gremlin> :> graph.tx().commit()  
// If you have already created the above and are reconnecting, use the following instead
gremlin> :> def graph = ConfiguredGraphFactory.open("amazon")  
/////
gremlin> :> graph.createIndex('productId', Vertex.class)  
gremlin> :> def g = graph.traversal()  
gremlin> script = ""  
gremlin> getOrCreate = { id -> script += "v" + id + " = g.V().has('productId', " + id + ").tryNext().orElseGet{ g.addV('productId', " + id + ").next() }\n"; return "v" + id }  
gremlin> new File('amazon0302.txt').eachLine { if (!it.startsWith("#")) { (fromVertex, toVertex) = it.split('\t').collect(getOrCreate); script += fromVertex + ".addEdge('usersAlsoPurchased', " + toVertex + ")\n" } }  
gremlin> :> @script  

We can now traverse our graph to find all of the products that were purchased alongside a given product. For example, using product 0 from the edges sample above:

gremlin> :> g.V().has('productId', 0).out('usersAlsoPurchased').values('productId')  
==>1
==>2
==>3
==>4
==>5

The results are product IDs, because the edges file is all we imported; to display human-readable titles instead, you would also need to load properties such as the title field from the Amazon metadata file we looked at earlier.
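
We can also go a hop further and look at the products co-purchased with the co-purchases. Here's a sketch of a crude recommendation query over the same schema we just built, where groupCount tallies how often each product appears:

gremlin> :> g.V().has('productId', 0).out('usersAlsoPurchased').out('usersAlsoPurchased').groupCount().by('productId')  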

Note: If you're having trouble running these commands, make sure you're running them on the remote server with the :> operator. It's not part of the prompt, but an actual command telling the Gremlin console to run what follows on Compose JanusGraph.

You may also get a Java heap space error when loading large data sets. You'll need to increase the heap space available to your local Java interpreter to accommodate the memory used by the import script. You can also batch up the script execution to reduce memory overhead, as sketched below.
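
As a sketch of the batching idea, assuming the same parser as above (BATCH_SIZE is an arbitrary value chosen for illustration), we can interleave commits into the generated script so the server never has to apply the whole import in one enormous transaction:

BATCH_SIZE = 500  // arbitrary; tune to your deployment
lineCount = 0
new File('amazon0302.txt').eachLine {
  if (!it.startsWith("#")) {
    (fromVertex, toVertex) = it.split('\t').collect(getOrCreate)
    script += fromVertex + ".addEdge('usersAlsoPurchased', " + toVertex + ")\n"
    if (++lineCount % BATCH_SIZE == 0) {
      script += "graph.tx().commit()\n"  // commit this batch on the server
    }
  }
}

This doesn't shrink the script string held in local memory, but it does keep each remote transaction small; for the local side, raising the console's JVM heap is the more direct fix.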

Moving Forward

Now that you know how to use the Gremlin console and Groovy with Compose JanusGraph, you can import other data sets and run more complex queries and traversals on them. In the next articles in this series, we'll look at the various graph algorithms available to you in JanusGraph, and explore them using real data sets to solve real problems.




