We sat down with Christopher Quinones, Project Lead and Developer on Compose for JanusGraph, to get the scoop on graph databases, the JanusGraph community, and the team's usage of Scylla.
Let’s start out with the basics: What is a graph database? How is it different from other types of databases?
Graph databases are used to store data modeled as nodes and relationships. There are three use cases that I like to think about when comparing a graph database with another database.
Firstly, there's storing a very simple relationship. Say two people meeting at a venue for the first time. In a tabular database, one would need to create two tables to represent our nodes (person and venue) and one additional table to represent the relationship between those nodes (person meeting person at venue). Querying that data for this specific relationship would require one to start joining tables together and that could grow more complex as the data model evolves.
A graph database abstracts all of that from the user and allows one to think about the data more naturally, as things and their relationships to one another, which is more intuitive for human beings. The traversal languages that power these graph databases also allow the user to express a query in a manner that’s easy to understand.
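To make the contrast concrete, here is a minimal sketch in Python of the "two people meeting at a venue" example. It is an illustration only and does not use JanusGraph or Gremlin: the table names, node ids, and the choice to store the venue as a property on a `met` edge are all assumptions made for the example.

```python
import sqlite3

# Tabular model: two node tables plus one relationship table (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE venue   (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE meeting (person_a INTEGER, person_b INTEGER, venue_id INTEGER);
    INSERT INTO person  VALUES (1, 'Alice');
    INSERT INTO person  VALUES (2, 'Bob');
    INSERT INTO venue   VALUES (10, 'Cafe');
    INSERT INTO meeting VALUES (1, 2, 10);
""")

def who_met_sql(name):
    """Answer 'whom did `name` meet, and where?' by joining all three tables."""
    return conn.execute("""
        SELECT p2.name, v.name
        FROM meeting m
        JOIN person p1 ON p1.id = m.person_a
        JOIN person p2 ON p2.id = m.person_b
        JOIN venue  v  ON v.id  = m.venue_id
        WHERE p1.name = ?
    """, (name,)).fetchall()

# Graph model: nodes with properties, and labeled edges between node ids.
nodes = {
    "alice": {"label": "person", "name": "Alice"},
    "bob":   {"label": "person", "name": "Bob"},
    "cafe":  {"label": "venue",  "name": "Cafe"},
}
# Each edge is (source id, edge label, target id, edge properties).
edges = [("alice", "met", "bob", {"venue": "cafe"})]

def who_met_graph(node_id):
    """Answer the same question by following 'met' edges out of one node."""
    return [
        (nodes[dst]["name"], nodes[props["venue"]]["name"])
        for src, label, dst, props in edges
        if src == node_id and label == "met"
    ]
```

Both functions return the same answer for Alice. The difference is that the graph version reads as "start at a node and follow an edge," while the tabular version has to spell out every join.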
Secondly, there are the situations when you need to introduce new properties for a node. In a tabular database, there's quite an effort involved as it requires either adding a column or creating a new table for a particular type.
With a graph database, one can easily introduce a new property into a node without having to introduce it across all nodes of the same type.
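As a rough sketch of that idea, the snippet below models nodes as plain Python dictionaries and adds a property to a single node. The node ids and the `nickname` property are hypothetical, and a real graph database may also let you define schema constraints that this sketch ignores.

```python
# Two nodes of the same type ("person"), each with its own property map.
nodes = {
    "alice": {"label": "person", "name": "Alice"},
    "bob":   {"label": "person", "name": "Bob"},
}

def set_property(node_id, key, value):
    """Attach a property to one node; other nodes of the same type are untouched."""
    nodes[node_id][key] = value

# Only Alice gets the new property; there is no column to add for every person.
set_property("alice", "nickname", "Al")
```

After this call, Alice's node carries a `nickname` and Bob's does not, with no migration step in between.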
Thirdly, there is the process of introducing or completely changing the relationship structure of your data model. With tabular databases, this can involve a lot of heavy lifting, creating new indexes and tables to represent the changes.
Since a relationship in a graph database is nothing more than a pointer between two nodes, restructuring is simple. One can change the structure of the database by issuing a few queries to find the nodes for the relationship and then introduce or customize the relationships.
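The following Python sketch illustrates that "find the nodes, then add the relationship" pattern on a toy edge set. The edge labels `employed_by` and `works_with` and all of the names are invented for the example; a real deployment would express this as traversal queries rather than in-memory set operations.

```python
from itertools import combinations

# Existing relationships, stored as (source, label, target) triples.
edges = {
    ("alice", "employed_by", "acme"),
    ("bob",   "employed_by", "acme"),
    ("carol", "employed_by", "initech"),
}

def add_colleague_edges(existing):
    """Return a new edge set with a 'works_with' edge between co-workers."""
    # Step 1: find the nodes that share an employer.
    staff_by_company = {}
    for src, label, dst in existing:
        if label == "employed_by":
            staff_by_company.setdefault(dst, []).append(src)

    # Step 2: introduce the new relationship between each pair of co-workers.
    updated = set(existing)
    for staff in staff_by_company.values():
        for a, b in combinations(sorted(staff), 2):
            updated.add((a, "works_with", b))
    return updated

restructured = add_colleague_edges(edges)
```

The original edges are left in place, and the new relationship type is added alongside them without creating tables or indexes.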
What are some of the ideal use cases for graph databases, and why is a graph database better for those solutions?
Most things in this world can be modeled in relationship to one another. Some of the ideal use cases that come to mind are social networks, Internet of Things, recommendation engines, fraud detection, and master data management. All of the questions that would be asked in these use cases are in the context of a relationship to another object in the graph, such as “tell me other people that I might know because I know X” or “tell me other systems that this person’s information is in.”
A graph database makes it easy to model, write, and visualize such queries. The data sets in all of these use cases are evolving and it’s important for the data model to also be flexible enough to evolve in parallel. Take fraud, for example. We could start with a very simple graph to identify fraud risk based on an individual’s relationship to other individuals and later decide to overlay additional information and relationships into that graph so that we can enhance the types of questions being asked.
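The "people I might know because I know X" question can be sketched as a two-hop traversal. The example below uses a plain Python adjacency map with made-up names. It illustrates the shape of the query and is not how JanusGraph executes it.

```python
# Undirected "knows" relationships as an adjacency map (hypothetical data).
knows = {
    "me": {"x"},
    "x":  {"me", "y", "z"},
    "y":  {"x"},
    "z":  {"x"},
}

def people_you_may_know(graph, person):
    """Return friends-of-friends who are not already direct friends."""
    direct = graph.get(person, set())
    suggestions = set()
    for friend in direct:
        # Second hop: everyone each direct friend knows.
        suggestions |= graph.get(friend, set())
    # Drop the person themselves and anyone they already know.
    return suggestions - direct - {person}
```

Here `people_you_may_know(knows, "me")` yields `y` and `z`, both reached through `x`.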
For users that already have their data in MongoDB or PostgreSQL or Cloudant, etc., how would they use JanusGraph alongside those other databases?
JanusGraph can be used as a primary data store or as a complementary data store. Which database one picks depends on the use case, requirements for the application, and current landscape.
If one’s data is highly interconnected and an application will be asking lots of questions based on relationships, a graph database is a good option. You’d be surprised by what you can learn from visualizing a graph data model. If one can foresee the data model evolving (e.g., properties, relationships, new nodes), a graph database will make it easier to adapt the data model. If one’s application can tolerate eventual consistency and queries always start from a single node and traverse outward (aka OLTP), our hosted JanusGraph service is optimized for those use cases and you can scale your deployment in seconds to support increased workload and data.
If the landscape is such that many systems of record exist, a user can use a graph database as a complementary database by ingesting a subset of their data to ask the relationship-based questions that each isolated system can’t answer on its own. Master Data Management is an example of such a landscape.
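A small Python sketch of that complementary pattern: ingest only an identifying key from each system of record, link it to the system it came from, and then ask a cross-system question. The system names, record fields, and the `recorded_in` edge label are all assumptions for illustration.

```python
# Records as they might exist in two separate systems of record (made-up data).
sources = {
    "crm":     [{"email": "a@example.com", "name": "Alice"}],
    "billing": [{"email": "a@example.com", "plan": "pro"},
                {"email": "b@example.com", "plan": "free"}],
}

def build_graph(systems):
    """Ingest a subset of each system: one edge from a person key to its system."""
    edges = set()
    for system_name, records in systems.items():
        for record in records:
            edges.add((record["email"], "recorded_in", system_name))
    return edges

def systems_for(edges, email):
    """Which systems hold this person's information?"""
    return sorted(
        dst for src, label, dst in edges
        if src == email and label == "recorded_in"
    )

graph = build_graph(sources)
```

Neither source system can answer "where else does this person appear?" on its own, but `systems_for(graph, "a@example.com")` can, because the graph holds the links between them.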
IBM Compose for JanusGraph uses Scylla, a wide-column database and drop-in replacement for Cassandra. What advantages does Scylla offer over other backend solutions?
JanusGraph continues to evolve as part of the efforts of the open source community. At the time of our implementation, JanusGraph supported five different backends: BerkeleyDB, HBase, BigTable, Cassandra, and In Memory. All of these backends are good, but one needs to understand their application’s requirements before selecting the right tool for the job.
In our case, our goal was to offer a highly-available service that is partition tolerant. Out of these backends, Cassandra fit the bill as it’s oriented towards availability and partition tolerance over consistency. As the name implies, the “In Memory” backend does not persist data outside of memory, which eliminated it as an option. BerkeleyDB is more geared towards data on a single node. HBase and BigTable are more oriented towards consistency and partition tolerance over availability.
We use Scylla as a drop-in replacement for Cassandra. It has numerous advantages over Cassandra, such as user-space networking, avoiding the page cache and thrashing by providing its own row cache, and a shared-nothing thread model.
Why did IBM get involved with the JanusGraph project?
IBM has a number of products and solutions that rely on graph technology to pull insights from highly interconnected data. Our developers are eager to make contributions to move graph technologies forward. We worked together with companies like Expero, Google, GRAKN.AI, and Hortonworks to start the JanusGraph project, forked from the TitanDB code, so that we can continue to make contributions and back it with the support of a community. It’s very exciting to see the community of JanusGraph developers grow both inside and outside of IBM.
Your team has been actively committing to that project. What are some of the goals you have?
We’ve recently been contributing changes to the Apache TinkerPop and JanusGraph projects in support of hosting JanusGraph securely on a cloud platform connected to a Scylla backend. This includes performance and graph management enhancements, as well as fixes for defects we found while operating a graph service over the last year. As a future goal, I’d like to see some improvement around ingesting data into the cloud service and being able to visualize the results of a query to aid with developing graph models and queries.
You didn’t start out working on graph databases. What is your background and what did you have to learn to become a proficient user?
I started off my career as a software developer focusing on automating deployments and testing, developing Java-based applications and tooling, and consulting with large customers to automatically deploy new software across their enterprise.
Once you understand that graph databases are about modeling your data as nodes and relationships, and are very flexible, you really start to think about all of the problems you’re presented with as "graph problems." Also, there is a lot of documentation and many tutorials out there in the Apache TinkerPop community, the JanusGraph community, and IBM developerWorks that helped me understand and experiment with graph data models and queries.
There's a range of resources that I recommend for people getting started:
- The JanusGraph on Compose documentation
- The Learning Center for IBM Graph
- The video on Overcoming development challenges with IBM Graph
- And episode 12 of The New Builders: Of Graphs and Gremlins – Graph Database 101
Inspecting sample applications is also a great way to learn more about using graph databases. Here are some that are worth taking a look at:
- Search Slack with IBM Graph
- Have you had "The Talk" with your chatbot about graph data structures?
- Six Degrees of Kevin Bacon
With the project being built around open source projects, the official JanusGraph and TinkerPop documentation is also good to refer to, especially if you’re looking for more hands-on experience and knowledge of the tools at play.
Thanks, Christopher! So, to pull all this together: graph databases are a great way to persist dynamic data sets. They allow for richer models and queries that treat relationships as first-class citizens. The persistence in Compose for JanusGraph is powered by Scylla, a drop-in replacement for Cassandra. And JanusGraph itself is an exciting open source project with a growing developer community.
Interested in learning more? Join us July 25th, 2017 for a webinar led by Keith Lohnes, "Gremlin Traversals for the SQL User."
If you haven't discovered Compose yet, you can sign up for a free 30-day trial below. Already use Compose? You can deploy JanusGraph and we'll add a credit to your account so you can try it out and get the first 5GB of a JanusGraph deployment free for the first month.