MongoDB and RethinkDB: Document-Based Rivalry

2009 was quite a year for web development. MongoDB, the modern, document-based, non-relational database management system, was released into the wild, and so, too, was Node.js. Suddenly, those looking to dive into web development or refactor their existing projects had a pair of revelations. JavaScript for the front-end and the back-end? A JSON-based document database with dynamic schemas?

Great excitement! It seemed as though the promises made by these new technologies had finally caught up with the imagination and "real-time" performance expectations of the modern-day developer. And, best of all, the barrier to entry into the software world had - seemingly - been lowered.

Many developers will admit that MongoDB fundamentally changed the way database management systems are designed and approached. However, years in the spotlight have worn off some of the initial lustre. As MongoDB was used within intensive applications at scale, it became clear that there were significant caveats to go along with the benefits of using this new system over its relational, SQL brethren.

Being the first one with new ideas is never easy. Now, 6 years later in 2015, we are still seeing patches (like MongoDB 3.2 and the promise of $lookup to achieve join-like capabilities) released to pay off some of the technical debt created by early promises.

Following MongoDB's pioneering innovation, we've seen some impressive rivals enter the space. In this article, we will be looking at some key features (technical or otherwise) of two fantastic document-based databases: MongoDB and RethinkDB.

RethinkDB

RethinkDB is considered a strong choice for those building applications that need to live and breathe in real-time. When a user makes a request, the application needs to respond at near-instantaneous speeds. Like MongoDB, RethinkDB is a fast and flexible JSON-based database management system. It is not the best choice if you need strict ACID compliance or a rigid schema to adhere to.

A relative newcomer, RethinkDB was released to the public in 2011. It boasts the "best of both worlds", with a modern, functional, dynamic and flexible approach to data storage and access intertwined with features users have become accustomed to using in relational-type data stores.

We'll take a look at a few unique features that make RethinkDB shine. First, a tremendous asset of RethinkDB: its own querying language, ReQL.

ReQL

ReQL is modelled after Lisp and Haskell. It's a language built upon functional programming principles. Uniquely, ReQL can 'chain' querying commands together to make light-weight server-side queries, embedded directly into the language (Python, JavaScript, and Ruby, at present) you've used to build your application. The queries are efficiently lazy, capable of doing math, and automatically split themselves across the members of your cluster. Spinning up a Compose RethinkDB instance, for example, will give you three members in your cluster that these queries would be spread over.
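To make the chaining concrete, here's a minimal JavaScript sketch against a hypothetical cards table (we'll actually build one shortly) - the field names are assumptions for illustration:

r.table('cards')
  .filter(r.row('rarity').eq('Rare'))  // narrow the sequence on the server
  .pluck('name', 'manaCost')           // keep only the fields we care about
  .orderBy('name')                     // sort the trimmed results
  .limit(10)                           // take just the first ten
  .run(conn, callback)                 // nothing is sent to the server until .run()

Each link in the chain describes a transformation; the query is composed in your application language but executed, as one unit, by the database.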

Keeping up with the demands of a database designed to be dynamic, RethinkDB has powerful abilities baked right into their custom data browser, controlled through ReQL.

HTTP Querying

Let's put our imaginations to work... You play a card game that was invented in 1993. The game involves various creatures, magical spells, enchantments, land-types - all sorts of fantastically drawn and detailed pieces of cardboard wizardry.

You hope to one day create an application as a homage to this game in which you've invested your time and money. You notice that a small website has detailed card information for each set that's been released since the game's inception - the data they've provided is in JSON format.

As a developer, this is music to your ears - JSON is the foundation upon which document-based database management systems have been built.


...

        {
            "artist": "Christopher Rush",
            "cmc": 0,
            "id": "c944c7dc960c4832604973844edee2a1fdc82d98",
            "imageName": "black lotus",
            "layout": "normal",
            "manaCost": "{0}",
            "multiverseid": 3,
            "name": "Black Lotus",
            "rarity": "Rare",
            "reserved": true,
            "text": "{T}, Sacrifice Black Lotus: Add three mana of any one color to your mana pool.",
            "type": "Artifact",
            "types": [
                "Artifact"
            ]
        },

From the RethinkDB administrative console, you can run a single HTTP GET request to a JSON file stored within MTGJSON.com. You can then explore the data, which is laid out cleanly in well-formatted JSON. As hinted at in the introduction to ReQL, chaining is where the power of ReQL becomes apparent.

r.table('cards').insert(r.http('http://mtgjson.com/json/LEA.json')).run(conn, callback)  
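One hedge worth noting: insert won't create the table for you, so assume a one-time setup step has been run before the command above - a minimal sketch, with a count() afterwards as a sanity check:

r.tableCreate('cards').run(conn, callback)   // one-time: create the table
r.table('cards').count().run(conn, callback) // after inserting: confirm the documents arrived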

With the table in place, one command has queried the primary JSON dataset over HTTP and inserted the data into cards. On top of this, we can then factor in Changefeeds.

Changefeeds

Real-time responses, often in the form of notifications, are what keep users engaging with an application. Use-cases for active listening are powerful and frequent. Did a user complete an activity that awarded an increased point total? Did the pair of socks they've been eagerly awaiting finally go on sale? Or, in the case of our previous nerdy card game example, is that fancy new card available?

feed = r.table('cards').changes().run(conn)  # an open-ended changefeed cursor
for change in feed:
    print change['new_val']  # the newly written card document

Consider what we've just demonstrated, where a server-side query listens for .changes() to the cards table, then prints out a response when a new card is received. If we combine a Changefeed with an HTTP request to a JSON document...

r.table('cards').changes().filter(
    r.row('old_val').eq(null)  // old_val is null only for freshly inserted cards
)('new_val').run(conn, callback)

... we can establish a clean mechanism for storing, interpreting, then displaying data using primary data as our source. Passive. Powerful. Always up-to-date.
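As a sketch of how the pieces could fit together (the hourly interval and the conflict option here are our own assumptions, not part of the original example), we might periodically re-pull the source while the Changefeed handles the announcing:

// Hypothetical refresher: re-fetch the source JSON every hour.
// 'conflict: update' merges changed documents rather than erroring on
// duplicate primary keys, so unchanged documents shouldn't re-fire the feed.
setInterval(function() {
  r.table('cards')
    .insert(r.http('http://mtgjson.com/json/LEA.json'), {conflict: 'update'})
    .run(conn, function(err, result) {
      if (err) console.error(err);
    });
}, 60 * 60 * 1000);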

Distributed Table Joins

A pain point in the world of document-based database management systems has been the handling of JOIN commands. A lucid way to conceptualize a document database is, as the name implies, to think of a binder full of documents. Each binder (or "collection") is a storage medium holding any number of documents.

{
 "id": "87db07b539528dd70dfb191cb357f7f4",
 "title": "Adventure Time",
 "seasons": [
  {"season_number": 4,
   "episodes": [
    {"episode_within_season": 13,
     "title": "Princess Cookie",
     "episode_id": "8d966d1b4934b0330fd63ac82f575a51",
     "actor_id": "294bab9d9b224ddcce34d24518db47ff",
     "reviews": [{...}]
    }
   ]
  }
 ]
}

In an effort to categorize all the episodes of my favourite TV show, I've made a collection (tv_show) and populated a document with key information: the episode title, the season, reviews, and so on. In fractal-like fashion, you can see how one array (seasons) holds another array (episodes), which in turn holds yet another array (reviews).

After accumulating a set of data, the question to answer is: how will we present this data to our users? If you consider our above example, we may want to explore episode_within_season by season_number. So, a logical path through our data would be:

Title > Season_Number > Episode_within_Season

This use-case is cut and dried. It is an example of the non-relational model of data storage. It's quick and linear. However, if you were to add a new perspective to your site, the inability to join collections of data has you stuck. For example, you now want to include popular voice actors and expand your library of TV shows.

In a typical relational model - requiring more structure and foresight - you would have a table for television shows, a table for voice actors, and in the middle of them a JOIN table allowing you to associate any number of voice actors with any number of television shows. In MongoDB (a purer document store), you'd require a significant amount of hacking to establish these relationships.
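To make "hacking" concrete, here's roughly what that looks like with the MongoDB Node.js driver - two round trips stitched together in application code (the connection URL, collection names, and fields are assumed for illustration):

var MongoClient = require('mongodb').MongoClient;

MongoClient.connect('mongodb://localhost:27017/tv', function(err, db) {
  // First trip: fetch the show document.
  db.collection('tv_show').findOne({ title: 'Adventure Time' }, function(err, show) {
    // Second trip: resolve the referenced actor by hand - the "join" lives here.
    db.collection('voice_actor').findOne({ _id: show.actor_id }, function(err, actor) {
      console.log(show.title, '-', actor.voice_actor);
      db.close();
    });
  });
});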

Now, preamble aside, we can gracefully demonstrate RethinkDB's ability to use JOIN commands to combine tables and automatically distribute the end result. The command will be sent to the appropriate nodes, combine the data, then present you with your linked tables.

To demonstrate joins in an example, we'll simplify our above document to remove the arrays:

{
        "id": "87db07b539528dd70dfb191cb357f7f4",
        "title": "Adventure Time",
        "type": "Comedy",
        "actor_id": "294bab9d9b224ddcce34d24518db47ff"
}

... Now, the tv_show documents simply contain a title, a type, and an actor_id.

{
    "id": "294bab9d9b224ddcce34d24518db47ff",
    "voice_actor": "John DiMaggio",
    "style": "Hilarious"
}

We want to combine this new voice_actor table with our tv_show table, listing the actor and his style with the name of the TV show.

r.table("tv_show").eqJoin("actor_id", r.table("voice_actor"))  

We've now commanded ReQL to join the tv_show and voice_actor tables by taking the actor_id from tv_show and associating it with voice_actor.

{
    "left": {
        "actor_id": "294bab9d9b224ddcce34d24518db47ff",
        "id": "87db07b539528dd70dfb191cb357f7f4",
        "title": "Adventure Time",
        "type": "Comedy"
    },
    "right": {
        "id": "294bab9d9b224ddcce34d24518db47ff",
        "style": "Hilarious",
        "voice_actor": "John DiMaggio"
    }
}

On the left, we have our TV Show. On the right, our voice actor.

r.table("tv_show").eqJoin("actor_id", r.table("voice_actor")).zip()  

By appending one more command to our chain, zip, we'll merge these two documents into a single result.

{
    "actor_id": "294bab9d9b224ddcce34d24518db47ff",
    "id": "294bab9d9b224ddcce34d24518db47ff",
    "style": "Hilarious",
    "title": "Adventure Time",
    "type": "Comedy",
    "voice_actor": "John DiMaggio"
}

Our end result will combine the hilarious John DiMaggio with Adventure Time.

Cherry On Top

Taking a step back from the technical, RethinkDB has a killer feature that you won't see advertised anywhere, yet it hides in plain sight: Rethink boasts incredible documentation. Beyond the great words, you will see the team working transparently, in understated and eloquent fashion. They are accessible and willing to chat within #RethinkDB on Freenode. They participate actively in user groups, and respond to blog comments and GitHub questions, comments, and performance reports.

They have taken the time, effort, and care to plant the seeds for what is becoming a wonderful community to be part of. Not impressed with the performance of Rethink? Have a strange edge-case? Write it up, present it, and watch as your issues are addressed by a member of their team.

As someone taking a gamble on a newer technology, this sort of "doors open" support methodology helps alleviate worry and makes you feel like you could contribute to making their product even better.


MongoDB

Compose started out as MongoHQ. We've seen, and continue to see, beautiful things created with MongoDB as the rock-like foundation. Just like the applications growing on top of it, MongoDB itself continues to evolve and improve.

As the proclaimed platform "For GIANT Ideas", MongoDB is the non-relational fabric to an impressive tapestry full of world-class customers. Here are some unique reasons why MongoDB continues to enjoy growth and success.

Meteoric Up-Starts

Iterate, adapt, pivot! That's the sort of mantra you're likely to hear in start-ups around the world. Non-relational databases like MongoDB give you the benefit of speed and flexibility. Without the need to specify a set-in-stone schema beforehand (one of our Write Stuff articles ponders the should-I-or-shouldn't-I dialogue in schema creation, and we, ourselves, refute the concept of a "schema-less" database), you are free to enjoy the wild flexibility of an adaptive data structure.
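As a tiny illustration of that flexibility (the database, collection, and field names here are invented), two differently shaped documents can happily live in the same collection:

var MongoClient = require('mongodb').MongoClient;

MongoClient.connect('mongodb://localhost:27017/startup', function(err, db) {
  // No migrations, no ALTER TABLE: each document carries its own shape.
  db.collection('ideas').insertMany([
    { name: 'prototype', tags: ['mvp'] },
    { name: 'pivot', owner: 'sam', launched: new Date() }
  ], function(err, result) {
    db.close();
  });
});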

It's simple to get an idea up and running. It gets even simpler when you look at how MongoDB integrates into some of the most powerful (I'd even go as far as magical, in the case that follows) frameworks in development.

{
    "galaxy.meteor.com": {
        "env": {
            "MONGO_URL": "[URL HERE]",
            "MONGO_OPLOG_URL": "[URL HERE]"
        }
    }
}

The above is how you would configure an automatically scaling, production-grade database in the deployment settings file of the Meteor JavaScript application platform. Pair this with Meteor Galaxy, or any other Platform-as-a-Service provider, and one meteor deploy later you've got a cloud-hosted, automatically scaling application. Intrigued? We've published an article about getting this set up with our deployments.

One of the most impressive aspects of Meteor you may enjoy reading about is the minimongo API re-implementation built into Meteor. Not only is Meteor/MongoDB convenient and simple, the promise of a client-side MongoDB API that makes queries feel instantaneous makes it a compelling choice for the modern hacker.
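For a taste of what that looks like (the collection and field names are assumed), the same Mongo-style query runs in the browser against minimongo's in-memory cache, with Meteor syncing the underlying data behind the scenes:

// client.js - executes in the browser against minimongo's local cache.
var Cards = new Mongo.Collection('cards');

// A familiar MongoDB-style query, answered instantly from memory;
// no round trip to the server is needed to render the result.
var rares = Cards.find({ rarity: 'Rare' }).fetch();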

GridFS

A common occurrence in the modern web application is the ability for a user to upload their own content. In the more traditional, relational database methodology, you would likely have had a second server for storing these files. The files would be stored, read, and written "whole". GridFS provides modern answers to the caveats you would have run into - like issues with back-ups, replication, and synchronization. If you're comfortable with the absence of versioning and are willing to trade atomicity to realize these benefits, then this might be a great fit for you.

GridFS takes files greater than 16MB (MongoDB's document size limit) and chops them into 255KB "chunks", storing each chunk in its own document. Files are then tracked across two collections: files (for metadata) and chunks (for the pieces of the file). A default "bucket" name, fs, is prefixed to each collection name, so you wind up with fs.files and fs.chunks.

So, a 25MB file you've uploaded will be broken down into 99 chunks (25,000KB / 255KB ≈ 98.04, so 98 full 255KB chunks plus one final partial chunk), spread over the two bucket collections.

{
   "files_id": ObjectId("987w23d49q93iuye0j4mn77d"),
   "n": NumberInt(97),
   "data": "Audio Data"
}

Each chunk contains a fraction of the data.

Why eviscerate your files this way? Because accessing sections of files becomes a simple and fast operation. Consider a case where you "scroll" through an audio file and need your player to skip to a specific portion of a track. Instead of reading the entire file, the fs.files metadata points you to the appropriate chunk documents. Since each chunk is numbered in a linear sequence, only the small pieces holding your data need to be queried to deliver your content.
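Here's a sketch of the upload-and-retrieve flow using GridFSBucket from newer versions of the MongoDB Node.js driver (the file and database names are placeholders):

var mongodb = require('mongodb');
var fs = require('fs');

mongodb.MongoClient.connect('mongodb://localhost:27017/media', function(err, db) {
  var bucket = new mongodb.GridFSBucket(db); // uses the default 'fs' bucket

  // Upload: the driver slices the stream into 255KB documents in fs.chunks,
  // recording the metadata in fs.files.
  fs.createReadStream('./track.mp3')
    .pipe(bucket.openUploadStream('track.mp3'))
    .on('finish', function() {
      // Download: the numbered chunks are reassembled into a byte stream.
      bucket.openDownloadStreamByName('track.mp3')
        .pipe(fs.createWriteStream('./copy.mp3'))
        .on('finish', function() {
          db.close();
        });
    });
});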

If you're confident you will never store data greater than 16MB, this might not be useful to you. However, the potential is immense.

Powerful Friends

MongoDB is the veteran. As mentioned in the introduction, the 2009 release attracted a significant amount of attention. That attention snowballed the Mongo ecosystem into the expanse of libraries, third-party support, and drivers that we see today. MongoDB is to the modern MEAN (MongoDB, Express, Angular, Node) stack what MySQL is to the LAMP (Linux, Apache, MySQL, PHP) stack.

MongoDB is well supported, with first-party driver support available for the following languages: C, C#, C++, Java, JavaScript (Node.js), Perl, PHP, Python, Ruby, and Scala. Community drivers include Go and Erlang.

Our open source application Transporter can help you easily port MongoDB data into Elasticsearch. We've written up instructions on how you can apply this to your deployments. DigData or Cloud9 provide visual aids for analyzing your data. If you're familiar with socket.io and Mongoose, you can graph your data. For those with a SQL background, SlamData can help you migrate from SQL to MongoDB, or gain insights into your MongoDB data using familiar SQL patterns.

When you're on to something special, your community will help fill in gaps as you grow. So far, that's been the story with MongoDB. If you're looking to build something, odds are there is a neat framework, slick application, or vital piece of documentation that will help you.

Sprouting

The hard-working folks at MongoDB are committed to growing their product. Along with substantial patches implementing major features (like the aforementioned $lookup coming in 3.2), they're hard at work creating software to help their users get the most out of their offerings.

They have created MongoDB Cloud Manager, MongoDB Ops Manager, and offer Enterprise Grade Support. The management software provides elegant, beautiful, and simple GUIs to help analyze and optimize your MongoDB deployments. Support for Enterprise helps your developers squeeze the best performance out of their databases.

With tools and expertise available to those willing to shell out extra dough, MongoDB is looking to validate itself as a powerful choice at larger scale.


In The End...

When comparing two powerful, growing, and well-loved pieces of software, it's difficult to declare one more viable than the other. Both are fast and flexible JSON document stores, and each is uniquely suited to certain use-cases.

The most important part of choosing the right one for your needs is understanding what those needs are - and what you're getting into - before you hit production.

Whether you're using MongoDB or are considering jumping into RethinkDB, we think you're going to love what they can offer you.