DigData on Implementing SQL for MongoDB

This is a guest post from full-stack engineer Alon Horev on how to get more from your MongoDB deployment with SQL-like queries and very cool visualizations via DigData MQL.

Prologue

NoSQL databases started off by solving the hard problems that come with scalability and failover, leaving out features that are hard to scale like joins and transactions.

With new distributed architectures for storing data, data analysis had to become distributed as well, ushering in MapReduce and pipeline-driven languages like PIG for Hadoop or MongoDB’s aggregation framework.

Even though NoSQL databases gained a lot of popularity, many business users are struggling with steep learning curves and lack of analytical capabilities. Some database vendors tackled these issues by introducing SQL-like query languages like Cassandra’s CQL and Couchbase’s N1QL.

Where the products lacked capabilities, the community closed the gap with SQL engines like Hive and Spark for Hadoop/Cassandra.

This post discusses the need and implementation of an SQL engine for MongoDB.

To SQL or not to SQL...

Why SQL is a good match for MongoDB

Developers are usually fond of MongoDB’s query language, it’s easier to code queries with JSON vs. formatting SQL strings.

However, many users running one-off queries struggle with the syntax and complain about missing features like joins and subqueries.

Even an experienced MongoDB user would prefer writing age < 30 or age > 40 over. {$or: [{age: {$lt: 30}}, {age: {$gt: 40}}]}.

Why SQL is not a good match for MongoDB

Verdict

ANSI SQL isn’t a good match for MongoDB but some SQL features could definitely be useful for MongoDB users.

MQL - A MongoDB Query Language For Humans

MQL is a SQL-like query language that gives you the productivity of SQL with the performance of native MongoDB queries. MQL exposes most of MongoDB’s features and adds analytical capabilities like joins and subqueries.

MQL is a proprietary language designed by my company DigData and the MQL server is the heart of our analytics platform.

Here’s an example of an MQL query:

select value * volume as product, symbol  
from db.stocks  
where symbol in ("AAPL", "GOOG") and product > 100;  

Here’s how the query is converted to a MongoDB aggregation framework query:

db.stocks.aggregate([  
  {"$match": {"symbol": {"$in": ["AAPL", "GOOG"]}}},
  {"$project": {"_id": 0, "product": {"$multiply": ["$value", "$volume"]},"symbol": "$symbol"}},
  {"$match": {"product": {"$gt": 100}}}
]) 

The MQL query is almost half the length of the MongoDB query! How’s that for productivity?

How MQL works

The MQL server converts each MQL query to one or more MongoDB queries. Specifically joins and some types of subqueries are broken into several queries because they are not supported in MongoDB.

The following query uses a subquery to return all users whose ages are above the average:

select name, age from test.users where age > (select avg(age) from test.users);  

The MQL server will first invoke a query to calculate the average:

db.users.aggregate([{"$group": {"_id": null, "avg(age)": {"$avg": "$age"}}}])  

After getting the result it would generate this query:

db.users.find({"age": {"$gt": 32.5}}, {"name": 1, "age": 1})  

The MQL server uses several types of MongoDB queries: aggregate, count, distinct and find.

Have a look at a count query:

select count(*) from db.users where age between 18 and 120;  

It’s converted to the following count command which is a lot more efficient than aggregate since it can use indices without iterating over the documents:

db.users.count({"age": {"$gte": 18, "$lte": 120}})  

How does MQL handle JSON?

Since SQL is all about tables and MongoDB is based on free form JSON, MQL lets you dig into complex documents. Lets look at an example blog post collection:

[
  {_id: 1,
   title: "DigData on implementing SQL for MongoDB",
   comments: 
   [
     {"author": "mickey",
      "text": "The first comment"},
     {"author": "ray",
      "text": "Another comment"}
    ]
  },
  {_id: 2,
   title: "Grouping efficiently with MongoDB",
   comments:
   [
     {"author": "mickey",
      "text": "Me again.."}
   ]
  }
]

In order to get the number of comments per author we would have to ‘unwind’ the collection on the comments field and group by the comment.author field. Here’s how you do it with MQL:

select comments.author as author, count(*) from test.posts unwind on comments group by author;  

This query is converted to the following MongoDB query:

db.posts.aggregate([  
  {"$unwind": "$comments"},
  {"$group": {"_id": {"author": "$comments.author"},"count(*)": {"$sum": 1}}},
  {"$project": {"_id": 0,"author": "$_id.author","count(*)": "$count(*)"}}
])

And here are the results:

[
  {
    "author": "ray",
    "count(*)": 1
  },
  {
    "author": "mickey",
    "count(*)": 2
  }
]

Analytics made simple with Compose and DigData

DigData’s analytics platform lets MongoDB users write MQL queries, create beautiful visualizations and share custom reports with anyone.

Compose and DigData are a great match. Databases hosted on Compose can be connected to DigData’s cloud platform in less than a minute. Let’s see how:

Case study: analyzing real estate data

Let’s analyze a real estate dataset for example, this is a sample document of a property sale:

We’ll start by looking at the number of properties per number of bedrooms:

Next thing we’ll do is look at the correlation between the size of the apartment and its price:

A scatter plot easily reveals a trend and makes it easy to spot outliers: oddly cheap or expensive properties.

Finally, we’ll explore these properties on a map. Each property is colored by the number of beds it has. The size (radius) of each marker is determined by its relative price, so expensive properties have larger markers.

Final Words

As the former American Librarian, Chief of Personnel at The New York Public Library, Rutherford D. Rogers once said: “We’re drowning in information and starving for knowledge.” This quote emphasizes the challenging process of turning data into knowledge. By writing queries in a SQL-like syntax that most developers and non-developers are familiar with, it’s easier than ever to work with the data inside your MongoDB deployment. It helps users convert data to knowledge easily.

We believe any user should be able to convert data to knowledge easily.

Do you have MongoDB users in your organisation that struggle with analysis and reporting? Give them the tools and freedom to analyse their data and they’ll be smarter, more productive and happier. Both DigData and Compose are free to try.

For additional features and information visit http://digdata.io.

Alon Horev is a full stack developer and MongoDB expert. He founded DigData to bring analysts closer to their data. You can find him on Twitter as @alonhorev.