RethinkDB Joinery

One of the great things about RethinkDB is that it has join functionality baked in as part of the query engine. This is, compared to MongoDB, where the "lookup" function has been added to the aggregation framework, a much more useful capability which gives a lot more flexibility in designing your data models. That said, there's some pretty important things to bear in mind when you start joining in RethinkDB and top of that list is...

The fastest join is eqJoin

If you are associating documents in one RethinkDB table with documents in another then the most efficient way is for the document on the left hand side of the association to refer, by id, to the document on the right hand side. That's because the id field for any document is indexed by default, so it's faster to look up. Looking up the value also, by definition, means it's equal. That's what eqJoin (or eq_join if you are working in Python) does, an "equals join".

Let's work with a solid example; we're going to be using JavaScript and Node.js 6 for these examples. In the Github repository for this article is a node program called populate.js. It assumes you've creates a database called spystuff and two tables, agents and orgs. When you run it, it'll insert organization records that look like this:

{
    "org": "MI6",
    "alignment": {
        country: "UK",
        "side": "west",
    }
}

into the orgs table, get all the id's of those orgs and then update the agents data which looks like this:

{
    "name": "James Bond",
    "org": "MI6",
    "skill": ["assassination"]
}

to include the appropriate organization id numbers (as "org_id"), remove the "org" field and insert the result into the agents table.

Now, we're set up with a problem joining the tables is made for. We want to get all the agents' data with their organization data in the same document. This is where we'll use the eqJoin function. The orgs table is already indexed by id and if we look the command at the core of eqjoin.js we find this:

r.table("agents").eqJoin("org_id", r.table("orgs"))  

This starts with the agents table and applies an eqJoin to it, telling it to use the agents' org_id field and look it up in the orgs table. It'll default to using the tables primary key.
That command gives records back like this:

  {
    "left": {
      "id": "58662b62-6a09-422b-88ad-c4acaabaa29b",
      "name": "John Drake",
      "org_id": "192711a9-f73b-40f4-886e-07b9a188c47e",
      "skill": [
        "investigation"
      ]
    },
    "right": {
      "alignment": {
        "country": "UK",
        "side": "west"
      },
      "id": "192711a9-f73b-40f4-886e-07b9a188c47e",
      "org": "M9"
    }
  }

Yes, as it comes out of the join commands, the left and the right side are still kept separate. To take care of this, RethinkDB recommends the zip function. If we add a .zip() to our query, like so:

r.table("agents").eqJoin("org_id", r.table("orgs")).zip()  

We get this:

   {
    "alignment": {
      "country": "UK",
      "side": "west"
    },
    "id": "192711a9-f73b-40f4-886e-07b9a188c47e",
    "name": "John Drake",
    "org": "M9",
    "org_id": "192711a9-f73b-40f4-886e-07b9a188c47e",
    "skill": [
      "investigation"
    ]
  }

Which looks great until you look a little closer and notice that the id from the organization document has wiped out the id belonging to the agent document. Not a problem, as there's the without function too and that can get rid of that right hand side id field with .without({"right": {"id":true}}) like so:

r.table("agents").eqJoin("org_id", r.table("orgs")).without({"right": {"id":true}}).zip()  

and now we get:

   {
    "alignment": {
      "country": "UK",
      "side": "west"
    },
    "id": "58662b62-6a09-422b-88ad-c4acaabaa29b",
    "name": "John Drake",
    "org": "M9",
    "org_id": "192711a9-f73b-40f4-886e-07b9a188c47e",
    "skill": [
      "investigation"
    ]
  }

Even though we were deleting the id field on the right hand side, we've retained the organization id field which we were joining on. The RethinkDB documentation on joins shows a couple of other ways you could mitigate this overwriting, but this is the simplest way for simple eqJoins.

Indexes and eqJoin

The eqJoin function is actually very simple at it's core. It moves through the left hand table using the specified field and simply looks up the value in the index it's been given on the right hand side. By default, that's the id field as that is the primary key and index on the right hand side. But it doesn't have to be that index. You can point at any index that exists for the right hand side and as long as there are values there that match up with the left hand side values, you'll get results for that query.

Let's add a table of assets to our spystuff database - you'll find the code for this in populate2.js in the repository. It adds records like this:

  {
    "type": "Black Helicopter",
    "use": [ "stealth", "investigation" ],
    "designer": "UN"
  }

First, let's unite these assets with their organizations. Let's assume that if a country designed an asset, then organizations in that country can use that asset. We'll need to create an index on the designer field of the assets which we can use with eqjoin. We'll do that in the populate2.js file with:

r.table("assets").indexCreate("designer")  

Now we can do our join - you'll find it in eqjoinindex.js.

r.table("orgs").eqJoin(  
             r.row("alignment")("country"), 
             r.table("assets"),{ index:"designer"}
         ).without({"right": { "id": true }})
         .zip() 

So, taking that step by step, we start with the orgs table and we apply an eqJoin to it. The field we want to join on isn't a top level field so we pass r.row("alignment")("country") so we can access it. We then tell eqJoin we want to join with the assets table. Here's the new bit; in the last options parameter, we pass a { index:"designer" } to tell eqJoin to use that index to lookup on, so we're now joining the alignment.country of organizations with the designer of assets which gets us records like this:

 {
    "alignment": {
      "country": "USA",
      "side": "west"
    },
    "designer": "USA",
    "id": "938201f0-6a24-4f0a-91ee-cc1751df23a4",
    "org": "CIA",
    "type": "Laser Pen",
    "use": [
      "management",
      "combat"
    ]
  }

Now we can see the CIA has access to laser pens. It's also quite a good example of why you may not want to zip records at all. Let's show another aspect of this secondary index joining; multi-indexes. Those are indexes where the field being indexed is an array of values; when the indexer is told to index, it indexes the record for each one of these values. So how can we use that?

Say we want to match our agent's primary skills with the assets they can use. We'll want to index that "use" field first. The example code does just that with r.table("assets").indexCreate("use",{ multi: true } ). The multi:true part lets the index work with the array as discrete values.

With that index in place, let's make a join query:

r.table('agents').eqJoin(r.row("skill")(0),r.table("assets"),{ index:"use" }).zip()  

There's a whole lot of things happening here. The r.row("skill")(0) is referring to the first value in the array of values in the "skill" field. This is closer to being a function than a reference, and it is worth noting that eqJoin can take a function to create the value to match with. We point at the "assets" table as the right hand side and we telling it to use the index we created with { index:"use" }. There is another other option by the way; "ordered" which when set to true will sort according to the left hand side's input - we're just not using it here. Anyway now the effect of this is to make it seem that when the first skill of an agent is present in the array of "use" in the asset document, the two documents will be joined and we've added a zip to merge the fields to get something like:

  {
    "designer": "Global",
    "id": "cd8aefc6-5442-499d-84bc-9fb85172b6f8",
    "name": "Chuck Bartowski",
    "org_id": "11a662d3-0477-4c96-a0d4-3ceebc0c29a4",
    "skill": [
      "investigation",
      "stealth"
    ],
    "type": "Microdrone",
    "use": [
      "investigation",
      "stealth",
      "assassination"
    ]
  }

We could do another eqJoin against the orgs table - three way joins are easy enough - but that's demonstrated the flexibility of the eqJoin function.

What of inner_join and outer_join?

There are other join functions - innerJoin and outerJoin – but they are slower and less efficient than eqJoin. Both use a function which evaluates true or false. That means though that there's no scanning of the left hand side and index lookups for the right hand side - it's all scanning and evaluating the function for the right hand side. So it's slower. On the up side, if it's a join you want to do that isn't based on a simple equality of fields, these are the functions you are looking for. We could do something similar to the previous eqJoin command, without the index like so:

r.table('agents').innerJoin(r.table('assets'),(agrow,asrow) => { return agrow("skill").setIntersection(asrow("use")).count().gt(1); } )  

What we are doing here is a set intersection between the agents skill and the assets use arrays and returning true if two or more items are in the intersection which turns out to be one agent with two assets. Powerful, but you'll take a hit in terms of performance. Remember these tiny tables we're using are living in the cache, probably the processor cache even - when scaled up, you could really pay the price in performance.

Join power

So we've looked at RethinkDB's join functions and as you can see they deliver what we typically need from a join function; a simple binding between records based on the equality of fields. It's simple, quick and clear.