Deeper into 3.2 – Lookup

One of the most anticipated features of MongoDB 3.2 is lookup. With the arrival of MongoDB 3.2 on the Compose platform, it's a good time to look at lookup and other features. Lookup brings the ability while processing one collection in the aggregation pipeline to "lookup" other documents in another collection and it has the potential to change how people organize their analytics on MongoDB.

Over in the world of relational databases normalization, the splitting up of the data into related tables, has been the staple of creating well-formed schema. Relational databases could use a JOIN operator to get the database to map related tables into one or more rows at query time. There's a range of JOIN types which can be achieved and it's one of the real strengths of a relational database.

The first big wave of NoSQL databases eschewed the relational model though. The document model was built around keeping all the relevant data in a document and denormalization was the watchword. As time went on, the ability to have references to other documents in other collections was offered. But acting on those references was something for the application developer to do, not the database. With this also came complications and many developers opted for using business logic to locate related records. For most application developers and applications, this was a reasonable compromise.

Then analytics came to the fore for NoSQL databases. Analytics technologies leveraged the power of SQL and JOIN to create rich datasets within the database to crunch down for insights. Now, lacking this option, there were a number of alternatives. You could denormalize your data more, by copying related documents into the collections you were planning to analyze or you could turn to vendors who'd created SQL front ends to MongoDB so that analytics tools could work. Both had advantages and disadvantages.

Within MongoDB there's the aggregation framework, the database's own way of bulk processing data from a collection, and it's there that MongoDB Inc looked to solve this problem. If you use MongoDB and still haven't looked at the Aggregation framework, run, don't walk, to read about it. It lets you filter, project, group and sort your data. Let's give an example. Say we have a game and users can purchase virtual items:

mongos> db.purchases.find().pretty()  
{
    "_id" : ObjectId("568cfecf290d89f9d29fd95e"),
    "buyer" : "bill",
    "sku" : "abc5"
}
{
    "_id" : ObjectId("568cfed6290d89f9d29fd95f"),
    "buyer" : "fred",
    "sku" : "abc5"
}
{
    "_id" : ObjectId("568cfedd290d89f9d29fd960"),
    "buyer" : "jonah",
    "sku" : "abc5"
}
{
    "_id" : ObjectId("568cfee5290d89f9d29fd961"),
    "buyer" : "jonah",
    "sku" : "xyz10"
}

Now say we want a count per sku of purchasers. Let's use the aggregation framework:

db.purchases.aggregate([{  
  $group: {
    _id: "$sku",
    total: {
      $sum: 1
    }
  }
}]

We tell the aggregation pipeline to group the purchases by our stock control code (sku). For each record with a particular stock code, we update an accumulator by adding one to it. When we run that:

mongos> db.purchases.aggregate( [  { $group: { _id: "$sku", total: { $sum: 1 } } }] )  
{ "_id" : "xyz10", "total" : 1 }
{ "_id" : "abc5", "total" : 3 }

We now have the number of purchases per stock code. There's more to products than just a stock code though and we have another collection called products for that:

mongos> db.products.find().pretty()  
{
    "_id" : ObjectId("568d014e290d89f9d29fd962"),
    "name" : "Laserblaster",
    "price" : 10,
    "stockcode" : "xyz10"
}
{
    "_id" : ObjectId("568d0170290d89f9d29fd963"),
    "name" : "Firestorm",
    "price" : 5.75,
    "stockcode" : "abc5"
}

This collection holds the prices and names. If we want to add that data to our aggregation, here's where we can make use of $lookup as another stage:

db.purchases.aggregate([{  
  $group: {
    _id: "$sku",
    total: {
      $sum: 1
    }
  }
}, {
  $lookup: {
    from: "products",
    localField: "_id",
    foreignField: "stockcode",
    as: "sku_details"
  }
}])

We've added a $lookup stage to our pipeline. We want it to lookup items in the products collection so we set the from value to "products". Recall that the $group stage gave us an _id that was the SKU number; that will be the localField value, the value we want to lookup in the "from" collection. We then need to say where we want this to be looked up. In products there's a stockcode field which we want to match with; this is the value we set foreignField to.

Finally, we need to say where to put any and all documents that match between the _id and stockcode; $lookup creates an array of those documents and inserts that into each document we are processing. All we need to say is the name of the field they are inserted as; we set as to "sku_details". Running this gives us...

mongos> db.purchases.aggregate( [ { $group: { _id: "$sku", total: { $sum: 1 } } }, { $lookup: { from: "products", localField: "_id", foreignField: "stockcode", as: "sku_details" } } ] ).pretty()  
{
    "_id" : "xyz10",
    "total" : 1,
    "sku_details" : [
        {
            "_id" : ObjectId("568d014e290d89f9d29fd962"),
            "name" : "Laserblaster",
            "price" : 10,
            "stockcode" : "xyz10"
        }
    ]
}
{
    "_id" : "abc5",
    "total" : 3,
    "sku_details" : [
        {
            "_id" : ObjectId("568d0170290d89f9d29fd963"),
            "name" : "Firestorm",
            "price" : 5.75,
            "stockcode" : "abc5"
        }
    ]
}

Giving us an array with the product details. Now, we could use the $unwind aggregation operator to flatten the array out, but let's do that exercise with a different aggregation. Let's flip the lookup around so we can get a list of buyers for each sku. So we're going to aggregate on products and $lookup from purchases:

mongos> db.products.aggregate( [ { $lookup: { from:"purchases",localField:"stockcode",foreignField:"sku",as:"buyers" } } ] ).pretty()  
{
    "_id" : ObjectId("568d014e290d89f9d29fd962"),
    "name" : "Laserblaster",
    "price" : 10,
    "stockcode" : "xyz10",
    "buyers" : [
        {
            "_id" : ObjectId("568cfee5290d89f9d29fd961"),
            "buyer" : "jonah",
            "sku" : "xyz10"
        }
    ]
}
{
    "_id" : ObjectId("568d0170290d89f9d29fd963"),
    "name" : "Firestorm",
    "price" : 5.75,
    "stockcode" : "abc5",
    "buyers" : [
        {
            "_id" : ObjectId("568cfecf290d89f9d29fd95e"),
            "buyer" : "bill",
            "sku" : "abc5"
        },
        {
            "_id" : ObjectId("568cfed6290d89f9d29fd95f"),
            "buyer" : "fred",
            "sku" : "abc5"
        },
        {
            "_id" : ObjectId("568cfedd290d89f9d29fd960"),
            "buyer" : "jonah",
            "sku" : "abc5"
        }
    ]
}

Now we have each product with an array of buyers attached to them. Now, the $unwind operator can be pointed at that array and for each document in the array, create a new document which has one object that contains that array document. And all that takes is for us to add a stage { $unwind:"$buyers" }

mongos> db.products.aggregate( [ { $lookup: { from:"purchases",localField:"stockcode",foreignField:"sku",as:"buyers" } }, { $unwind:"$buyers" } ] )  
{ "_id" : ObjectId("568d014e290d89f9d29fd962"), "name" : "Laserblaster", "price" : 10, "stockcode" : "xyz10", "buyers" : { "_id" : ObjectId("568cfee5290d89f9d29fd961"), "buyer" : "jonah", "sku" : "xyz10" } }
{ "_id" : ObjectId("568d0170290d89f9d29fd963"), "name" : "Firestorm", "price" : 5.75, "stockcode" : "abc5", "buyers" : { "_id" : ObjectId("568cfecf290d89f9d29fd95e"), "buyer" : "bill", "sku" : "abc5" } }
{ "_id" : ObjectId("568d0170290d89f9d29fd963"), "name" : "Firestorm", "price" : 5.75, "stockcode" : "abc5", "buyers" : { "_id" : ObjectId("568cfed6290d89f9d29fd95f"), "buyer" : "fred", "sku" : "abc5" } }
{ "_id" : ObjectId("568d0170290d89f9d29fd963"), "name" : "Firestorm", "price" : 5.75, "stockcode" : "abc5", "buyers" : { "_id" : ObjectId("568cfedd290d89f9d29fd960"), "buyer" : "jonah", "sku" : "abc5" } }

There's still a lot of noise in that data, what with two _id fields and the sku repeated in the buyers object. Not a problem though. All we need to do is add a $project stage to project the items of data we do want into new clean documents. Here's the one we'll add...

{
  $project: {
    "_id": 0,
    "name": 1,
    "stockcode": 1,
    "price": 1,
    "buyer": "$buyers.buyer"
}

We specifically drop the leading _id field, by giving it a 0/false value, then explicitly include name, stockcode and price. The only thing, in this example, we want from the buyers object is the buyer field so we'll use $project's ability to create new fields using data extracted from other fields and set buyer to $buyers.buyer. We're ready to run now:

mongos> db.products.aggregate( [ { $lookup: { from:"purchases",localField:"stockcode",foreignField:"sku",as:"buyers" } }, { $unwind:"$buyers" }, { $project: { "_id": 0, "name": 1, "stockcode":1, "price":1, "buyer": "$buyers.buyer" } }  ] )  
{ "name" : "Laserblaster", "price" : 10, "stockcode" : "xyz10", "buyer" : "jonah" }
{ "name" : "Firestorm", "price" : 5.75, "stockcode" : "abc5", "buyer" : "bill" }
{ "name" : "Firestorm", "price" : 5.75, "stockcode" : "abc5", "buyer" : "fred" }
{ "name" : "Firestorm", "price" : 5.75, "stockcode" : "abc5", "buyer" : "jonah" }

So now we've merged data using lookup and have got to a point where we have clean, flat documents ready for further processing. We'll leave you to discover how you can use $lookup for your aggregations.

With the addition of $lookup to aggregation, this feature of MongoDB is more useful than ever before. But it's not the only enhancement to the aggregation framework. We'll have a look at that in the next part of this series.