Deeper into RethinkDB 2.3 - Fold in Use

Earlier this year, RethinkDB released their Fantasia 2.3 version. In this article, we're going to take a closer look at one of the lesser-known features that came out with that release - the aggregation fold command.

Background

You can find the official fold API reference documentation here. That link will take you to the javascript examples, but you can also toggle to python, ruby, and java examples as well. You'll note below that our example uses ES6 arrows functions to simplify the anonymous function return.

The fold aggregation operates similar to reduce in some ways (processing two inputs at a time from a sequence to produce a third value until all the elements in the sequence are reduced to a single value) and to the transformation command concatMap in some ways (with the ability to concatenate and map to a new output sequence), but it does a bit more than both of them. The fold command processes sequences in order and allows for a base value to be passed in, which neither of the other two commands support. Additionally, it provides an emit function for each item in the sequence being processed.

Below we'll look at a simple, yet useful, example of how you can use the fold aggregation to calculate a year-to-date monthly cumulative revenue.

The data

For our example, we have a table named "revenue_by_month". We've made this simplistic for the sake of this article, but you can apply the same technique we describe below on any values in your tables where you want to calculate a running cumulative total.

The documents in our table have three fields: id, month and revenue. Here's an insert command for one document:

r.table('revenue_by_month').insert(  
  {
    id: 1,
    month: "2015-01-01",
    revenue: 89750
  }
)

Let's have a look at all the documents:

r.table("revenue_by_month").orderBy("month")  

We've got 12 months of revenue data for 2015 for Example Co., as follows:

id  | date       | revenue  
-------------------------
1   | 2015-01-01 | 89750  
2   | 2015-02-01 | 100327  
3   | 2015-03-01 | 96709  
4   | 2015-04-01 | 112835  
5   | 2015-05-01 | 125786  
6   | 2015-06-01 | 124294  
7   | 2015-07-01 | 126734  
8   | 2015-08-01 | 125941  
9   | 2015-09-01 | 146223  
10  | 2015-10-01 | 159125  
11  | 2015-11-01 | 123107  
12  | 2015-12-01 | 138390  

fold

Let's first look at the primary purpose of fold - to process a sequence of items in order, one after another. For this, you can use several of the existing functions... any function that takes two inputs and produces an outcome. For our example, we'll use add. What we want to do is to step through each revenue value to add the value of the current row to the value of the previous outcome. What we'll have at the end of the process is a sum of all the values. Let's have a look at how we'll write the command and then review each of the parts that make it work:

r.table("revenue_by_month").orderBy("month")  
  .fold(0, (prev, cur) => prev.add(cur("revenue")))

In the command above, we're first referencing our table "revenue_by_month" and then selecting the documents from it in order by month - exactly as we showed in the section above to retrieve all the documents in the table. In our case, we could just as easily order by id since our documents are in order that way as well, but the example demonstrates that you could put the documents in any order you choose based on one or more fields. The fold command will process the input according to the order you specify.

Next, we're calling fold and passing in a base value of 0. If our fiscal year for some reason included December 2014 and we knew that value, we could pass that in as the base instead. For our purposes here, we want to just get the cumulative revenue for 2015 so we want to start at 0.

In the next part of the command we're naming two variables, "prev" for the first input to the process and "cur" for the current row's input.

Finally, we're telling fold what we want it to output. In this case, we're telling it to add the "prev" value to the "cur" value using the "revenue" field.

What happens is this: as the first step, our base value of 0 comes in as "prev". That gets added to the January revenue, the "cur" value of 89750. We now have 89750. It becomes the new "prev" and we add to it the February revenue value of 100327. We now have 190007. It becomes the new "prev"... and so on. From this, we'll get the sum of the 12 months of revenue values:

1469221  

So, why not just use the sum aggregation? Yes, of course we could, but if we were doing something other than add where the order of the processing of the values was important to the outcome or where being able to pass in a base value that was not part of the dataset was required, then you can see how fold sets itself apart from reduce and concatMap. For our example, the beauty of what we want to achieve - where those benefits become apparent - actually comes with the emit function of fold. We'll look at that next.

emit

emit outputs an array where each element represents one step in the process. Let's look at that function:

r.table("revenue_by_month").orderBy("month")  
  .fold(0, (prev, cur) => prev.add(cur("revenue")),
    {emit: (prev, cur, ytd) => [ytd]})

Now we've called the emit function and we've specified three variables (it requires three so even if you don't want all three, you still need to specify them). For us, that's the "prev" and "cur" we used for processing the sequence and now we've added in a variable for the outcome of each of those variables being added together called "ytd".

For our emit we're not doing anything much fancy. We're simply outputting the "ytd" array, which will show us the value for each step of the process. In the API reference, you'll see the examples use the branch function which applies an "if... then... else..." logic to the output, but for our example, our outcome does not require any advanced logic to be applied.

Here's what emit returns to us for the "ytd" array:

element | value  
-----------------
1       | 89750  
2       | 190077  
3       | 286786  
4       | 399621  
5       | 525407  
6       | 649701  
7       | 776435  
8       | 902376  
9       | 1048599  
10      | 1207724  
11      | 1330831  
12      | 1469221  

As you can see, we get the year-to-date total for each month. So, at the end of April, the year-to-date revenue was 399621. Ah, yes... that's what fold can do for us in this example!

Next steps

With the "ytd" array, then, as an output of fold, we could choose to perform any other functions on it by applying a then and a do. The array is now open for additional processing. That's outside the scope of this article, however, but we hope we've made the fold command a little less esoteric for you in this article.

If you deploy RethinkDB this month (July 2016), you can get a limited edition t-shirt!

Image by: Michael Gaida