Why (and how to) Redis with your MongoDB

Are you loading your database up with work it doesn't really need to be doing? Are you counting things and storing them in their own collections? Are you spending time working out if something was unique just to count the number of unique things you've seen in your data?

If this rings true to you, maybe it's time you met Redis. Redis isn't a replacement for your general purpose database, be it MongoDB or PostgreSQL. It's better to think of it as a memory turbo-boost for your stack. It's a remarkably flexible tool, but more importantly, as it runs entirely in memory (with persistence happening in the background as a secondary function), it's fast. It isn't complex either. Redis presents itself as a key/value store with the ability to handle binary keys and a range of data structures as values.

You can use it as a straight cache, but the real power comes when you use Redis's ability to manipulate the data it holds in memory. The best way to see that is to see some examples.

Count on Redis

Let's talk about some counting. Imagine you have a sales system where you want statistics on how many items have been sold in each of the various stock groups. Let's say each document has a two-part SKU number like "1234-56789", where the first part of the number is the group id, along with a buyer name and a quantity, like this:

> db.mystock.findOne()
{
    "_id" : ObjectId("55c3471b14808a6371e1a358"),
    "sku" : "1234-56789",
    "buyer" : "fred",
    "qty" : 10
}

We can break out MongoDB aggregation to work out the sum of the quantities per SKU group. It'd look something like this:

db.mystock.aggregate([{  
  $project: {
    sku: 1,
    skugroup: {
      $substr: ["$sku", 0, 4]
    },
    qty: 1
  }
}, {
  $group: {
    _id: "$skugroup",
    count: {
      $sum: "$qty"
    }
  }
}])

And that returns us something like this:

{ "_id" : "4321", "count" : 45 }
{ "_id" : "1234", "count" : 159 }

If you haven't seen MongoDB aggregations before, your reaction is likely to be "What?", but rather than going into a deep explanation of aggregations (which are cool), just think of this as a super-query.
And as a super-query, there's a cost to processing it on the database. In our imaginary scenario, that's fine when we only need updated numbers, say, once a day or every few hours, and as long as that list of transactions doesn't get too large. Then, one day, someone says they need real-time updates of the stock groups. The aggregations get run more often, and the already loaded database gets even more loaded.

"Let's stop aggregating" was the next suggestion. Instead another collection, indexed by SKU group, would be updated with a $inc operator. After an insert of a new purchase, we would extract the SKU group and do an upserted updated to increment our SKU group collection of counts. In older versions of MongoDB this would have seen the entire database being locked, now, only the collection would be locked but that could still be a bottle neck.

Let's do this with Redis instead. Integer values are one of the most basic Redis value types, but you'll note there's just one big keyspace. That's OK, because we can impose our own namespace. One Redis convention is using a colon to divide up the namespace, so let's prefix keys with "skugroup:" followed by the SKU group, giving us keys like "skugroup:1234". If the quantity was, say, 8, then the entire Redis command we'd send would be

INCRBY "skugroup:1234" 8  

And that would return the updated value. If the key didn't exist, it would be created, its value set to 0 and then incremented. There's only one problem: if there's a huge number of keys in the Redis server, then the command to work through them, SCAN, could take longer than you'd like. What can you do? Well, another Redis data type is the hash. A hash can hold its own keys and values away from the global Redis keyspace. This is easier to work with, avoids cluttering things, and keeps performance very high. So let's make our key "skugroup" and increment a key/value pair within it...

HINCRBY skugroup 1234 8  

Yes, the command is almost identical and the effects are the same: if skugroup doesn't exist, it will be initialized; if the field 1234 doesn't exist, it'll be created and set to 0; and then it's incremented by 8, with the updated value returned. The HSCAN command can iterate through our hash for us so we can retrieve our values.
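Driving this from application code looks much the same as the CLI. Here's a minimal sketch assuming the redis-py client and a local Redis instance; the record_sale helper is hypothetical:

import redis

r = redis.Redis()

def record_sale(sku, qty):
    skugroup = sku.split("-")[0]
    # HINCRBY creates the hash and the field if needed, then
    # increments the field and returns the updated value.
    return r.hincrby("skugroup", skugroup, qty)

record_sale("1234-56789", 8)
# HGETALL (or HSCAN for very large hashes) pulls back all the counts at once.
print(r.hgetall("skugroup"))    # e.g. {b'1234': b'8'}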

With Redis handling the counts, there's no impact on MongoDB performance, count updating is super-fast, and the results are right up to date.

Queue Redis

With a modern application architecture, you may have a lot of servers all working on different parts of your problem. Distributing the work and keeping things coordinated can be tricky to architect. Got thousands or millions of messages passing around the system? The solution there is obviously to build your applications against an enterprise-grade messaging platform.

But if it's not that complex, say you want to queue up jobs to be run by different applications and allow those applications to pick up those jobs from a server, then a full-on messaging platform may not be something you want to spend your time engineering into your system. Someone might suggest using a database to store a table or collection of jobs to be done, with applications inserting their requests and polling for things to do with a regular query. That sounds fine on paper, as long as you take care of the atomicity of adding and removing items, and the queue ordering, and...

Which is where Redis comes in. In Redis, all operations are atomic, so there's no danger of two systems taking the same item from the work queue. A queue can easily be represented as a Redis list, which comes with a whole range of operations to work with it. You can RPUSH or LPUSH values onto a list and RPOP or LPOP those values off it; a FIFO queue would of course push from the right (RPUSH) and pop from the left (LPOP), or vice versa, while a stack would push and pop from the same side. You can create as many lists as you need to act as queues; they all live in the same keyspace.

Say you want a single leader queue of all jobs:

LPUSH qleader "fold:specoffoldjob"  

You don't need to worry about creating the list; if the key doesn't exist, it's created. Let's put some more jobs in the queue:

LPUSH qleader "mutilate:specofmutilatejob"  
LPUSH qleader "spindle:specofspindlejob"  

Now a follower process is going to come along and pop an item off our queue:

RPOP qleader  

This will return the rightmost item on our list, and our follower process can go and work on the job it has pulled out of the queue. But if our list is empty or doesn't exist, it will return nil and our application will have to work out how to idle. Or it can use BRPOP instead, which blocks when there's nothing to be obtained from the list. Unlike RPOP, BRPOP can also wait on multiple lists, returning the result from the first one to stop being empty, so it's a handy way to wait on multiple queues for the first bit of work to come down the line. But whether polling with RPOP/LPOP or blocking with BRPOP/BLPOP, there's still the possibility that, under load, a message popped from a queue could be lost when a follower restarts or loses network connectivity.
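Here's a minimal sketch of a follower blocking across two queues, assuming the redis-py client; the second queue name, qpriority, is purely illustrative:

import redis

r = redis.Redis()

def follower():
    while True:
        # BRPOP waits up to 5 seconds across both lists and returns
        # (queue_name, item) from whichever has work first, or None on timeout.
        popped = r.brpop(["qpriority", "qleader"], timeout=5)
        if popped is None:
            continue    # nothing to do yet, keep waiting
        queue, job = popped
        print(f"got {job!r} from {queue!r}")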

Enter the command that's just fun to say: RPOPLPUSH. This command, as the name suggests, pops an item from one list and pushes it onto another list, all in one atomic operation, returning the item that's being popped and pushed. This is recommended by the Redis docs for implementing a reliable queue. In our example above, we could maintain a list called qfollower1 and RPOPLPUSH our data into that...

RPOPLPUSH qleader qfollower1  

We can now work on the item that came from the queue and, when we're done, use the LREM command to remove that work from the qfollower1 list. If we fail to complete work we've taken on, we'll find it in a non-empty qfollower1. It's simple, and for a whole class of applications it may be all that's needed for effective messaging; that and a good way to serialize your job information for queuing.
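Put together, a follower built on this pattern might look like the sketch below, again assuming the redis-py client; process_job stands in for whatever work the follower actually does:

import redis

r = redis.Redis()

def process_job(job):
    pass    # do the actual work here

def reliable_follower():
    # Atomically move a job from the leader queue onto this follower's
    # in-progress list, so it isn't lost if we crash mid-job.
    job = r.rpoplpush("qleader", "qfollower1")
    if job is None:
        return    # nothing queued right now
    process_job(job)
    # Only once the work is complete do we remove it from the in-progress list.
    r.lrem("qfollower1", 1, job)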

Remember this is just another thing you can do with Redis in your application toolbox.

Going Hyper

Some things Redis does are very specialized, yet enormously useful. Consider the Redis HyperLogLog. Trying to work out a unique count of things your application has seen - be it logs, clients, sessions or something else - can be hard on your memory. Traditionally, you'd check whether you had previously counted an item by traversing a store of all the previously seen items, and if you hadn't, add the item to the store and bump the count of uniques up by one. That costs memory. In Redis, you could have added each item to a set and taken the cardinality of that set, but again, memory.

Enter the HyperLogLog, from a paper by Philippe Flajolet and others, which describes a near-optimal algorithm for working out the cardinality of a set of data. It's some wonderful work, but what's equally wonderful is that the Redis developers took that research and made it real. So there's the Redis HyperLogLog, a data structure which takes at most 12KB of memory and can work out cardinality. Say you have usernames coming in from your login system and other applications. If you wanted a unique count for the day, for each username you get you'd call:

PFADD usercount:<dayofyear> <username>  

Where <dayofyear> is a day count and <username> is the name. It returns 1 if the cardinality count went up, 0 if it stayed the same. Here comes the caveat: that count will be an estimate with a standard error of 0.81%. If that isn't exact enough counting for you, you'll want to use traditional counting and pay the associated memory penalty.

If it is good enough for you, keep calling code like that and when you want the unique count for a day, it's just:

PFCOUNT usercount:<dayofyear>  

That gives the HyperLogLog's estimated count of unique values. But there's more. Let's make a quick, simple example...

> pfadd usercount:101 bill grahame tim
(integer) 1
> pfcount usercount:101
(integer) 3

Three unique users on day 101...

> pfadd usercount:102 bill grahame tim john paul 
(integer) 1
> pfcount usercount:102
(integer) 5
> pfadd usercount:103 bill ringo 
(integer) 1
> pfcount usercount:103
(integer) 2

And 5 on day 102, and 2 on day 103. But what if we want to know the unique count across days 101, 102 and 103? One of the HyperLogLog's attributes is that it's easy to merge, so we can just ask:

> pfcount usercount:101 usercount:102 usercount:103
(integer) 6

This gives us our unique count over the three days. If the derived value is something we're likely to want to come back to, we can merge the HyperLogLogs into a new key for future use:

pfmerge threedays usercount:101 usercount:102 usercount:103  

which we can also get the PFCOUNT of:

> pfcount threedays
(integer) 6

We could, for example, create merges of whole weeks and months to make traversing the data at different resolutions easy.
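For instance, here's a minimal sketch of rolling a set of daily counts up into a single weekly key, assuming the redis-py client; the week-based key naming is purely illustrative:

import redis

r = redis.Redis()

def merge_week(week_number, days):
    # e.g. merge_week(15, [101, 102, 103, 104, 105, 106, 107])
    daily_keys = [f"usercount:{day}" for day in days]
    # PFMERGE writes the merged HyperLogLog to a new key we can reuse later.
    r.pfmerge(f"usercount:week:{week_number}", *daily_keys)
    return r.pfcount(f"usercount:week:{week_number}")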

Of course, these examples use minuscule quantities of data; the HyperLogLog can handle thousands or millions of records. As long as you're happy with that error margin, you'll find it's a low-cost way of accumulating statistics.

Ready for Redis

One of the great things about Compose is that your own production-ready Redis deployment is just a couple of clicks away. You can provision one, start experimenting with and understanding the power of the in-memory database, and see where it could remove bottlenecks in your application stack's architecture.

If you're not a Compose customer, why not try a 30-day free trial of Redis and all our other databases to help you find the perfect database foundation for your application stack.