How to Handle Tagged Data with Redis Sets


It's easy to forget that Redis is more than just an in-memory cache. Sets are a powerful data structure in Redis. Learn how to use them to your advantage.

Tagging records with a variety of labels is great, but it can make for a dataset that's tricky to handle. In this article, we'll show how you can use Redis and its set datatype to work with tags quickly and easily. We'll also share some commands for tasks that can be tricky with a more relational setup, such as finding the tags that two records have in common.

The project I'm working on includes links to some TED talks from an open data set that you can find on Kaggle. I really only want to work with the best-known talks, so I've cached the talks with more than 5 million views each (there are 99 of them, so nice and small) into Redis. By duplicating the data into Redis I can quickly access it, looking up more of the talk details later if appropriate.

Sets in Redis

Sets are rarely talked about, but they and their sister datatype, sorted sets, are very powerful. A set has a key (I've used the talk title) and then any number of values. The values aren't ordered, and they can't contain duplicates.

For example, one of my favourite TED talks is by the late great Hans Rosling (if you haven't seen it, go watch it now; the content there is more important than anything I can tell you about Redis. Really). Each talk, including this one, is stored in Redis with its tags as a set. The key is talk:Hans Rosling: The best stats you've ever seen. To see the values in this set, use the SMEMBERS command:

sl-us-south-1-portal.16.dblayer.com:29478> SMEMBERS "talk:Hans Rosling: The best stats you've ever seen"  
 1) "Asia"
 2) "health"
 3) "Google"
 4) "demo"
 5) "economics"
 6) "global development"
 7) "Africa"
 8) "math"
 9) "statistics"
10) "global issues"  
11) "visualizations"  

Adding values to the set is done with the SADD command. One feature of the set datatype is that it doesn't allow duplicate values, so I can add a new tag and get a response of 1, but if I try to add a tag that is already in the set, I'll get a 0 response instead. Let's see an example of this, and the updated set of tags:

sl-us-south-1-portal.16.dblayer.com:29478> SADD "talk:Hans Rosling: The best stats you've ever seen" awesome
(integer) 1
sl-us-south-1-portal.16.dblayer.com:29478> SADD "talk:Hans Rosling: The best stats you've ever seen" health  
(integer) 0
sl-us-south-1-portal.16.dblayer.com:29478> SMEMBERS "talk:Hans Rosling: The best stats you've ever seen"  
 1) "global development"
 2) "Africa"
 3) "math"
 4) "Asia"
 5) "health"
 6) "statistics"
 7) "global issues"
 8) "economics"
 9) "demo"
10) "Google"  
11) "awesome"  
12) "visualizations"  

These features make the set datatype ideal for working with this type of data, which is just a collection of values. The values have no order, and if you compare the output of the SMEMBERS call here with the one further up, you can see that the tags come back in a different order each time.
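To make the 1 and 0 responses concrete, here is a minimal sketch that models the behaviour with a plain Python dict of sets. It is an in-memory illustration only, not a Redis client; the `sadd` helper and the `store` dict are names made up for this example.

```python
# In-memory model of how SADD reports newly added members.
# Assumption: `store` stands in for the Redis keyspace; nothing here talks to a server.

def sadd(store, key, *members):
    """Add members to the set at `key` and return how many were not already present."""
    current = store.setdefault(key, set())
    added = 0
    for member in members:
        if member not in current:
            current.add(member)
            added += 1
    return added


store = {}
key = "talk:Hans Rosling: The best stats you've ever seen"
sadd(store, key, "health", "statistics")  # seed the set with two existing tags

first = sadd(store, key, "awesome")  # a new tag, so one member is added
second = sadd(store, key, "health")  # already present, so nothing is added
print(first, second)
```

The second call leaves the set unchanged, which is why adding the same tag repeatedly is safe.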

Secondary indexes: same song, different tune

Since Redis is mostly used as a cache or for storing disposable data, it's not unusual to denormalise or duplicate the data here so that we can access it in multiple ways, but really, really fast. For talks with tags, we often want to fetch data by tag as well as by talk. It's a very common pattern, and I like to solve it with a "secondary index": put simply, the same data stored again in a set but the other way around!

In addition to sets with a talk name as the key and the tags as the values, I've stored another group of sets with the tag as the key and the talks that have this tag as the values. For example, here's the tag:math record:

sl-us-south-1-portal.16.dblayer.com:29478> SMEMBERS tag:math  
1) "Hans Rosling: The best stats you've ever seen"  
2) "Arthur Benjamin: A performance of \"Mathemagic\""  
3) "Stephen Hawking: Questioning the universe"  

Using this second set of set datatype records, we can answer some new questions.
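As a rough sketch of how such a secondary index could be derived, the function below inverts a talk-to-tags mapping into a tag-to-talks mapping. It uses plain Python sets rather than Redis, the name `build_tag_index` is hypothetical, and the sample data is only a partial subset of the tags shown in this article.

```python
from collections import defaultdict

def build_tag_index(talk_tags):
    """Invert {talk title: set of tags} into {"tag:<name>": set of talk titles}."""
    index = defaultdict(set)
    for title, tags in talk_tags.items():
        for tag in tags:
            index["tag:" + tag].add(title)
    return dict(index)


# Partial sample data taken from the examples in this article.
talk_tags = {
    "Hans Rosling: The best stats you've ever seen": {"math", "health", "statistics"},
    'Arthur Benjamin: A performance of "Mathemagic"': {"math"},
    "Stephen Hawking: Questioning the universe": {"math"},
}

tag_index = build_tag_index(talk_tags)
print(sorted(tag_index["tag:math"]))
```

Against a real Redis instance, the inner step would be one SADD per tag and talk pair, with the tag key as the set name and the talk title as the member.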

Picking a random talk by tag

Let's start by taking a quick look at the tags we have. Their keys all start with tag:, which means we can SCAN with a MATCH parameter to find them.

sl-us-south-1-portal.16.dblayer.com:29478> SCAN 0 MATCH tag:*  
1) "16"  
2) 1) "tag:student"  
   2) "tag:mindfulness"
   3) "tag:performance"
   4) "tag:robots"
   5) "tag:data"
   6) "tag:evolution"
   7) "tag:demo"

The SCAN command returns a few keys at a time, and its first argument is a "cursor". The first time we call it we send zero, then the first element of the reply tells us the cursor value to use to get the next batch of keys. When the returned cursor is zero again, the scan is complete. In the example above the cursor value is 16, so my next command is:

sl-us-south-1-portal.16.dblayer.com:29478> SCAN 16 MATCH tag:*  
1) "8"  
2) 1) "tag:health"  
   2) "tag:medical imaging"
   3) "tag:work"
   4) "tag:spoken word"
   5) "tag:economics"
   6) "tag:writing"
   7) "tag:youth"
   8) "tag:magic"
   9) "tag:prosthetics"
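
The cursor handling above can be sketched as a loop that keeps calling SCAN until the cursor comes back as zero. In this sketch `scan_all` is a hypothetical helper that accepts any callable shaped like `scan(cursor=..., match=...)` and returning `(next_cursor, keys)`. I'm assuming that a client method such as redis-py's `scan` could be passed in, but the example uses a made-up stand-in so it runs without a server. The cursor values 16 and 8 come from the session above; the final page is invented.

```python
def scan_all(scan, pattern):
    """Follow the SCAN cursor until it returns to 0, collecting every matching key."""
    keys = []
    cursor = 0
    while True:
        cursor, batch = scan(cursor=cursor, match=pattern)
        keys.extend(batch)
        if int(cursor) == 0:  # a cursor of 0 means the iteration is complete
            break
    return keys


# Hypothetical stand-in for a server: each cursor maps to (next cursor, batch of keys).
FAKE_PAGES = {
    0: (16, ["tag:student", "tag:mindfulness"]),
    16: (8, ["tag:health", "tag:work"]),
    8: (0, ["tag:magic"]),
}

def fake_scan(cursor=0, match=None):
    # The stand-in ignores `match`; a real server would filter keys by the pattern.
    return FAKE_PAGES[int(cursor)]


all_tags = scan_all(fake_scan, "tag:*")
print(all_tags)
```

SCAN can return the same key more than once during a full iteration, so collecting the results into a set instead of a list is worth considering if duplicates matter.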

Enough messing about with the SCAN command. Let's look at the data again and check out what else is tagged "health". I can get all the values with SMEMBERS, similar to the examples above:

sl-us-south-1-portal.16.dblayer.com:29478> SMEMBERS tag:health  
 1) "Russell Foster: Why do we sleep?"
 2) "Shawn Achor: The happy secret to better work"
 3) "BJ Miller: What really matters at the end of life"
 4) "Jane McGonigal: The game that can give you 10 extra years of life"
 5) "Andy Puddicombe: All it takes is 10 mindful minutes"
 6) "Hans Rosling: The best stats you've ever seen"
 7) "Daniel Levitin: How to stay calm when you know you'll be stressed"
 8) "Robert Waldinger: What makes a good life? Lessons from the longest study on happiness"
 9) "Jamie Oliver: Teach every child about food"
10) "Kelly McGonigal: How to make stress your friend"  
11) "Judson Brewer: A simple way to break a bad habit"  

How about picking a random talk? The command is called SRANDMEMBER, and I can use it like this (running it a few times to show that you can get any of the values as a result):

sl-us-south-1-portal.16.dblayer.com:29478> SRANDMEMBER tag:health  
"Hans Rosling: The best stats you've ever seen"
sl-us-south-1-portal.16.dblayer.com:29478> SRANDMEMBER tag:health  
"Robert Waldinger: What makes a good life? Lessons from the longest study on happiness"
sl-us-south-1-portal.16.dblayer.com:29478> SRANDMEMBER tag:health  
"Shawn Achor: The happy secret to better work"

This is great! I can very easily pick a random talk to display for any given tag, offering relevant content to the users of the site I'm building.
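For intuition, picking a random member can be modelled in a few lines of Python. This is only a sketch of the idea and not how Redis implements it; the `srandmember` function here is a local stand-in, and the sample set holds three of the titles listed under tag:health above.

```python
import random

def srandmember(members, rng=random):
    """Return one member chosen at random, or None for an empty set."""
    if not members:
        return None
    # Sets have no order, so sort first to make seeded runs repeatable.
    return rng.choice(sorted(members))


health_talks = {
    "Russell Foster: Why do we sleep?",
    "Hans Rosling: The best stats you've ever seen",
    "Shawn Achor: The happy secret to better work",
}

pick = srandmember(health_talks)
print(pick)
```

Like the real command, the pick leaves the set unchanged, so the same talk can come up again on the next call.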

Which tags do these talks share?

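Now for the magic sparkly dust! Redis can combine set-type records and return information about how they relate: the values they share (SINTER), the values found in any of them (SUNION), and the values in one set but not the others (SDIFF).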

So if I want to know which tags two of these talks have in common:

sl-us-south-1-portal.16.dblayer.com:29478> SINTER "talk:Robert Waldinger: What makes a good life? Lessons from the longest study on happiness" "talk:Shawn Achor: The happy secret to better work"  
1) "TEDx"  
2) "happiness"  
3) "health"  

Looks like these health talks also deal with "happiness", and they're both from TEDx events. Seeing this overlapping tag data helps evaluate what the talks have in common and which data might be most relevant to show the user next.
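The same set algebra can be tried out with Python's built-in sets, which is a handy way to reason about what Redis will return. In this sketch the three shared tags come from the SINTER output above, while the extra tags ("aging" and "work") are invented purely so that the sets differ; `sinter` is a local helper, not a client call.

```python
def sinter(*sets):
    """Return the members common to every given set (empty if no sets are given)."""
    if not sets:
        return set()
    return set.intersection(*(set(s) for s in sets))


# Illustrative tag sets: the shared tags are from the article, the extras are made up.
waldinger_tags = {"TEDx", "happiness", "health", "aging"}
achor_tags = {"TEDx", "happiness", "health", "work"}

shared = sinter(waldinger_tags, achor_tags)   # like SINTER: tags both talks have
either = waldinger_tags | achor_tags          # like SUNION: tags on at least one talk
only_first = waldinger_tags - achor_tags      # like SDIFF: tags only on the first talk
print(sorted(shared))
```

The order of the arguments matters only for the difference: swapping the two sets would give the tags unique to the second talk instead.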

Redis the super cache

Hopefully this has shown you some of the tricks you can play with Redis, and specifically with the often-overlooked set datatype. Duplicating data in this way with an index plus a do-it-yourself secondary index really opens up the types of questions we can answer.

It's also common to duplicate the detailed records themselves in Redis, especially when working with such a small set of data. This example used only the most-viewed hundred talks, and with only the title, description, tags and URL fields, it's plenty small enough to put all of it in Redis.

Redis by itself is awesome and the sets are just one of many excellent features. To read more about commands for working with sets, check the Redis command documentation for sets—then let us know what you build!

Lorna is based in Yorkshire, UK; she is a Developer Advocate with IBM Watson Data Platform, a published author and experienced conference speaker. She brings her technical expertise on a range of topics to audiences all over the world with her writing and speaking engagements, always delivered with a very practical slant. In her spare time, Lorna blogs at http://lornajane.net.

attribution Felix Mooneeram
