Mastering Redis high-availability and blocking connections


If you've developed your Redis application against a single local Redis instance, the move to Compose Redis may expose some assumptions about Redis behavior that you've inadvertently baked into your code. These assumptions can show up as poor performance.

The biggest assumption is that Redis is either running or not running, which is a reasonable assumption when you have a single instance of Redis. When you have a high availability configuration for Redis, you need to be aware that the Redis you are talking to may or may not be the master. If it isn't, you can't write to it, but you'll only find that out when you try to write to it. If that sounds a little "Schroedinger's Cat Picture Database", let us explain further.

High Availability

With a high availability configuration of Redis, there are two or more Redis servers. One is the master; the others are replicas of the master. There are also sentinels, processes which specifically check on the availability of the servers. When the sentinels agree that the master is unavailable, they step in and make a replica into a master so things can continue running. To connect to the master server, at Compose we use an HAProxy process. This HAProxy tests all the nodes to see which is the master and directs new traffic there.

Let's look at that in a practical way. The master server, for whatever reason, is shut down, dropping all the incoming connections to the server. The sentinels note that their heartbeat checks on the master are failing. The sentinels pick a replica and promote it to master. The HAProxy sees the change on its next health check pass and starts directing connections to the new master. When the previous master server returns to availability, the Compose platform ensures it returns as a read-only replica. Simple and effective...
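
The health check step in that sequence can be sketched roughly as follows. This is only an illustration, not how the HAProxy at Compose is actually configured. It assumes client objects that expose an `info("replication")` call returning a dict with a `role` field, as redis-py clients do; the `find_master` helper and the node labels are hypothetical.

```python
from typing import Any, Mapping, Optional


def find_master(nodes: Mapping[str, Any]) -> Optional[str]:
    """Return the label of the first node that reports itself as master.

    `nodes` maps a label to a client object with an info(section) method
    (an assumption modelled on redis-py). Unreachable nodes are skipped.
    """
    for name, client in nodes.items():
        try:
            replication = client.info("replication")
        except Exception:
            # A node that errors or times out cannot be the target for writes.
            continue
        if replication.get("role") == "master":
            return name
    return None
```

A proxy-style checker would run something like this on every pass and send new connections to whichever node it returns. Connections that were opened before the switch are not moved, which is where the trouble described later comes from.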

Offline but not offline

Apart from one thing. You can make any Redis server look like it's offline by just forgetting that the server is single threaded. Input and output to the server are handled away from it, but at Redis's core is a single thread which does all the database work. This is one reason why native Redis commands can usually be treated as atomic operations: the database basically blocks, processes the command, pushes the results to I/O, then releases and moves on to the next operation. Redis is an in-memory database, so processing the operation should be quick for most commands.

But there's a class of commands on Redis that can potentially take a long time to complete. Top of the list is the KEYS command. "Returns all keys matching a pattern" says the documentation. Sounds very useful, but if you read on you'll see it comes with a warning: it "should only be used in production environments with extreme care" and "may ruin performance when it is executed against large databases". If you do use KEYS in your code, the chances are you won't notice the problem in testing unless you test against a copy of a production database, something Compose can make easy for you to do. In production, as your data grows, calls to KEYS will become more and more punishing to performance.

So punishing that you'll hit the point where the sentinel can't get a response to its heartbeat check. In a single Redis instance, this point would just be "Ah, Redis is running super slow and isn't taking connections as quickly as it should". For the sentinels though, it's an alarm. The master database is down and it's time to switch. A replica becomes a master, while the previous master becomes a replica when it eventually responds again. The HAProxy, meanwhile, starts sending new requests to the new master. And all is as well as can be expected except... if you have pubsub commands or blocking requests outstanding.

The Blocking Problem

The BLPOP, BRPOP and BRPOPLPUSH commands are all super-useful blocking commands for handling queues. The pubsub commands are useful for message passing. The thing about these commands though is that they hold a connection open to the server.

For example, BLPOP can sit on an empty list and wait for something to be added to the list, at which point it'll pop it off and return it. If there's something already in the list, that's popped off and returned immediately. Where there is only a single Redis server, if the server goes down, the connections are severed and errors are thrown in the client. Where there's a high availability configuration, though, these connections can stay open.

Most connections to the unresponsive master should be killed, but long-running connections like this are hard to kill because they are effectively running commands. Eventually, something will update a list on the new master, the change will be replicated to the now-replica server and the BLPOP will attempt to remove the first item from the list... and fail because the replica is read-only. Even if the connection were killed by the server, there would still be an error thrown by the disconnection.

If your application isn't catching errors that can occur when using these blocking or pubsub commands, then you will fall foul of failover. These commands should, at a minimum, be wrapped up to catch errors and look to reconnect. The act of reconnection will send the blocking command over to the new master. As for the connections staying open? Well, this is why there's a timeout option on the blocking commands. By setting the timeout, you can ensure that the blocking command will regularly complete and then you can reissue the command if no result is forthcoming. The reissue of the command should be enough, in conjunction with the previous error checking, to ensure a reliable experience. Just because the server configuration is highly available doesn't mean that errors will never occur, especially when failover happens with open connections to the failing master.
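
To make that advice concrete, here is a minimal sketch of a blocking pop that uses a timeout, catches errors and reconnects. It is written against a hypothetical `connect` callable that returns a client with a `blpop(key, timeout=...)` method modelled on redis-py, where a timed-out call returns `None` and a successful one returns a `(key, value)` pair. The `blocking_pop` name, the retry counts and the backoff are all assumptions for illustration.

```python
import time
from typing import Any, Callable, Optional, Tuple, Type


def blocking_pop(
    connect: Callable[[], Any],
    key: str,
    retry_on: Tuple[Type[BaseException], ...],
    timeout: int = 5,
    max_attempts: int = 10,
    backoff: float = 0.5,
) -> Optional[Any]:
    """Pop one item from `key`, surviving timeouts and failovers.

    `connect` is a zero-argument callable returning a fresh client
    (hypothetical; modelled on redis-py). Returns the popped value, or
    None if nothing arrived within max_attempts tries.
    """
    client = connect()
    for _ in range(max_attempts):
        try:
            # A finite timeout makes the command complete regularly instead
            # of holding one connection open indefinitely.
            result = client.blpop(key, timeout=timeout)
        except retry_on:
            # The connection dropped or the server turned read-only: pause,
            # then reconnect so the new connection reaches the new master.
            time.sleep(backoff)
            client = connect()
            continue
        if result is not None:
            _key, value = result
            return value
        # Timed out on an empty list: loop round and reissue the command.
    return None
```

With redis-py, `retry_on` would plausibly be `(redis.exceptions.ConnectionError, redis.exceptions.TimeoutError, redis.exceptions.ReadOnlyError)`; other client libraries will have their own equivalents. Passing a `connect` callable rather than a client is a design choice that forces a fresh connection after a failure, so the proxy gets the chance to route it to the current master.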

Fixing the source of the problem

We started this with the idea of a server that just appears offline because it is being given lots of long-running KEYS commands. The way to find out whether this is what is happening is to check your Redis slowlog. You can use the SLOWLOG command or, on Compose, use the Slow Log viewer in the web console. You can read about that in a previous article in Compose Articles. If KEYS does turn out to be the culprit, there are alternatives. The SCAN command can do pattern matching in a way that doesn't block the server and returns data in more manageable chunks. And it's worth considering whether the problem is amenable to using Redis sets, which you can also scan using SSCAN. Redis is, as described on its home page, an in-memory data structure store and it's always worth seeing if you can make better use of the available structures.
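
As a rough sketch of swapping KEYS for SCAN, the helper below walks the keyspace one cursor step at a time. It assumes a client whose `scan(cursor=..., match=..., count=...)` call returns a `(next_cursor, keys)` pair, as redis-py does; `matching_keys` and `batch_hint` are hypothetical names.

```python
from typing import Any, Iterator


def matching_keys(client: Any, pattern: str, batch_hint: int = 100) -> Iterator[Any]:
    """Yield keys matching `pattern` using incremental SCAN calls.

    Each call does a bounded amount of work on the server, so other
    clients (and sentinel heartbeats) get served between calls.
    """
    cursor = 0
    while True:
        cursor, keys = client.scan(cursor=cursor, match=pattern, count=batch_hint)
        yield from keys
        # The server signals the end of the iteration with a cursor of 0.
        if cursor == 0:
            break
```

Bear in mind that SCAN can return the same key more than once, and `count` is only a hint, so callers that need uniqueness should collect the results into a set. If you are using redis-py, its `scan_iter` method wraps the same loop for you.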



Image by Paul Smith
Dj Walker-Morgan
Dj Walker-Morgan is Compose's resident Content Curator, and has been both a developer and writer since Apples came in II flavors and Commodores had Pets.
