Be prepared: High availability and you

There are some things you can't avoid on the internet: geography, topology, and things breaking. And as your applications and databases are on the internet, it means you can't avoid them either. If you want to maximize your uptime, you need to take all these things into account. Let's start from the end of that list.

Things Breaking

High Availability means your database is highly available. On Compose, we build databases so that if any node in the database goes down, the database is still up and accessible because the other nodes will step in to take on the work while it's recovering. We make sure those nodes are spread over availability zones so that one zone going down does not take a database deployment down.

Similarly, if any access portal goes down, there are one or more access portals set to take over the load. But we can only do a third of the work here, making our end as available as possible. The other two-thirds of the work falls to application developers and other database consumers.

That work is about remembering not to assume success in database calls. Too often code is written that opens a connection to the database and then loops until the heat death of the sun (or while true, whichever is sooner), performing a range of database reads and writes.

If, between now and the heat death of the sun, a portal were to fail, then this code would time out or error out and it wouldn't be pretty. Worse still, you might rerun the code, get it to connect to another portal and carry on with the database in a now undetermined state, leading to errors later on.

So one thing that's wrong here is that the driver in use either doesn't know how to fail over to different access portals, or it just hasn't been given enough information to use all the portals. It'd also take some time to detect failure; one thing TCP/IP networking doesn't do is time out quickly.
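To make the detection point concrete, here's a minimal Python sketch, assuming a portal you can reach at a host and port. It uses the standard library's socket.create_connection with an explicit timeout so that a dead portal is noticed in seconds rather than whenever the operating system gives up. The can_connect name and the three-second default are illustrative choices, not anything taken from a particular driver.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection opens within `timeout` seconds.

    Hypothetical helper: host, port and the default timeout are placeholders.
    """
    try:
        # Without an explicit timeout, a connect to an unreachable host can
        # hang for as long as the operating system's own defaults allow.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable networks and timeouts.
        return False
```

Most drivers have their own connect and socket timeout settings, so check what yours offers before reaching for raw sockets.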

So the first thing to fix, if reliability is important, is to use a database driver that knows how to fail over to a working portal. That means a driver that can detect when a call to the database has failed and switch over to another portal for connections. Without that, you are not using the high availability facilities. Note that many drivers will still throw an error when this happens, so you'll want to be able to handle a failover event cleanly.

Depending on your language, there are various ways to implement that error handling efficiently. You could pass a function to a "safe execute" call which does the work. You could extend the driver with a shim which takes care of the retrying. Whichever way, you need to ensure that it's applied consistently throughout your code too (and it's a good opportunity to abstract your database code from your business logic).
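As a rough illustration of the "safe execute" idea, here's a minimal Python sketch. It assumes your driver raises some identifiable exception when a connection drops; TransientDatabaseError is a hypothetical stand-in for that, and safe_execute, its retry count and its delays are illustrative names and values rather than any driver's real API.

```python
import time


class TransientDatabaseError(Exception):
    """Hypothetical stand-in for whatever your driver raises on a lost connection."""


def safe_execute(operation, retries=3, delay=0.5, sleep=time.sleep):
    """Run operation(), retrying on transient errors.

    Re-raises the last error once all attempts have been used up.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return operation()
        except TransientDatabaseError as error:
            last_error = error
            if attempt < retries:
                # Wait a little longer after each failed attempt.
                sleep(delay * attempt)
    raise last_error
```

You'd call it as safe_execute(lambda: fetch_user(42)), where fetch_user is whatever database call you want protected. One design point to keep in mind: a retried write may have partly succeeded the first time, so this pattern is safest with operations that can be repeated without harm.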

If your driver doesn't know how to fail over, or there isn't a driver for your platform/database combination that does failover, your next stop is rolling your own error handling. This means, at its simplest, catching errors from database calls, then trying the next connection on your list and repeating the call until it doesn't error.
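Here's a minimal sketch of that roll-your-own approach in Python. It assumes you have a list of portal addresses and a connect function from your driver; PortalFailover and the names in the example are hypothetical, and the exception types caught by default are Python's built-in ConnectionError and TimeoutError, which your driver may or may not use.

```python
class PortalFailover:
    """Try each portal in turn, remembering the last one that worked."""

    def __init__(self, portals, connect):
        # portals: list of addresses; connect: callable(address) -> connection
        self.portals = list(portals)
        self.connect = connect
        self.current = 0
        self.connection = None

    def call(self, operation, errors=(ConnectionError, TimeoutError)):
        """Run operation(connection), moving to the next portal on failure."""
        last_error = None
        # Give every portal on the list one chance before giving up.
        for _ in range(len(self.portals)):
            try:
                if self.connection is None:
                    self.connection = self.connect(self.portals[self.current])
                return operation(self.connection)
            except errors as error:
                last_error = error
                self.connection = None
                self.current = (self.current + 1) % len(self.portals)
        raise ConnectionError("all portals failed") from last_error
```

When every portal has been tried and none works, the sketch raises rather than looping forever, which leaves the decision about what to do next with the calling code.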

Topology

Your database connection can fail in other ways though. That could be because the internet routing or some element in the connection between your application and the database deployment has gone down. No amount of failover is going to help if your local network, or any part of your route to the database, is out of action.

One myth of the internet is that it "routes around damage". The internet doesn't do that. The modern internet infrastructure is built where there's market demand and there's no "always routable" guarantee, just human and digital endeavors to keep as much routable as possible. Remember that it takes just one injected BGP route to cripple a nation state's network. On a smaller scale, it can take just one misbehaving switch in a telco to make connections erratic.

That means you have to consider whether you can afford to set up a no-downtime connection or just architect your application to sensibly handle a complete loss of connectivity. At its simplest, sensible could mean managing being offline safely and sending appropriate alerts early.

Or it could mean tying your application monitoring to your infrastructure's network routing and switching to a different connection or other strategies for getting an alternate route. These things will cost time and money and the kinds of failure are remarkably hard to predict. That's where you need to balance those costs against the actual risks and business impacts.

Whichever way, these problems will manifest themselves as things breaking, so make sure you have that covered first.

Geography

A common assumption we come across at Compose, especially when benchmarking databases, is that geography has no impact on performance. Thanks to the speed of light, we know that's not the case: signals can only travel so quickly over the net, and that leads to latency. The other source of latency is the devices a signal will meet en route. The further away, the more devices, the slower the signal. Both benchmarking and engineering for resilience should be done with this in mind.
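To put a number on the speed-of-light limit, here's a small Python sketch. It assumes a signal travelling through optical fibre at roughly two-thirds of the speed of light in a vacuum, over a perfectly straight cable with no devices in the way, so it gives a floor that real connections will never beat. The function name and the 6,000 km figure are illustrative.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458  # speed of light in a vacuum
FIBRE_FACTOR = 2 / 3  # assumption: light in fibre travels at roughly 2/3 of that


def min_round_trip_ms(distance_km, medium_factor=FIBRE_FACTOR):
    """Lower bound on round-trip time, in milliseconds, over a straight cable."""
    one_way_seconds = distance_km / (SPEED_OF_LIGHT_KM_S * medium_factor)
    return 2 * one_way_seconds * 1000


# A hypothetical 6,000 km path cannot do better than about 60 ms per round trip.
print(round(min_round_trip_ms(6000)))
```

Real routes are longer than the straight-line distance and pass through many devices, so measured latency will sit above this figure.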

Where your application is, relative to where your database resides, will set the limit on how quickly transactions can be done and how fast you can expect monitoring calls to respond. The further your application is from your database, the longer a complete round trip will take, so use bulk operations to do as much work in a single trip as you can.
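The arithmetic behind that advice is simple enough to sketch in Python. This assumes every trip to the database costs one full round trip and ignores the time the server spends doing the work unless you supply it; the function name and the example figures are illustrative.

```python
import math


def total_time_ms(operations, round_trip_ms, batch_size=1, per_op_ms=0.0):
    """Estimate total time when operations are sent batch_size at a time."""
    round_trips = math.ceil(operations / batch_size)
    return round_trips * round_trip_ms + operations * per_op_ms


# Hypothetical figures: 10,000 writes over a 60 ms round trip.
one_at_a_time = total_time_ms(10_000, 60)               # 600,000 ms, ten minutes
in_bulk = total_time_ms(10_000, 60, batch_size=1_000)   # 600 ms
```

Under those assumptions the same work drops from ten minutes to well under a second, purely by making fewer trips.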

When measuring performance, use a tool like mtr to give you a better view of how your network connection is performing. Use it to map your connection and always assume that a different location, be it datacenter, offices or home, will have a completely different route to the destination. Don't assume geographic proximity makes for similar routes, especially with domestic broadband.

At the end of it all though, the simple rule is to reduce the geographic (and topological) distance between your applications and your databases. Site them in the same region at least, if you don't want to be exposed to internet weather. That's the same weather that can take out part of a connection (see Topology), so shortening the route helps with resilience as well as performance.

In conclusion

At the architecture level, look to reduce the complexity of the topology of your systems for best performance. The simplest way to do this is to move them closer together. If possible, get the fewest moving parts between the systems too so you lower the chances of a failure, but don't forget you'll still need redundancy for when failures do happen.

At the software level, at least make sure you actually use the high availability features of your database. Use failover connection strings and drivers that understand them. If you want to be more bullet-proof, assume every database call can fail and make sure you wrap all the calls with appropriate code to force failover and, where that doesn't work, sensibly step back and raise the alarm. Oh, and let your application keep checking for the database, not with real traffic but with simple query probing, so it can resume as soon as the database is ready.
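That last point, probing until the database comes back, could look something like this minimal Python sketch. It assumes you supply a probe function that runs some cheap query and returns True when it succeeds; wait_until_available and its delay values are illustrative, and the doubling delay is one reasonable choice so that a struggling database isn't hammered with probes.

```python
import time


def wait_until_available(probe, initial_delay=1.0, max_delay=60.0,
                         max_attempts=None, sleep=time.sleep):
    """Call probe() until it returns True, waiting longer after each failure.

    Returns the number of attempts made, or None if max_attempts ran out.
    """
    delay = initial_delay
    attempt = 0
    while True:
        attempt += 1
        try:
            if probe():
                return attempt
        except Exception:
            pass  # treat a probe error the same as "not ready yet"
        if max_attempts is not None and attempt >= max_attempts:
            return None
        sleep(delay)
        delay = min(delay * 2, max_delay)  # double the wait, up to a ceiling
```

Normal work stays paused while this runs, and the application resumes only once a probe succeeds.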



attribution Rich Lock