Your Drivers Are Not Magic - Testing Your Application for High Availability

It is remarkably easy to build an application that uses MongoDB but doesn't handle the day-to-day high availability events of a MongoDB database. When there's only one server, it's either up or down, but when there's a replica set, things get more complex. Unfortunately, many applications are written with a single server in mind and never make the changes needed to adapt to the new, more resilient environment.

The line of thinking behind such applications can go "My database driver is full of magic and will handle everything". The more considered version is "My driver knows how to handle replica sets and step downs, so I don't need to worry about it - the next call to the database will go to the right server". Neither line of thinking survives contact with reality though. MongoDB implements, and Compose deploys, replica sets with at least two members - a primary and a secondary. When maintenance is done on the primary member, or it becomes unavailable for some other reason, a secondary automatically steps up to take over.

Connecting to Replica Sets

First of all, most drivers need to be told they are talking to a replica set and at least be informed of some of the hosts that are involved in making the MongoDB database available. That's why it's important to use a replica set URI which you can get from your Compose dashboard. Here's an example:

mongodb://user:pass@c343.lamppost.3.mongolayer.com:10343,c386.lamppost.2.mongolayer.com:10386/recordings?replicaSet=set-54d4ca1c4f4426ab04000bb3  

As you can see, there are two hosts using different ports. If you use this URI when connecting to your database, the driver can step through the hosts until it finds a primary database server, verify it is part of the named replica set, and off it will go. For this example, we'll insert 10,000 records, which gives us, in Ruby, a program that looks like this:

require "mongo"

mongo_client=Mongo::MongoClient.from_uri("mongodb://user:pass@c343.lamppost.3.mongolayer.com:10343,c386.lamppost.2.mongolayer.com:10386/recordings?replicaSet=set-54d4ca1c4f4426ab04000bb3")

db=mongo_client.db("recordings")  
coll=db.collection("records")

for i in 0...10000 do  
     coll.insert( { "counter" => i, "named" => "name#{i}" })
end  

Run this and the driver will use the hosts from the URI to establish which server is primary and connect to that.
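Under the hood, the seed list and replica set name come straight out of that URI. Here's a minimal sketch, in plain Ruby and without the driver, of pulling those pieces apart - purely an illustration of what the driver does for you, using the example URI from above:

```ruby
# Sketch: extract the seed hosts and replica set name from a MongoDB URI.
# In real code the driver parses this for you; this just shows what's in there.
uri = "mongodb://user:pass@c343.lamppost.3.mongolayer.com:10343," \
      "c386.lamppost.2.mongolayer.com:10386/recordings" \
      "?replicaSet=set-54d4ca1c4f4426ab04000bb3"

# Everything between the "@" and the next "/" is the comma-separated seed list
seeds = uri[%r{@([^/]+)/}, 1].split(",")
# The replicaSet query parameter names the set the driver should verify
set_name = uri[/replicaSet=([^&]+)/, 1]

puts seeds.inspect
puts set_name
```

The driver contacts those seed hosts in turn, confirms each belongs to the named set, and discovers from them which member is currently primary.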

Now, although it's neat and will find the right server in a replica set, consider what happens if this were production code and we were constantly inserting data... and the replica set has to step down the primary server and let a secondary step up to primary. You might hope, or assume, that when that happens the same slight magic that happened on connecting would happen again. But that's not the way the real world works. It's easy to spin up a new deployment on Compose to test this; in fact, that URI for the database is pointing at our test deployment. It is also easy to force a step down on a deployment. Just go to the deployment's Settings after running your program...

$ ruby hotstepper.rb

We click "Trigger Stepdown" here and after a few moments...

/Library/Ruby/Gems/2.0.0/gems/mongo-1.12.0/lib/mongo/networking.rb:338:in `rescue in receive_message_on_socket': Operation failed with the following exception: Connection reset by peer (Mongo::ConnectionFailure)
  from /Library/Ruby/Gems/2.0.0/gems/mongo-1.12.0/lib/mongo/networking.rb:330:in `receive_message_on_socket'
  from /Library/Ruby/Gems/2.0.0/gems/mongo-1.12.0/lib/mongo/networking.rb:191:in `receive_header'
...
$

The program lost contact with the primary server, triggered an exception and exited. The problem starts with the assumption that no calls would be in progress when a server goes offline.

Re-retry

When the connection breaks without any error handling, you'll get an error like the one above; in Ruby terms, the driver raises a Mongo::ConnectionFailure exception. If you can catch that and retry, you might think you'd be fine. Let's replace the inside of our for loop to capture that exception, wait half a second and try inserting again...

begin  
  coll.insert( { "counter" => i, "named" => "name#{i}" })
rescue Mongo::ConnectionFailure => ex  
  sleep(0.5)
  coll.insert( { "counter" => i, "named" => "name#{i}" })
end  

No, you'd never willingly write code like this, and if you step down the primary you'd see it fall over too. You see, once the driver has figured out there's a problem, the next call will go to the new primary. But picking the new primary server takes a non-negligible amount of time, and once the secondary is selected to step up, it still has to step up. Only once it's up will the driver find it and all will be well - half a second is often not long enough. Ruby programmers might suggest swapping that second insert for retry like so...

begin  
  coll.insert( { "counter" => i, "named" => "name#{i}" })
rescue Mongo::ConnectionFailure => ex  
  sleep(0.5)
  retry
end  

Which will work, but on the day your app loses connectivity to the internet, you'll find it spinning in retry loops forever.

An Elegant Ruby Retry

To avoid spinning in loops forever, you'll need to keep track of how many times you've retried. One Ruby solution, from the MongoDB Ruby driver documentation, introduces a useful function, rescue_connection_failure:

# Ensure retry upon failure
def rescue_connection_failure(max_retries=60)  
  retries = 0
  begin
    yield
  rescue Mongo::ConnectionFailure => ex
    retries += 1
    raise ex if retries > max_retries
    sleep(0.5)
    retry
  end
end  

This will retry an operation up to 60 times by default. Then all you need to do is wrap your database operations like so:

for i in 0...10000 do  
  rescue_connection_failure do
    coll.insert( { "counter" => i, "named" => "name#{i}" })
  end
end  

And the code will retry sensibly through a step down. Then all you have to do is reproduce that technique safely throughout your application code. If you are using an ORM or some other database layer, you'll have to check how it handles connection failures too.
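One way to convince yourself the wrapper behaves as intended, before wiring it into real database calls, is to exercise it against a stand-in collection that fails a few times and then succeeds. This sketch defines a stub Mongo::ConnectionFailure so it runs without the driver or a live deployment (with the real gem loaded, the driver's own exception class is used instead), and checks both the recovery path and the give-up path:

```ruby
# Stand-in for the driver's exception class so this sketch runs without
# the mongo gem; with the gem loaded, Mongo::ConnectionFailure already exists.
module Mongo
  class ConnectionFailure < StandardError; end
end

# The wrapper from above, with a shorter sleep to keep the test quick
def rescue_connection_failure(max_retries=60)
  retries = 0
  begin
    yield
  rescue Mongo::ConnectionFailure => ex
    retries += 1
    raise ex if retries > max_retries
    sleep(0.01)
    retry
  end
end

# A fake collection whose insert raises ConnectionFailure for its first
# few calls, simulating the window while a new primary is being elected
class FlakyCollection
  def initialize(failures)
    @failures = failures
    @calls = 0
  end

  def insert(doc)
    @calls += 1
    raise Mongo::ConnectionFailure, "no primary" if @calls <= @failures
    :ok
  end
end

# Recovers: 3 failures is within a budget of 5 retries
flaky = FlakyCollection.new(3)
result = rescue_connection_failure(5) { flaky.insert({ "counter" => 1 }) }
puts result # prints "ok"

# Gives up: 10 failures exhausts a budget of 2 retries and re-raises
stubborn = FlakyCollection.new(10)
begin
  rescue_connection_failure(2) { stubborn.insert({ "counter" => 2 }) }
rescue Mongo::ConnectionFailure
  puts "raised after retries exhausted"
end
```

FlakyCollection is of course a hypothetical test double, not part of the driver, but the same shape of test works against whatever wrapper your ORM or database layer provides.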

A Touch of Go

Depending on your language and driver, you'll find that handling a connection failure won't all be about repeatedly retrying though. Our example above is in Ruby, but here's the same idea in Go.

package main

import (  
  "fmt"
  "os"

  "gopkg.in/mgo.v2"
  "gopkg.in/mgo.v2/bson"
)

type entry struct {  
  ID      bson.ObjectId `bson:"_id"`
  Counter int           `bson:"count"`
  Named   string        `bson:"named"`
}

func main() {  
  session, err := mgo.Dial("mongodb://user:pass@c343.lamppost.3.mongolayer.com:10343,c386.lamppost.2.mongolayer.com:10386/recordings")
  if err != nil {
    fmt.Println(err)
    os.Exit(1)
  }
  coll := session.DB("recordings").C("records")
  for i := 0; i < 10000; i++ {
    doc := entry{ID: bson.NewObjectId(), Counter: i, Named: fmt.Sprintf("name%d", i)}
    if err := coll.Insert(doc); err != nil {
      session.Refresh()
      err = coll.Insert(doc)
      if err != nil {
        fmt.Println(err)
        os.Exit(1)
      }
    }
  }
}

The Go mgo driver is a very advanced driver for MongoDB. It can discover the members of a replica set from one host, but it's best to list a number of hosts in the connection URI - relying on one host for discovery can make the connection process fragile at best. The mgo driver also knows about step downs and can handle nearly all of the issues. The only thing that needs to be caught is an error from an ongoing insert...

if err := coll.Insert(doc); err != nil {  

But rather than just retrying repeatedly, an mgo-using application need only refresh the session to the database and try the operation again. If that results in an error too, the likelihood is that there's a more major problem ongoing than a step down.

session.Refresh()  
err = coll.Insert(doc)  
if err != nil {  
    fmt.Println(err)
    os.Exit(1)
}

The session.Refresh() clears the state of the connection, allowing the subsequent Insert() to trigger the re-establishment of the connection.

Test, Test and Test

What this all comes down to is making sure that the error handling around your database calls actually works. Even if you have a replica-set-aware driver and a replica set URI, unless your database interactions can handle a step down - either by retrying the operation repeatedly or by stepping back to allow a new connection to be established before retrying - you run the risk of your application either losing data or exiting with an unhandled exception.

If uptime and reliability are important to you, make sure your application handles step downs, as they are an inherent component of MongoDB's high availability solution. With Compose, we make testing with real data easy. Put together your staging/test application, then, using Compose's dashboard, restore a backup into a new MongoDB deployment to create an accurate snapshot of your production DB to test against. You can then point your staging/test application at that independent copy of your database and simulate activity against it.

When you are ready, go to Settings for that test deployment and hit Trigger Stepdown to test your code's capabilities. Then repeat the process while exercising your application's database access. If you've got any weak spots in your handling, some should show up through this process, but there's only so much that you can test for. Do remember to review your database access code too - it is the most important part of your application.