The Compose Platform and the Cassandra Challenge

Bringing new databases onto the Compose platform isn't simple, because we need to make them work with our architecture, and sometimes a database's own architecture clashes with the way the Compose platform works. When we're asked whether we're bringing a particular database to Compose, the answer is that we're always working on new databases for Compose, but until they are working, and working well, we're not going to make them available as a managed option.

To shed some light on what that means, we talked to Nick Stott, who led a recent developer sprint to see whether Cassandra would work well enough on the Compose platform.

(TL;DR: No, Cassandra doesn't work well enough, and we aren't announcing anything Cassandra-related until it does.)

Dj: What's neat about Cassandra that led to you wanting to see how it would fit with Compose?

Nick: Cassandra is a nice eventually consistent database. It works well at scale, with tens, hundreds or thousands of nodes, and is supposed to scale linearly. Part of the way this architecture works is that it allows you to tag hosts based on where they are located. So you could, for example, tag all the hosts in us-east-1a, us-east-1b, eu-west and so on.

Dj: Aren't those the names of Amazon's regions and availability zones?

Nick: Right, that's just an example. The Cassandra drivers are then expected to make intelligent choices about which hosts to connect to based on certain policies. Host discovery is a big part of this logic. A policy might be something like: 'connect to a seed host, auto discover hosts, and then connect to hosts only in us-east-1'. Or, perhaps: 'connect to any hosts that contain this subset of data, and with the lowest latency'.
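To make that concrete, here's a rough sketch of that kind of policy using gocql, a Cassandra driver for Go: point it at a seed host, let it discover the rest of the ring, and prefer hosts in one data centre. The host name and data centre label are placeholders.

```go
// A rough sketch of a driver-side policy with gocql: connect to a seed host,
// let the driver discover the rest of the ring, and prefer hosts in one
// data centre. The host name and data centre label are placeholders.
package main

import "github.com/gocql/gocql"

func main() {
	cluster := gocql.NewCluster("seed-host.example.com")
	// Route queries to discovered hosts in us-east-1 first.
	cluster.PoolConfig.HostSelectionPolicy = gocql.DCAwareRoundRobinPolicy("us-east-1")

	session, err := cluster.CreateSession()
	if err != nil {
		panic(err)
	}
	defer session.Close()
}
```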

Dj: That is a clever way of building a distributed database.

Nick: Yes, but for us, auto discovery is a problem. We have three different classes of IP addresses in AWS. We have the AWS public address (54.x.x.x), we have the AWS private address (172.x.x.x), and we also have the deployment's own VLAN (10.x.x.x). So, while we want replication to happen across the private VLAN, 10.x.x.x, when a driver connects to the cluster through an AWS public address, 54.x.x.x, it ends up auto-discovering addresses in the private VLAN range that it can't reach.
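To illustrate where that bites, a client would need something like an address translation hook to map every discovered private address back to one it can actually reach. The sketch below is purely hypothetical and uses gocql's address translator (assuming a driver version that has one) with a made-up mapping; it only works if the client already knows the whole private-to-public mapping, which isn't something a managed platform can push onto every customer's driver.

```go
// Hypothetical illustration only: a client could, in principle, rewrite the
// 10.x addresses it discovers back to public ones with gocql's address
// translator hook, but that assumes the client already knows the whole
// private-to-public mapping up front.
package main

import (
	"net"

	"github.com/gocql/gocql"
)

func main() {
	// Made-up mapping from private VLAN addresses to AWS public addresses.
	publicFor := map[string]string{
		"10.0.0.1": "54.0.0.1",
		"10.0.0.2": "54.0.0.2",
	}

	cluster := gocql.NewCluster("54.0.0.1")
	cluster.AddressTranslator = gocql.AddressTranslatorFunc(func(addr net.IP, port int) (net.IP, int) {
		if pub, ok := publicFor[addr.String()]; ok {
			return net.ParseIP(pub), port
		}
		return addr, port
	})

	if _, err := cluster.CreateSession(); err != nil {
		panic(err)
	}
}
```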

Dj: Skipping over that for a moment, how was it to set up?

Nick: Going through our normal process with Cassandra was pretty uneventful. I wrote a config in gru...

Dj: Gru?

Nick: The orchestration framework we developed to manage capsules. I wrote some recipe configs and decided to start testing with a three node Cassandra system with two haproxy portals for access.

Backups were a pain. There is no cluster-level backup for Cassandra, because it's eventually consistent. The stock process for backups seems to be triggering a local backup on each node at more or less the same time; on restore, eventual consistency works through any differences in the state of the snapshots. I wrote some code to coordinate this, which lets us take a cluster-level snapshot at the same time on every node.
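The coordination idea itself isn't exotic. A minimal sketch (not the actual Compose code) would just fire `nodetool snapshot` at every data member at more or less the same moment and wait for them all; it assumes SSH access to each node and nodetool on the remote PATH, with placeholder addresses and snapshot tag.

```go
// A minimal sketch of the coordination idea (not the actual Compose code):
// kick off "nodetool snapshot" on every data member at the same time and
// wait for them all. Assumes SSH access to each node and nodetool on the
// remote PATH; addresses and the snapshot tag are placeholders.
package main

import (
	"fmt"
	"os/exec"
	"sync"
)

func main() {
	nodes := []string{"10.0.0.1", "10.0.0.2", "10.0.0.3"}
	tag := "cluster-snapshot-001"

	var wg sync.WaitGroup
	for _, node := range nodes {
		wg.Add(1)
		go func(host string) {
			defer wg.Done()
			cmd := exec.Command("ssh", host, "nodetool", "snapshot", "-t", tag)
			if out, err := cmd.CombinedOutput(); err != nil {
				fmt.Printf("%s: snapshot failed: %v\n%s", host, err, out)
				return
			}
			fmt.Printf("%s: snapshot %s taken\n", host, tag)
		}(node)
	}
	wg.Wait()
}
```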

Dj: Anything else different?

Nick: Gathering metrics was unusual. Cassandra doesn't expose things like heap usage through an API; instead it exposes these metrics through JMX. The command-line clients that gather this usually involve spinning up another JVM to connect to the JMX endpoint and pull the data off. This was messy, so I found a way to expose these metrics over HTTP with Jolokia, and wrote something to connect to that and expose the metrics to a command-line client.
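For the curious, reading a metric through Jolokia is just an HTTP GET that returns JSON. A small sketch, assuming the Jolokia agent is attached to the Cassandra JVM on its default port (8778) and reading the standard JVM heap MBean:

```go
// A small sketch of reading one JMX metric over Jolokia's HTTP bridge,
// assuming the Jolokia agent is attached to the Cassandra JVM on its default
// port (8778). The MBean here is the standard JVM heap metric.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	url := "http://127.0.0.1:8778/jolokia/read/java.lang:type=Memory/HeapMemoryUsage"
	resp, err := http.Get(url)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Jolokia wraps the attribute in a JSON envelope under "value".
	var payload struct {
		Value struct {
			Used int64 `json:"used"`
			Max  int64 `json:"max"`
		} `json:"value"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&payload); err != nil {
		panic(err)
	}
	fmt.Printf("heap used: %d of %d bytes\n", payload.Value.Used, payload.Value.Max)
}
```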

Everything seemed to work at this point: backups were being taken, health checks were working, provisioning and so on. It was sweet as candy.

Dj: I sense a but coming?

Nick: There is a GUI tool for Cassandra put out by DataStax called DataStax DevCenter. I could connect with DevCenter and query Cassandra, and although things seemed to be working, queries were ridiculously slow. We are talking reads and writes of one small document taking more than a second. This was disconcerting and I wanted to dig into it a little bit.

Dj: How were you going to do that?

Nick: Benchmarking Cassandra is a pain. The usual tools like ycsb and jmeter don't work well with the versions of Cassandra we're running – 2.1.x and up. There were some open issues about this, but I managed to dust off my mvn skills, do some hackery and get a crippled version of ycsb able to connect. It was able to read/write from my workstation at about 30 ops/second. That was way too low. I chalked this up to me breaking something in the benchmark, and moved on. I wrote some crude benchmarking in Go so we could get a different take on the metrics.
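The benchmark itself was nothing fancy. A minimal sketch of the same idea (not the actual tool) with gocql: write and then read one small row in a loop and report operations per second, against a placeholder keyspace and table.

```go
// A minimal sketch of a crude Cassandra benchmark: write and then read one
// small row in a loop with gocql and report operations per second. The
// keyspace and table are placeholders.
package main

import (
	"fmt"
	"time"

	"github.com/gocql/gocql"
)

func main() {
	cluster := gocql.NewCluster("127.0.0.1")
	cluster.Keyspace = "bench"
	session, err := cluster.CreateSession()
	if err != nil {
		panic(err)
	}
	defer session.Close()

	const ops = 1000
	start := time.Now()
	for i := 0; i < ops; i++ {
		if err := session.Query(
			`INSERT INTO docs (id, body) VALUES (?, ?)`, i, "small document",
		).Exec(); err != nil {
			panic(err)
		}
		var body string
		if err := session.Query(
			`SELECT body FROM docs WHERE id = ?`, i,
		).Scan(&body); err != nil {
			panic(err)
		}
	}
	elapsed := time.Since(start).Seconds()
	fmt.Printf("%.0f ops/s\n", float64(2*ops)/elapsed)
}
```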

I ran the Go benchmark from a few different places. Locally, from within the container, I was able to get about 600 ops/s. From another host on AWS I was able to connect through the haproxy and get about 300 ops/s. From my workstation I was able to get only 20 ops/s, which was still far too low.

Dj: Did tuning the system help?

Nick: I spent a couple of days trying to tune the Cassandra cluster. I also got version changing working for Cassandra, which let me benchmark Cassandra 2.2.1 against the other version (2.1.9) and got roughly the same answer. There definitely seemed to be something else going on here. All the tuning I did on the database only changed the performance by a few percent, and it didn't make a dent in the problems I was seeing.

Dj: What was next in the process?

Nick: I thought that host discovery might be breaking things in the usual way: the client machine discovers 10.x IP addresses and tries to connect to them.

Dj: That's 10.x IP addresses coming from the private VLANs our deployments run on, yes?

Nick: Yes, so I turned off host discovery in the Go driver, and dug out Wireshark and tcpdump to try to analyze the traffic. I didn't see anything 'bad' happening; I didn't see the normal problems I would expect to see from host discovery. There were a lot of re-authentications, but nothing that would explain the bad performance.
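Switching off discovery in the Go driver is only a couple of lines. Roughly, and assuming a gocql version with these options, it looks like this:

```go
// Roughly what switching off discovery looks like in gocql, assuming a
// version of the driver with these options: only connect to the hosts given
// explicitly and don't go looking for peers.
package main

import "github.com/gocql/gocql"

func main() {
	cluster := gocql.NewCluster("54.0.0.1")
	// Skip the initial lookup of the ring on connect...
	cluster.DisableInitialHostLookup = true
	// ...and ignore peer addresses reported by the control connection.
	cluster.IgnorePeerAddr = true

	if _, err := cluster.CreateSession(); err != nil {
		panic(err)
	}
}
```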

I spun up an SSH capsule, and benchmarked through that, and saw virtually the same performance as through haproxy. That was an unfair test though, as we were still playing IP trickery, connecting to 127.0.0.2 while the server was on 10.x.x.x. I tried aliasing the same 10.x IPs to my lo0 just to see if I could trick the driver, but there was no performance change at all.

Dj: What was your next line of attack?

Nick: I began thinking that the portals were perhaps causing part of the problem. I decided to expose ports on the data members, and do replication over the container's eth0 (192.168.x, br0, etc). It would be fine for testing but this would cause a few problems in production because:

  1. There is no authentication on the inter-server replication, we'd have to write something.
  2. Cassandra demands that all servers expose the same port across all nodes, and we don't currently do that.

So, I punted on 1 and 2 just in order to test whether this would fix the problem. I exposed a high port on a Cassandra cluster and tested directly against the cluster, both with and without host auto discovery enabled in the driver. And there was no change. Yay.

Dj: Oh.

Nick: The Freenode hivemind gave me some ideas though. Apparently the 'magic' with Cassandra happens when you don't serialize the queries and just 'let it flow'. Apparently, too, the Go driver doesn't do this, and I should have been running benchmarks with a tool called cassandra-stress instead. This seemed promising, so I cranked that up and, after some hours of breaking it, managed to get it running in the container. Eventually I was able to get amazing performance, some 20k ops/second inside the container. That was really encouraging.
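For comparison, 'letting it flow' from the Go side would mean keeping many queries in flight rather than waiting on each one in turn. A rough sketch with gocql, fanning writes out over a pool of goroutines, with placeholder keyspace, table and worker counts:

```go
// A rough sketch of 'letting it flow' from Go: keep many queries in flight by
// fanning writes out across goroutines instead of waiting on each one in
// turn. Keyspace, table and the worker counts are placeholders.
package main

import (
	"sync"

	"github.com/gocql/gocql"
)

func main() {
	cluster := gocql.NewCluster("127.0.0.1")
	cluster.Keyspace = "bench"
	session, err := cluster.CreateSession()
	if err != nil {
		panic(err)
	}
	defer session.Close()

	const workers = 64
	const perWorker = 1000

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				// Errors are ignored here purely to keep the sketch short.
				session.Query(
					`INSERT INTO docs (id, body) VALUES (?, ?)`,
					w*perWorker+i, "small document",
				).Exec()
			}
		}(w)
	}
	wg.Wait()
}
```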

I got cassandra-stress running on my local machine, and it broke immediately. We needed to expose several more ports on our Cassandra nodes to get it running – the thrift port, the replication port, and the native protocol port all needed to be enabled and exposed at that point. Once those ports were enabled, I was able to run cassandra-stress locally, and saw roughly the same performance as I did with the Go tool, ~20-30 ops per second, plus a large number of failed writes.

It was at this point that I decided that Cassandra was dead to me. But I still had one trick up my sleeve; one approach seemed promising - this fellow is able to run a node without data and just use it to manage the client connection policies. I got that running, and did some quick benchmarks, but didn't see any significant improvements – things still seemed crippled.

Dj: So what's the diagnosis?

Nick: All in all, I think that the 'smart clients' and the reliance on host auto discovery, combined with the three layers of IP addressing between the client and the node, are the root of this problem.

While I think that it's possible that some random configuration tweaks could get a few more ops/s, the fact that we're two orders of magnitude away from the local benchmark indicates that something is 'broken', rather than just 'sub-optimal'.

Dj: What's next for Cassandra then?

Nick: I am planning on revisiting Cassandra when we have new dedicated Gru hosts. With those, I think we can strip away some of the layers of networking and potentially simplify our platform to a point where Cassandra's smart drivers don't break the world.

Dj: Thanks Nick.

So, we will be coming back to Cassandra, but not right now. If you are a developer, love Cassandra and want to see it on the Compose platform, come chat with us in our Freenode IRC channel #compose. We are always looking to hire engineers with the right skills for Compose.

Finally, we must point out that Nick wasn't the only developer at Compose who was engaged in a development sprint. We'll be talking more about that in the very near future.