RethinkDB & Compose Talk Databases & Hosting

Last week, RethinkDB's Slava Akhmechet and Compose's Kurt McKay sat down to introduce RethinkDB on Compose in a webinar. If you didn't have time to catch it, we have a recording of the meeting available on Vimeo. For those of you who prefer to read, we've also got a summary below.

RethinkDB on Compose demonstrated

The discussion opened with a demo in which Kurt showed how quickly you can configure a RethinkDB database on Compose, explaining that Compose was configuring capsules in the background to provide a two-node RethinkDB cluster. He then demonstrated that backing up the database on demand takes just a few clicks and went on to configure the SSH access capsule. After passing on a Mac OS X hint (the pbcopy command makes it easy to put command line data into the clipboard), he connected to the RethinkDB administration UI, where he created a new database and table and configured replication.

Moving on, Kurt then used the Compose Transporter to move data from MongoDB to Elasticsearch (the RethinkDB version of Transporter is currently rolling into production). He configured it to extract "what we actually care about" from a document, using a small Transformer function written in JavaScript, and insert the result into the destination database. Slava asked about the particular data being transported, and Kurt explained it came from one of Compose's own internal tools that he works on, which lets people indicate what they are working on and "give out awesome little votes to people who have done cool things - that's the level of tools I work on, we just give you the infrastructure to make it easy". He added that he has been porting the tool to RethinkDB and that the Transporter has made it easy to populate the database for that development.
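To give a feel for the kind of transformation described above, here is a minimal sketch of a function that keeps only the fields "we actually care about" from a source document. It is written in Python purely for illustration; the Transformer functions shown in the webinar are JavaScript, and the field names used here (user, status, votes) are hypothetical rather than taken from Kurt's tool.

```python
# Hypothetical list of fields worth copying to the destination database.
WANTED_FIELDS = ("_id", "user", "status", "votes")


def transform(doc):
    """Return a copy of doc containing only the wanted fields.

    Fields missing from the source document are simply skipped, so the
    function never raises on sparse documents.
    """
    return {key: doc[key] for key in WANTED_FIELDS if key in doc}


if __name__ == "__main__":
    source = {
        "_id": 1,
        "user": "kurt",
        "status": "porting to RethinkDB",
        "internal_notes": "not needed downstream",
    }
    print(transform(source))
```

The point of such a function is that the destination only ever sees the trimmed document, which keeps the target schema small and predictable.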

Containers and Management at Compose

Slava said it was "pretty cool" how easy it is to make this run and asked how the Compose infrastructure worked, especially the virtualisation. Compose runs with vanilla LXC (Linux containers), Kurt said, noting, "We've actually been doing containers since a lot longer before Docker existed". Scaling in Compose works by altering cgroups values based on database activity, so as your data grows, more resources are allocated. In the future it will be able to react to more stimuli, such as a process's failure to get more memory, and rescale in response. All of that is orchestrated with custom management software which can administer operations like rolling version restarts, something that isn't in RethinkDB yet, by knowing how to stop, update and restart the different parts of a master/slave cluster.
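As a rough illustration of scaling resources with data size, the sketch below derives a memory limit from the amount of data stored. Everything in it is an assumption made for the example: the step size, the one-step-per-gigabyte rule and the function name are hypothetical, and it says nothing about how Compose's own management software actually computes or applies cgroups values.

```python
GIB = 1024 ** 3

# Hypothetical allocation step: memory granted per started GiB of data.
MEMORY_STEP_BYTES = 100 * 1024 ** 2


def memory_limit_for(data_bytes):
    """Return a memory limit in bytes for a database holding data_bytes.

    One step is granted for every started GiB of data, with a minimum
    of one step so that an empty database still gets some memory.
    """
    if data_bytes < 0:
        raise ValueError("data_bytes must not be negative")
    started_gib = -(-data_bytes // GIB)  # integer ceiling division
    return max(1, started_gib) * MEMORY_STEP_BYTES


if __name__ == "__main__":
    for size in (0, GIB, 3 * GIB + 1):
        print(size, memory_limit_for(size))
```

A management process could periodically recompute such a value and write it to the container's cgroup; that step is deliberately left out here.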

As LXC doesn't have a nice command line API, Compose has developed its own internal tool. Since Docker started as a nice command line for LXC, Kurt noted that Compose's tools were now named, tongue firmly in cheek, Focker. Focker is where all the tools are bundled up and, unlike Docker, which focuses on packaging, it concentrates on resource allocation and management. The container instances are then connected via Open vSwitch, which allows each customer to have their own private VLAN for their database.

Slava wanted to know how this worked in a distributed way, noting his experience in working with AWS. Compose actually runs with a sort-of-centralised broker for nodes, because the management processes are not so "super-intensive" that they require distribution. The broker is focussed on Compose's own tasks, though generic alternatives like Fleet and Google's Kubernetes exist. Early experience with AWS and MongoDB had shown Compose that customers were waiting a long time to see databases come online. To get the database up faster, Compose makes the database UI accessible as soon as there's enough information to talk to the cluster rather than waiting for it to be fully provisioned.

Kurt also talked about how much of the processing was done on Compose's own physical hosts in Virginia, pre-provisioned weeks in advance, how that pre-provisioning was anticipated and the challenges for Compose in balancing the allocation of capsules around the hosts. An example he gave was that if Compose were to run Redis, the provisioning system would put the Redis capsules on memory-heavy hosts, while other databases (RethinkDB, MongoDB and Elasticsearch) are provisioned with I/O operations as the priority, and access capsules, such as the SSH and HAProxy capsules, go onto high-CPU hosts.
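The placement rule Kurt described can be sketched as a simple lookup from capsule type to host class, followed by picking the least loaded matching host. The mapping mirrors the example above; the host records, the "least capsules wins" tie-break and all names are hypothetical and only stand in for whatever Compose's provisioning system really does.

```python
# Capsule type -> preferred host class, following the example in the talk.
HOST_CLASS_FOR = {
    "redis": "memory",
    "rethinkdb": "io",
    "mongodb": "io",
    "elasticsearch": "io",
    "ssh": "cpu",
    "haproxy": "cpu",
}


def pick_host(capsule_type, hosts):
    """Return the name of the least loaded host of the preferred class.

    hosts is a list of dicts with "name", "class" and "capsules" keys
    (a hypothetical record layout used only for this sketch).
    """
    wanted = HOST_CLASS_FOR[capsule_type]
    candidates = [h for h in hosts if h["class"] == wanted]
    if not candidates:
        raise LookupError("no %s host available for %s" % (wanted, capsule_type))
    return min(candidates, key=lambda h: h["capsules"])["name"]


if __name__ == "__main__":
    hosts = [
        {"name": "mem-1", "class": "memory", "capsules": 4},
        {"name": "io-1", "class": "io", "capsules": 9},
        {"name": "io-2", "class": "io", "capsules": 3},
        {"name": "cpu-1", "class": "cpu", "capsules": 12},
    ]
    print(pick_host("rethinkdb", hosts))
    print(pick_host("haproxy", hosts))
```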

Transporter and new databases

The discussion then moved to Transporter. Slava asked how hard it was to have a robust version of such a system online. Transporter is the third iteration of Compose data transfer technology, said Kurt, who recapped the history of its development. It began with Seed, a tool used internally to move MongoDB data from an unsharded to a sharded database, written initially in Ruby and then in Go. From there, Transporter was developed in Go and designed to be more robust. Working with Elasticsearch's pseudo-dynamic typed schemas was exceptionally difficult though, because Elasticsearch makes weak assumptions about types, which is a "bad surprise" for any developer who encounters it.

Slava was interested to find out more about how Compose brought new databases online. Kurt mentioned that there were three other databases running internally that "we'll probably release". The big priority with any database is replication, because cloud customers don't accept losing access, so Compose focuses on how well replication works and which scenarios are the minimum safe, effective configurations. Each database replicates differently and Compose has learnt that it is important to understand each database's nature. Figuring out how to do effective, non-intrusive backups is another thing that has to be established for any new database coming online. Finally, handling drivers, driver failover and database access, and learning how to support developers' driver problems, is the last component needed to bring a database online at Compose.

Network and business models

Slava and Kurt went on to discuss the "Call me maybe" problems that can occur in distributed databases. Slava said a lot of people, and database vendors, say that the problems mentioned there don't turn up. Kurt gave examples from practice, including some early MongoDB replication issues with secondary indexing and DNS failures that could result in "not-really a net-split proper" problems, and described how close the "Call me maybe" articles are to reality. With things like EC2, many database vendors were being exposed to issues that they had not anticipated from traditional physical configurations.

The conversation then turned to business models around database provision on the internet, with Slava enumerating various ways to get a database up and running. Comparisons with Firebase and others were rare, said Kurt, noting that Compose's ability to host and manage a database is more often compared with a developer spinning up their own EC2 database instance. What Compose offered was experience in running such things while maintaining the positive attributes of open source databases for the customer, something that wasn't an attribute of Amazon's proprietary DynamoDB. Despite the massive scale offered by proprietary products, more people will be running open source as opposed to service-oriented databases because it's "the lowest friction way to get technology into the enterprise", he added.

"It's a running joke in the Big Data space that you could solve ninety per cent of Big Data problems on a MacBook Air", added Slava, noting that people hedge for risk more. Predictability of open source databases at scale was also much easier now, Kurt pointed out. Vendor-specific databases could make sense, though, when talking about things like HIPAA compliance, where the problem isn't which database to use but achieving that compliance. For many developers though, open source databases are going to be their point of entry.

Slava wrapped up the discussion there, promising more webcasts, webinars and blog postings in the future from RethinkDB and Compose.