Recently, Compose co-founder Kurt Mackey and RethinkDB co-founder Slava Akhmechet sat down with an audio recorder and talked databases and queries.
Note: The audio for this is no longer available, but we preserve the notes for reference.
ComposeCast Beta #1 Notes
It's all about the chains
The discussion began with the foundational goals of RethinkDB and how its query language developed. Slava talked about how jQuery's function chaining inspired RethinkDB's query language, ReQL: building queries as chains of selectors creates an intuitive data flow. He also discussed how the protocol beneath ReQL had made writing drivers harder, and how they've since simplified it to make life easier for community driver developers. This led into a look at RethinkDB's lambda functionality, which pushes query functions into the database.
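The chaining idea can be sketched in a few lines of Python. This is a toy query builder, not the real RethinkDB driver: the class name `Query` and its methods are hypothetical, but they illustrate how each chained call adds a stage and nothing executes until `run()` is called.

```python
class Query:
    """Toy chainable query builder, loosely in the spirit of ReQL.

    Hypothetical sketch: not the real RethinkDB driver API.
    Each chained call returns a new Query; nothing runs until run().
    """

    def __init__(self, rows, steps=None):
        self.rows = rows
        self.steps = steps or []

    def filter(self, pred):
        return Query(self.rows, self.steps + [("filter", pred)])

    def map(self, fn):
        return Query(self.rows, self.steps + [("map", fn)])

    def limit(self, n):
        return Query(self.rows, self.steps + [("limit", n)])

    def run(self):
        # Execute the stages in chain order, so the data flow
        # reads top to bottom, one stage per call.
        out = self.rows
        for op, arg in self.steps:
            if op == "filter":
                out = [r for r in out if arg(r)]
            elif op == "map":
                out = [arg(r) for r in out]
            elif op == "limit":
                out = out[:arg]
        return out


users = [{"name": "ann", "age": 34}, {"name": "bo", "age": 19}]
adults = Query(users).filter(lambda u: u["age"] >= 21) \
                     .map(lambda u: u["name"]) \
                     .run()
# adults == ["ann"]
```

Because the lambdas travel with the query rather than running client-side, a real database can ship them to the server, which is the "pushing query functions into the database" point above.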
Kurt moved the discussion on to the problem of abstraction in frameworks like ActiveRecord, and how such frameworks can make developers comfortable writing queries that cause problems when they hit the actual database. That's less of a problem for RethinkDB, said Slava, because it makes people think more and be more explicit about their queries. "You get an intuitive sense of how things get executed", he said: you can see how the data flows through the query, rather than letting a library obscure how many joins the database is going to need to do. "In realtime systems you almost always want some degree of explicitness" to get predictability in queries, he noted, and the ReQL query language is all about intuitive explicitness.
Query or update or?
This led into a discussion of how query and update operations are unified within RethinkDB, rather than treated, as some databases do, as similar but separate operations. Slava explained that query operations return sets of results as selection-typed data, and selections are inherently updatable. When an operation creates an aggregation, projection or join, the results are called streams, and streams aren't updatable.
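The selection/stream distinction can be modelled in miniature. The sketch below is hypothetical pure Python, not RethinkDB's implementation: a selection keeps each row's primary key, so results can be written back; a mapped result loses row identity and so has no `update` method.

```python
class Stream:
    # Toy stream: derived values with no row identity,
    # so there is nothing to write back (no update method).
    def __init__(self, items):
        self.items = items


class Selection:
    # Toy selection: rows keep their primary keys, so the
    # result set can be written back to storage.
    def __init__(self, table):
        self.table = table  # dict: primary key -> row document

    def filter(self, pred):
        # Filtering keeps keys: the result is still a Selection,
        # and therefore still updatable.
        return Selection({k: v for k, v in self.table.items() if pred(v)})

    def update(self, patch):
        # Updatable because we still know which stored rows these are.
        for row in self.table.values():
            row.update(patch)

    def map(self, fn):
        # Mapping yields arbitrary derived values with no keys: a Stream.
        return Stream([fn(v) for v in self.table.values()])


people = Selection({1: {"name": "ann", "vip": False},
                    2: {"name": "bo", "vip": False}})
people.filter(lambda r: r["name"] == "ann").update({"vip": True})
names = people.map(lambda r: r["name"])  # a Stream: no .update()
```

The filtered selection shares row objects with the original table, so the `update` is visible in storage, which is exactly why a projection or aggregation (which manufactures new values) can't offer the same guarantee.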
The conversation moved on to the idea of treating database query languages as set theory, and Slava revealed that this is a guiding principle of the ReQL design process. One complication, though, is sharding, and Slava outlined how RethinkDB handles a cross-shard query and update operation by analysing the query and routing its parts to the appropriate shards.
Kurt compared the focus on ensuring data locality in Spark with how RethinkDB handles its queries and updates, and asked whether aggregations can be executed locally on the shards. All the built-in aggregators can execute locally except for average, which needs more care: each shard calculates a sum and a count and passes them back to the original node, which combines the per-shard results into the final average.
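The per-shard average trick described above is easy to show concretely. This is a minimal pure-Python sketch of the idea, not RethinkDB code; the function names are invented for illustration.

```python
def shard_partials(shard, field):
    """Runs locally on each shard: return (sum, count) for the field."""
    values = [row[field] for row in shard]
    return sum(values), len(values)


def distributed_avg(shards, field):
    """Runs on the coordinating node: combine per-shard partials.

    Averaging the per-shard averages directly would weight shards
    with few rows too heavily; combining sums and counts instead
    gives the exact global average.
    """
    partials = [shard_partials(s, field) for s in shards]
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


shards = [[{"x": 2}, {"x": 4}],  # shard 1
          [{"x": 6}]]            # shard 2
distributed_avg(shards, "x")  # (2 + 4 + 6) / 3 == 4.0
```

Note that naively averaging the shard averages here would give (3.0 + 6.0) / 2 = 4.5, the wrong answer, which is why the sum-and-count round trip is needed.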
Do it here, there or everywhere?
The subject of spreading load over the shards led on to how RethinkDB's map/reduce has evolved: from a single method, GroupMapReduce, which took three functions, into a chained set of methods which can be run independently or combined into a powerful whole. Slava detailed the limits on this process, namely ensuring that the operations are safe to run distributed across shards.
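A toy version of the original three-function style makes the shape clear. This is an illustrative pure-Python sketch, not the RethinkDB implementation; the safety constraint mentioned above shows up as a requirement on the reduce function.

```python
def group_map_reduce(rows, group_fn, map_fn, reduce_fn):
    """Toy GroupMapReduce: group rows, map each row to a value,
    then fold the values per group with a binary reduce.

    For this to run safely split across shards, reduce_fn must be
    associative and must not depend on the order rows arrive in,
    since each shard reduces its part before results are combined.
    """
    groups = {}
    for row in rows:
        groups.setdefault(group_fn(row), []).append(map_fn(row))
    out = {}
    for key, vals in groups.items():
        acc = vals[0]
        for v in vals[1:]:
            acc = reduce_fn(acc, v)
        out[key] = acc
    return out


orders = [{"city": "NY", "total": 10},
          {"city": "NY", "total": 5},
          {"city": "SF", "total": 7}]
# Revenue per city: group by city, map to total, reduce with addition.
group_map_reduce(orders,
                 lambda r: r["city"],
                 lambda r: r["total"],
                 lambda a, b: a + b)  # {"NY": 15, "SF": 7}
```

Splitting this into independently chainable group, map and reduce stages, as RethinkDB later did, lets each stage be used on its own or recombined as needed.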
Asked by Kurt how many people use the map/reduce functionality, Slava responded that it is often used behind the scenes, which makes it hard to work out how many users actually rely on it, explicitly or not. He is fairly confident that most common tasks which need map/reduce are already handled, and that RethinkDB's goal is that people should never need to call on map/reduce explicitly, because a higher-level tool covers the task for them. Map/reduce is a "wonderful piece of infrastructure", said Slava, but "you don't want to be writing for it" as it's "not a great way to express your queries".
Recalling the Phil Karlton quote "There are only two hard things in Computer Science: cache invalidation and naming things", Kurt and Slava considered how people are rediscovering and reinventing older ideas, like abstracting complexity away into higher-level tools, and relearning that understanding the underlying system helps you get the most from the first iterations of those tools.
Time series tunnels
Consideration then turned to whether RethinkDB could be as good a time-series database as the dedicated time-series databases out there. Slava thinks dedicated databases tend to be better, but that in most cases people don't use one, instead pushing their business data alongside, say, a time series of Apache logs within the same database. He is confident that RethinkDB can usually cover 90% of these use cases.
He used binary blob storage as an example of a RethinkDB feature customers are "super happy" with: it won't store massive video files, but for most users it's enough. Initially, though, the RethinkDB developers resisted implementing it because of the potential negative impact it could have. The same goes, he says, for the geospatial functions, which cover 95% of mobile use cases. If you want more than that, you are probably going to want a dedicated solution. Kurt noted a parallel issue for Compose: working out when customers are going to need to switch to a dedicated solution.
For his example, Kurt picked graph problems, saying that, deep down, many systems have a graph problem at their core. Many people, though, end up writing odd code to map the graph onto a general-purpose database without realising what kind of problem they have.
Slava concurred, noting that RethinkDB's boundaries are pushed by graph problems and by people stretching map/reduce to do machine learning or build classifiers. You can do both, but not as well as with a system built for those tasks. "People tend to push things surprisingly far in directions where it doesn't make sense", he added. Kurt pointed out that these boundary-pushing solutions often move from working to failing very quickly at some arbitrary point of data growth.
Featuritis and Fine Features
Slava and Kurt moved on from there to discuss how features get added to a database. Noting that developing RethinkDB has meant learning about and then solving a vast array of customer needs, Slava is more aware than ever of how careful one has to be when adding features. He picked out the MongoDB model of feature expansion as an example of bad practice, in that its developers often add features which work well on a single node but don't scale out well: "We try not to do that".
An example of a problematic feature given by Kurt was MongoDB's unique indexes, which are difficult to shard however you approach it. Features that work on one node are very convenient, but for RethinkDB, if a feature works on one node, it must also work on many nodes. One example is changefeeds, which allow a client to subscribe to changes on a table: simple to implement on a single node, very difficult on multiple nodes. That, says Slava, was a real challenge but worth doing properly, even if it took "way longer". "Following changes in data is something everyone wants from their databases now", noted Kurt, picking out Elasticsearch and Redis as databases where this is hard to do. Slava said that among the changes coming to changefeeds in RethinkDB are updates based on changes in aggregation results; that feature should land before Thanksgiving.
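The single-node half of the changefeed story is simple enough to sketch. Below is a hypothetical pure-Python model, not RethinkDB code; the `{old_val, new_val}` document shape mirrors the one RethinkDB's change documents use. The multi-node case is what made the feature hard: changes happening on every shard must be merged into one coherent stream per subscriber.

```python
class Table:
    """Toy single-node table with a changefeed.

    Hypothetical sketch: subscribers receive {old_val, new_val}
    documents for every write, mirroring the shape of RethinkDB's
    change documents. On one node this is just a list of observers;
    across shards each node's changes must be merged into one feed.
    """

    def __init__(self):
        self.rows = {}
        self.subscribers = []

    def changes(self):
        # Subscribe: return a feed that future writes are pushed onto.
        feed = []
        self.subscribers.append(feed)
        return feed

    def insert(self, key, row):
        # Upsert a row, then notify every subscriber of the change.
        old = self.rows.get(key)
        self.rows[key] = row
        for feed in self.subscribers:
            feed.append({"old_val": old, "new_val": row})


t = Table()
feed = t.changes()
t.insert("a", {"x": 1})   # feed gets {"old_val": None, "new_val": {"x": 1}}
t.insert("a", {"x": 2})   # feed gets {"old_val": {"x": 1}, "new_val": {"x": 2}}
```

Notifying an in-process list is trivial; the hard problems Slava alludes to (ordering, delivery, and resumption when changes originate on many shards) are exactly what doesn't appear in this sketch.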
Kurt and Slava finished the chat discussing where Hadoop fits into either company's plans and how Compose hopes to have Transporter talking to other people's databases in the future.
And that's the first ComposeCast Beta release.