Elasticsearch at Compose - How It Fits

As you will see, we've not only changed our name but we've also launched the first of our new hosted database technologies, Elasticsearch. When we at Compose considered what would be joining MongoDB on our roster of databases, we were looking to ensure our customers had all the tools they needed. So why Elasticsearch? Let's talk about MongoDB first and show you why.

Where MongoDB works

MongoDB is great for the majority of a modern application's data storage needs. But, as with all systems, the demands on a system evolve, whether from changes in your applications or from new requirements from users, and it is always worth reviewing how well your technology stack matches those demands.

Take as an example a company that has developed its application stack around MongoDB, making it a fast and effective platform powering the company's growth. At its core, the stored data is a rich warehouse of how the company's products are performing and of the company's own interactions with its customers. There is so much data that people within the company are actively exploring it, looking for new correlations, patterns and techniques to extract more value from it.

The problem is that for these explorers to be effective, they need fast responses to complex, database-spanning queries. The application the company had developed used MongoDB as a fast document store, storing and retrieving incoming data at blazing speed. The architecture was simple, with indexes on order numbers, customer names and other obvious keys.
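
As a sketch of that kind of setup, here's roughly what those simple indexes might look like using pymongo; the collection and field names are hypothetical, not taken from any actual schema:

```python
# A minimal sketch, assuming a hypothetical "orders" collection.
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")
orders = client["shop"]["orders"]

# Index only the obvious lookup keys; each index lives in RAM alongside
# the working set, so every additional index costs memory.
orders.create_index([("order_number", ASCENDING)], unique=True)
orders.create_index([("customer_name", ASCENDING)])
```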

The query demands from the company's data explorers were, though, more nebulous. Ideally, those users would have wanted every field indexed so they could search through everything. But indexing everything wasn't going to be an option: MongoDB needs to keep indexes in memory, alongside the working set, to ensure performance. The more indexes there are, the more memory is consumed, and although you can build a machine with terabytes of RAM, it's a very big investment. There are also other effects of indexing everything (or as much as possible) – inserts become a lot more work because they have to be indexed too, and even with background indexing, the work still has to be done.

At this point, it's worth re-evaluating the problem and the stack. At its heart, what the data explorers want is something with the nature of a search engine: built to index all data generically and, where it's told it can, to derive more semantic value from particular fields such as numbers, dates and so on. Search engines handle queries by scoring possible results rather than boolean filtering, and it's this scoring mechanism which allows for deeper dives into the dataset by way of near matches and fuzzier comparisons.

Bringing in Elasticsearch

By adding search engine technology to the stack, it is not only possible to isolate the complex, opportunistic queries of the data explorers from the production database, it is also a chance to offer those users who need to extract new value from the data set a richer, more nuanced way to explore. And that's where Elasticsearch enters our database architecture.

Elasticsearch saw its first release back in 2010, created by Shay Banon as a scalable search engine replacing his previous search project, Compass. It's built using the Lucene search libraries used by many other search engine projects, like Apache's Solr, but Elasticsearch's design focus is on searching JSON documents for applications. It's worth noting that Elasticsearch's terminology is search-centric, so an index is a collection of JSON documents. As documents are added to the index, a dynamic type is created for the index and automatically updated with any new fields and the implied types of those fields; this is the "schema-free" element of Elasticsearch.
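
To make that terminology concrete, here's a minimal sketch of adding a document over the REST API using Python's requests library; the tweets index, the tweet type and the document's fields are all hypothetical, and indexing the first document is what creates the index and its dynamic mapping:

```python
import requests

# A hypothetical document; field types will be inferred automatically.
doc = {
    "user": "bob",
    "message": "Trying out Elasticsearch at Compose",
    "posted": "2014-08-01T12:00:00",
}

# POSTing to /index/type creates the index on first use and assigns
# the document an auto-generated ID.
resp = requests.post("http://localhost:9200/tweets/tweet", json=doc)
print(resp.json())
```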

String fields are fed, by default, into full-text analysers to create inverted indexes, while apparent numbers, dates and booleans are each indexed in their own way. A simple query of the index for a term will return all matches in all fields and, with suitable analysers configured, synonym and stem matches for that term too. By default, you'll only get the first ten matches, ranked by relevance. If you had an index of tweets and searched it for "Bob", you'd get matches for a user called Bob, messages referring to Bob and so on. But that isn't the end of the search.
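
As a sketch, the simplest form of that query uses the URI search API, which matches the term against every indexed field and returns only the top ten hits by relevance; the tweets index is the hypothetical one from above:

```python
import requests

# Search for "Bob" anywhere in the index; only the ten most relevant
# hits come back unless a size parameter is given.
resp = requests.get("http://localhost:9200/tweets/_search", params={"q": "Bob"})
for hit in resp.json()["hits"]["hits"]:
    print(hit["_score"], hit["_source"])
```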

All queries for Elasticsearch are based around a scoring metric and can look at anything from simple matching of single or multiple fields, fuzzy matching, spans of text and regular expressions to geographic points, ranges and more complex boolean combinations where matches must be, or not be, found. Each returned result comes complete with a relevance score as to how well it matched the query. There's a Query DSL for constructing queries, along with filtering to cut down the records searched, faceted search and an aggregation framework for summarising data values.
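Here's a sketch of what a Query DSL request might look like, combining a fuzzy match with a boolean must_not clause; the index, fields and values are hypothetical, and the misspelling is deliberate to show fuzzy matching at work:

```python
import requests

query = {
    "query": {
        "bool": {
            # Fuzzy matching tolerates the transposed letters.
            "must": [{"fuzzy": {"message": "elasticsaerch"}}],
            # Exclude all documents from this user entirely.
            "must_not": [{"term": {"user": "spambot"}}],
        }
    },
    "size": 20,
}

resp = requests.get("http://localhost:9200/tweets/_search", json=query)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["message"])
```

Every hit carries its `_score`, so the application can decide how deep into the near matches it wants to go.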

This is only the start of Elasticsearch's capabilities. The default handling of field values can be overridden to create more "types", forming what could be viewed as a table in traditional database schema terms. This includes selecting how full-text fields are analyzed – including no analysis at all, forcing matches to be exact – which tokenizers should be used, right down to whether a field is numeric, string or some other type. This in turn means a document could be searched for and presented to the user through a number of mapped types, as well as the index's own dynamic type, depending on the focus of a search.
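
As a sketch of overriding those defaults, here's an explicit mapping supplied when creating a new index, using the string-type mapping syntax of this era of Elasticsearch; the logs index, the entry type and its fields are hypothetical:

```python
import requests

mapping = {
    "mappings": {
        "entry": {
            "properties": {
                # No analysis at all: matches on this field must be exact.
                "status": {"type": "string", "index": "not_analyzed"},
                "response_ms": {"type": "integer"},
                "timestamp": {"type": "date"},
            }
        }
    }
}

# Creating the index with an explicit mapping overrides dynamic typing
# for these fields; any others are still mapped dynamically.
resp = requests.put("http://localhost:9200/logs", json=mapping)
print(resp.json())
```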

Elasticsearch's ability to apply different mappings also makes it a great option for log analysis. By applying different mappings to each kind of log entry, the system can more intelligently search, analyse and aggregate log data without passing the load to an external application or performing a complex extract operation to another table or database.
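
For example, here's a sketch of a date histogram aggregation over that hypothetical logs index, counting entries per day entirely inside Elasticsearch:

```python
import requests

query = {
    "size": 0,  # we only want the aggregation, not the matching documents
    "aggs": {
        "per_day": {
            "date_histogram": {"field": "timestamp", "interval": "day"}
        }
    },
}

resp = requests.get("http://localhost:9200/logs/_search", json=query)
for bucket in resp.json()["aggregations"]["per_day"]["buckets"]:
    print(bucket["key_as_string"], bucket["doc_count"])
```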

There's a lot of power in Elasticsearch, but with great power come some caveats. Most notable is that, out of the box, there's no shell or friendly front end to Elasticsearch. You typically interact with it through a REST/JSON API, which is ideal for applications but less so for command-line interaction. That's why Compose offers a number of Elasticsearch plugins to make interacting with the database easier – Elastic HQ, which offers a rich interface for management and query construction, and Kibana, a web-based interface for creating analytics dashboards.

Composed Elasticsearch

What we've announced this week is that Elasticsearch is now available from Compose as a public beta. That means we've already developed the solution, tested it with selected customers, and are coming to the end of our development cycle. Data placed on Elasticsearch should be safe, and most, if not all, features are stable, though we may take the opportunity to enhance them.

Those features include a three-node architecture using SSD storage, with automatic backups, monitored instances and a full UI that lets you easily manage your Elasticsearch deployments. Our beta service is ready for limited production use in the run-up to a full release later in the year. Pricing starts at $54 for 2GB of storage; each extra gigabyte costs $18 and comes with another 100MB of RAM. We can scale that all the way up to 100GB of storage and 10GB of RAM for $1818.

As you can see, Elasticsearch is a perfect fit for Compose, expanding the database infrastructure we offer so you can focus on delivering great apps. Elasticsearch could be ideal for taking on your application's complex, search-oriented tasks too. So talk to us today to see how it could fit into your stack, or sign up now to experience it for yourself.