Porting a MongoDB Full Text Search App to Elasticsearch

A few weeks ago we showed how you could use MongoDB's full text search facility to index and search a field with scored results. As we've just launched Elasticsearch, we thought porting the simple web application created in that article to Elasticsearch would be a good place to start. So, let's dive in and get porting. If you want to follow along, copy the code from the MongoDB example at ftslite into a clean directory and we can begin.

The first thing we need is an Elasticsearch library for Node.js; the good folks at Elasticsearch produce elasticsearch.js, which is a full implementation of the Elasticsearch API. You can install it with npm install elasticsearch --save in your application's directory. With that in place, we can open index.js for editing.

The first change we'll make is in the require section. Where the application requires MongoDB...

var mongodb = require('mongodb'),  
  MongoClient = mongodb.MongoClient;

... we will replace it with a require for the elasticsearch package:

var elasticsearch = require('elasticsearch'),  
  esClient = new elasticsearch.Client({ host: process.env.ES_URL } );

This is slightly different as it sets up the client with the connection URL, which we pull from the environment variable ES_URL. In the original code, the connection is made later on, but with Elasticsearch there is no persistent connection: because it is a REST API, a connection to the database is made on every call. That ES_URL can be obtained from the Elasticsearch Overview dashboard, in the Connect Strings section. The username and password are generated under the Users tab. If you're new to Elasticsearch, you'll find that the database itself has no user-based security but is in fact protected by a proxy; the username and password are actually for that proxy rather than the database. Do remember to set the ES_URL variable before running the application.
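For example, in a Unix-like shell that might look like the following; the host, port, username and password here are placeholders for the values from your own dashboard:

export ES_URL="https://username:password@hostname.example.com:10374/"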

Start it up

The next change we need to make is connecting to the database and checking that our collection exists. We don't need to make a persistent connection for Elasticsearch so we can skip that. We do, though, need to check that what Elasticsearch calls an index, roughly analogous to a collection, exists. For that we call on the API's indices.exists() method...

esClient.indices.exists({  
  index: "textstore"
}, function(err, response, status) {

The first argument is a hash of parameters, in this case just one, the name of the index. That's a parameter specific to this particular API request, but there are also generic parameters common to all calls in the API. Another thing that's common across the API is what is passed to callbacks: an error (if there was one), the response and the HTTP status code. It's the status code we're interested in here, because if the index doesn't exist it will be set to 404, and if it does, 200...

  if (status == 404) {
    esClient.indices.create({
      index: "textstore"
    }, function(err, response, status) {
      if (err) {
        console.log(err);
        process.exit(1);
      }
      console.log("Created index " + JSON.stringify(response));
    });
  }
  app.listen(3000);
});
If the index doesn't exist, we call indices.create(), passing it a hash with the index's name set. In the callback for that, we check the returned error and, if one is set, print it and exit. Otherwise, we announce the index has been created. Finally, whichever path we took, we start the Express web server listening on port 3000.

Adding a document

Next up, the code for adding a document. In the MongoDB example we used a single text area in a form to gather a block of text, which we then placed in the collection's full text search indexed field. With Elasticsearch we don't have that one-field limit: the whole record is indexed for search, and searching can be optimised by adding information about the type of data in a particular field. This makes adding our field pretty simple when the user posts a form:

app.post("/add", function(req, res) {  
  esClient.create({
    index: 'textstore',
    type: 'basicdoc',
    body: {
      document: req.body.newDocument,
      created: new Date()
    }
  }, function(err, result, status) {
    if (err == null) {
      res.sendfile("./views/add.html");
    } else {
      res.send("Error:" + err);
    }
  });
});

The meat of the method is the call to create(). Again, it takes a hash of parameters, the first parameter, index, being the name of the index in which we want to create the document. We then give a type to the document, which, in this case, will be dynamically mapped by Elasticsearch. For now, we'll call our type "basicdoc". The next parameter in the hash is the body, the document itself. This is a JSON document: we place the text content in a field called document and the date in a field called created. The body parameter is a generic parameter in the JavaScript API and is used by many API methods. Its contents are sent to the server as the data content of the POST request being made.

Things we aren't setting include the document id, which will be generated by Elasticsearch because this call, by default, uses the HTTP POST method at the REST API level. The REST API uses the HTTP method to discriminate: PUTs for operations with an id and POSTs for those without.
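If we did want to control the document id ourselves, we could pass an id parameter in the hash; here's a minimal sketch (the id value is just an illustration), which the client would send as a PUT rather than a POST:

esClient.create({
  index: 'textstore',
  type: 'basicdoc',
  id: 'my-first-document',  // supplying our own id; illustrative value
  body: {
    document: 'Some text to index',
    created: new Date()
  }
}, function(err, response, status) {
  if (err) {
    console.log(err);
  }
});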

In the callback to the create, if there's no error we send back the add.html page; otherwise we display the error. And that's it for handling the add. Now we can do the searching.

Searching for terms

We use another simple form here to gather up a set of search terms which are then POSTed. We'll pick up the code where it handles that post.

app.post("/search", function(req, res) {  
  esClient.search({
      index: 'textstore',
      body: {
query: {  
  match: {
    document: req.body.query
  }
}
      }
    },
    function(err, response, status) {
      res.send(pagelist(response.hits.hits));
    });
});

As you have probably guessed, searching involves the search() API method, and again we are passing it a hash of parameters. Despite appearances, there are only two parameters being passed, the index and the body. In this case, the body is a JSON document defining the search. We are running the simplest possible search, just looking for the best scoring matches against the document field, so it's a query: looking for a match: in the document: field with the query text from the form. Queries use a rich DSL with numerous options to control the search; the match operator alone includes options for tuning how fuzzy the search is, how the text will be analysed, setting minimum match requirements or frequency cut-offs, or forcing the query terms to be handled as a phrase. You may notice that we've specified the document field in the query. If we wanted to search all the fields in the stored JSON document, we could change the field name to _all.
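To give a flavour of those options, here's a sketch of the same match query with a couple of them set; the operator and fuzziness values are illustrative choices rather than anything the application requires:

body: {
  query: {
    match: {
      document: {
        query: req.body.query,  // the search terms from the form
        operator: 'and',        // require every term to match, not just any
        fuzziness: 'AUTO'       // tolerate small spelling differences per term
      }
    }
  }
}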

When we get the result in the callback to the search there's a lot of data to work through. Here's an example:

{
  took: 6,
  timed_out: false,
  _shards: {
    total: 5,
    successful: 5,
    failed: 0
  },
  hits: {
    total: 1,
    max_score: 0.109375,
    hits: [{
      _index: 'textstore',
      _type: 'basicdoc',
      _id: 'VQGu5g0vTOub1W9jw_ONFg',
      _score: 0.109375,
      _source: {
        document: 'Sometimes ..... the problem.\r\n',
        created: '2014-08-04T10:48:32.316Z'
      }
    }]
  }
}

This response starts with took, the milliseconds it took to execute the search, timed_out, which tells us if the search timed out and _shards, which reports on how the query ran across the database's shards. Then comes the hits hash which contains some more statistics (the total number of results and the highest matching score in the results) and finally a hits array which carries those matching records. It's this array we pass on to the page formatting function with pagelist(response.hits.hits).

function pagelist(items) {
  // Build a simple HTML list from the array of hits
  var result = "<html><body><ul>";
  items.forEach(function(item) {
    // Each hit becomes a list item with its score, date and text nested below
    // (note: the document text is written out unescaped here for brevity)
    var itemstring = "<li>" + item._id +
      "<ul><li>" + item._score +
      "</li><li>" + item._source.created +
      "</li><li>" + item._source.document +
      "</li></ul></li>";
    result = result + itemstring;
  });
  result = result + "</ul></body></html>";
  return result;
}

That function merely steps through the hits and writes out some HTML. Each hit includes _index, the index it came from, and _type, the document's type, because searches can cross over indexes and types. It then carries the _id of the document and its _score in the matching. Finally, there's the _source hash, which carries the fields requested from the record; in this case, all of them, document and created. We just print the id, score, created date and the content of the document.

If you run this application and put a few dozen documents into it, you'll notice you only get ten results maximum from any query. That's because in the query we omitted the from and size parameters, which allow for pagination through the data and default to 0 and 10 respectively.
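To page through more results, we could set those parameters in the search body; a minimal sketch (the values are arbitrary) would look like this:

esClient.search({
  index: 'textstore',
  body: {
    from: 10,  // skip the first ten hits
    size: 10,  // then return up to ten more
    query: {
      match: {
        document: req.body.query
      }
    }
  }
}, function(err, response, status) {
  // handle the response as before
});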

Wrapping up

With the changes in place and your ES_URL environment variable correctly set, running node index.js should bring up a server on localhost port 3000 where you can add and search documents. If you didn't follow along, the code is available from a github repository. We were only porting a MongoDB full text search application here, so we've barely touched Elasticsearch's type mapping and search capabilities.

You should also see that although the rich Elasticsearch REST API is wrapped up by the Elasticsearch JavaScript library, it is worth being familiar with both. The abstraction over that REST API is leaky: HTTP elements like status values can be returned where you may expect true/false or other enumerations, so regard the REST API documentation as the primary source of truth for all of the various libraries. Knowing that, your journey with Elasticsearch and Node.js will become a lot easier as you start to harness the search power of Elasticsearch.