Getting started with Elasticsearch and Node.js - Part 2

In the first article in this series we created a Compose Elasticsearch deployment, created an index, added some documents and took our first look at using Elasticsearch to search in those documents. At the end of the article we left you with a question - why did a query that looked like it was trying to match "Ipswich North" return hits that didn't match the search phrase?

The answer lies in how Elasticsearch handles text in fields as they are added to the index and made ready for searching. And that's what this article will look at.

Mappings

Unless we tell it otherwise, Elasticsearch treats the text in our constituencyname field as a string and passes it through an analyzer before it is indexed. The same analyzer is then applied to the query string in your search. Without going into too much detail at this stage, what's happening is that Elasticsearch converts the string into a series of tokens - typically each word in the string becomes a token - and filters out what are known as stop words: words that appear frequently in text but carry little meaning on their own. In English this typically means prepositions, conjunctions and articles.

In terms of our constituencyname field this means that for the 'Central Suffolk and North Ipswich' document, we're left with a search index containing entries for four words: 'Central'; 'Suffolk'; 'North'; 'Ipswich'. Likewise, our search term is analyzed and Elasticsearch determines that we are looking for entries containing the word 'North', the word 'Ipswich', or both words. It then ranks all the positive results according to how northy and ipswichy they are. If there were a constituency with the exact name of 'North Ipswich', it would get the highest score because it contains all the words in our search term and none that aren't.
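If you want to see this tokenization for yourself, the _analyze API will show you exactly which tokens a piece of text produces. Here's a minimal sketch using the Node.js client - the script itself is our own addition to the series, and exactly which tokens come back depends on the analyzer your version of Elasticsearch uses by default:

var client = require('./connection.js');

// Ask the 'gov' index's default analyzer to tokenize a sample string
client.indices.analyze({
  index: 'gov',
  text: 'Central Suffolk and North Ipswich'
}, function (err, resp) {
  if (err) {
    console.log(err);
  }
  else {
    // Each entry in resp.tokens is a term that would end up in the index
    resp.tokens.forEach(function (t) {
      console.log(t.token);
    });
  }
});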

We don't need to go into the details of how scoring works in this article, but if you want to learn more you'll find an excellent assessment in another of our articles, How scoring works in Elasticsearch.

To search for exact terms we need to tell Elasticsearch to not analyze the constituencyname field. To do this we need to define mappings. In Elasticsearch, mappings say what sort of data you are indexing. If you don't specify any mappings, when Elasticsearch encounters a field it hasn't indexed before it will do its best to figure out what kind of data the field contains (this is called 'dynamic mapping'), and will select accordingly from one of its core field datatypes.

Defining mappings

Let's tell Elasticsearch not to analyze the constituencyname field in our constituencies type. To do this, we set the field's index attribute to 'not_analyzed'. When the index attribute is specified, type is also required, so we'll set that to 'string' since we expect the field to contain text strings. Create a new file and add the following:

var client = require('./connection.js');

client.indices.putMapping({  
  index: 'gov',
  type: 'constituencies',
  body: {
    properties: {
      'constituencyname': {
        'type': 'string', // type is a required attribute if index is specified
        'index': 'not_analyzed'
      },
    }
  }
},function(err,resp,status){
    if (err) {
      console.log(err);
    }
    else {
      console.log(resp);
    }
});

What this says to Elasticsearch is 'treat the constituencyname field as a string, but don't analyze it'. Save the file as mappings.js and run it.

Oops. You should see a nasty-looking RemoteTransportException error which includes some details of the problems with the constituencyname field:

MergeMappingException[Merge failed with failures {[mapper [constituencyname] has different index values, mapper [constituencyname] has different tokenize values, mapper [constituencyname] has different index_analyzer]}]  

The problem is that we have already defined a mapping for this field, or rather we've already let Elasticsearch define one for us when we indexed the constituencies data in part 1. In Elasticsearch you can't simply redefine a mapping for an existing field, but you can see what mappings Elasticsearch has defined with getMapping. Create a new file, getmappings.js, and add the following:

var client = require('./connection.js');

client.indices.getMapping({  
    index: 'gov',
    type: 'constituencies',
  },
function (error,response) {  
    if (error){
      console.log(error.message);
    }
    else {
      console.log("Mappings:\n",response.gov.mappings.constituencies.properties);
    }
});

Run getmappings.js and Elasticsearch will return information on the existing mappings for your fields:

Mappings:  
 { ConstituencyID: { type: 'string' },
  ConstituencyName: { type: 'string' },
  ConstituencyType: { type: 'string' },
  Electorate: { type: 'long' },
  ValidVotes: { type: 'long' },
  constituencyID: { type: 'string' },
  constituencyname: { type: 'string' },
  constituencytype: { type: 'string' },
  country: { type: 'string' },
  county: { type: 'string' },
  electorate: { type: 'long' },
  region: { type: 'string' },
  regionID: { type: 'string' },
  validvotes: { type: 'long' } }

We can see that most of the fields were indexed as strings, which makes sense, while electorate and validvotes were indexed as long, which is the default field type Elasticsearch chooses for whole numbers.

To change how these fields are indexed we need to delete the index and start again. This does unfortunately mean undoing some of our work from part 1, but it helps to emphasise the importance of planning and testing before indexing documents. With the right mappings in place our documents will be ready for us to do exactly what we have planned for later.

To start over, run delete.js, which you created in part 1, to clear out the constituencies type, then run create.js to recreate the index, and ready yourself for some mapping work.
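If you no longer have those two scripts to hand, the calls they make boil down to something like the following. This is only a sketch - it assumes delete.js simply drops the whole gov index, so check it against your own files from part 1 before running anything:

// delete.js - drop the 'gov' index, and with it the existing mappings
var client = require('./connection.js');

client.indices.delete({
  index: 'gov'
}, function (err, resp) {
  console.log(err || resp);
});

// create.js (a separate file) - recreate an empty 'gov' index
var client = require('./connection.js');

client.indices.create({
  index: 'gov'
}, function (err, resp) {
  console.log(err || resp);
});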

So let's have another go at defining the mappings. Before we do that, though, do we really want those numeric fields stored as long? Not that it will make a significant difference to our index, but long seems excessive when we're storing numbers that correspond to the size of a parliamentary constituency. Let's store those as integers instead (the short type, which tops out at 32,767, isn't quite large enough for an electorate). Add two more mappings underneath constituencyname like so:

'electorate': {  
  'type': 'integer'
},
'validvotes': {  
  'type': 'integer'
}
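Putting that together, the body we pass to putMapping in mappings.js should now look something like this:

body: {
  properties: {
    'constituencyname': {
      'type': 'string',
      'index': 'not_analyzed'
    },
    'electorate': {
      'type': 'integer'
    },
    'validvotes': {
      'type': 'integer'
    }
  }
}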

Now run mappings.js again, and then take a look at your new constituencyname mapping with getmappings.js. If all is well you'll get a response containing the mappings you've just defined:

Mappings:  
 { constituencyname: { type: 'string', index: 'not_analyzed' },
  electorate: { type: 'integer' },
  validvotes: { type: 'integer' } }

Finally, run constituencies.js again to re-index your documents with their new mappings. (If you want to, you can check your document count again by running info.js).

Searching non-analyzed fields

Let's run our search from Part 1 again, using search.js, and setting the search term as "North Ipswich". How many hits do we get now? A big fat zero is how many, because we are now looking for fields that exactly match the search phrase. There is no constituency of North Ipswich, so Elasticsearch returns zero results. We need to search for "Central Suffolk and North Ipswich" to return a hit for the relevant constituency.
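In case your search.js has drifted since part 1, the query we're running is along these lines. This is a sketch that assumes a simple match query against constituencyname, so adapt it to whatever your part 1 version looks like:

var client = require('./connection.js');

client.search({
  index: 'gov',
  type: 'constituencies',
  body: {
    query: {
      // Against a not_analyzed field, the whole phrase must match exactly
      match: { 'constituencyname': 'Central Suffolk and North Ipswich' }
    }
  }
}, function (err, resp) {
  if (err) {
    console.log(err);
  }
  else {
    console.log("Hits:", resp.hits.total);
  }
});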

Now let's search for "Ipswich". This time we should expect exactly one hit, for the constituency named simply 'Ipswich'. If we get it, our mappings have worked as intended. (If this search doesn't return a hit, use info.js to check the status of your Elasticsearch deployment, and try recreating the index and mappings again if you see any problems).

It might seem quite unusual to define a text field in such a way that it requires an exact match; what we've effectively done is turn constituencyname into an identifier. That might seem odd when we already have a potential identifier field in ConstituencyID, but we're doing this because later in this series we're going to connect our constituencies data to a postcode lookup service. The lookup will return a constituency name as a result, which we'll match against the constituencies in our index.

One thing we won't be able to do now is allow users to perform text searches on the constituencyname field. That's OK, because that's not a use-case for the application we're working towards in these articles. Before you start mapping fields as not_analyzed, think about whether you will want to run text searches against those fields: if so, it's better to let Elasticsearch analyze the field.

Next

In the next article in the series we'll obtain and index the petitions themselves, and we'll look at some more advanced mapping concepts.