Getting started with Elasticsearch and Node.js - Part 3

So far in this series of articles we've been looking at the constituencies dataset and how we can control the way Elasticsearch indexes our data so it works for us.

In part 1 we created our Compose Elasticsearch deployment, then indexed and searched our first documents. In part 2 we started looking at mappings and datatypes.

We'll be exploring these some more in this article using the larger petitions dataset. Before we index it, though, we need to introduce one or two more advanced topics, so let's take a look at those before we begin.

Nested datatypes

The constituencies dataset formed a relatively straightforward set of data. Each document described a constituency, and consisted of a number of fields, each of which had a value. Most were simple strings, two were integers, and the only mapping we had to explicitly give Elasticsearch was for the constituencyname field (although we added some others to reduce the size of some numerical fields as good practice).

The petitions dataset is a bit more complex. As well as text and numeric fields, petitions also include two fields that contain nested data, where the content of the field is not just a piece of data, but a new dataset.

For example, if you look at the petition "Abolish road tax on Motorcycles" you'll find entries in the signatures_by_constituency field like this:

{
  "name": "Central Suffolk and North Ipswich",
  "ons_code": "E14000624",
  "mp": "Dr Poulter MP",
  "signature_count": 43
}

These fields are nested, and they present certain difficulties when it comes to comparing them using Elasticsearch. If you were to index this document as a new type - say, petitions - all the data would be stored in the index, but because of the way the inverted index is created some of the relationships between the fields would be lost. You would know that some people in the constituency of Central Suffolk and North Ipswich signed the petition, and you would know that in at least one constituency the petition was signed by 43 people, but you wouldn't know that those 43 people were from the Central Suffolk and North Ipswich constituency.

There's a pretty good explanation of this in the Elasticsearch docs using comments on a blog article as an example.
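To make the lost relationship concrete, here is a simplified illustration in plain JavaScript. It is not what Elasticsearch does internally, just a sketch of the effect: the flatten, flatMatch and nestedMatch functions are hypothetical helpers, and the second sample entry is made up.

```javascript
// Two entries from one petition. The first is from the example above;
// the second is made up purely for illustration.
var signatures = [
  { name: 'Central Suffolk and North Ipswich', signature_count: 43 },
  { name: 'Ipswich', signature_count: 12 } // hypothetical entry
];

// Roughly what a non-nested mapping keeps: one list of values per field,
// with no record of which values came from the same entry.
function flatten(entries) {
  var flat = {};
  entries.forEach(function (entry) {
    Object.keys(entry).forEach(function (key) {
      (flat[key] = flat[key] || []).push(entry[key]);
    });
  });
  return flat;
}

// Matching against the flattened form: each condition is checked independently.
function flatMatch(entries, name, count) {
  var flat = flatten(entries);
  return flat.name.indexOf(name) !== -1 &&
         flat.signature_count.indexOf(count) !== -1;
}

// Matching against nested entries: both conditions must hold for the same entry.
function nestedMatch(entries, name, count) {
  return entries.some(function (entry) {
    return entry.name === name && entry.signature_count === count;
  });
}

console.log(flatMatch(signatures, 'Ipswich', 43));   // true (a false positive)
console.log(nestedMatch(signatures, 'Ipswich', 43)); // false
```

The flattened check reports that 43 people signed in Ipswich even though no single entry says so, which is exactly the kind of wrong answer we want to avoid.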

Mapping nested objects

Losing this relationship presents a problem for us because the application we're working towards is designed to answer the question: "Which petitions are most important to people in my constituency?" In other words, for any given constituency we need to be able to check the index to find out which petitions have the most signatures. The way Elasticsearch indexes arrays of objects by default doesn't allow us to answer this question.

What we need to do is tell Elasticsearch to index the fields as nested objects. This is how we ensure that the relationship between each constituency and the number of people who signed a petition in that constituency is preserved. To do this we simply set type as 'nested' instead of 'string', 'integer' etc. We'll also set the properties of the nested fields, defining the name fields inside signatures_by_constituency and signatures_by_country as 'not_analyzed' just as we did for the constituencyname field in part 2.

So, in mappings.js, add the following:

client.indices.putMapping({  
  index: 'gov',
  type: 'petitions',
  body: {
    properties: {
      'signatures_by_constituency': {
        'type': 'nested',
        properties: {
          'name': {
            'type': 'string',
            'index': 'not_analyzed'
          }
        }
      },
      'signatures_by_country': {
        'type': 'nested',
        properties: {
          'name': {
            'type': 'string',
            'index': 'not_analyzed'
          }
        }
      }
    }
  }
},function(err,resp){
    if (err) {
      console.log(err);
    }
    else {
      console.log(resp);
    }
});

Also add this to getmappings.js:

client.indices.getMapping({  
    index: 'gov',
    type: 'petitions',
  },
function (error,response) {  
    if (error){
      console.log(error.message);
    }
    else {
      console.log('Mappings:\n',response.gov.mappings.petitions.properties);
    }
});

Run mappings.js and then getmappings.js and you'll see the details of your new mappings for the petitions type.

Mappings:  
 { signatures_by_constituency: { type: 'nested', properties: { name: [Object] } },
  signatures_by_country: { type: 'nested', properties: { name: [Object] } } }

The petitions dataset

The petitions data can be found at the UK Government petitions website. We need to gather all the relevant data from the thousands of petitions that have been submitted, and then index all the petitions using the bulk operator like we did for the constituency data.

It's a two-stage process: first we get the data and save it as a json file, then we index the contents of that file. If you're not too bothered about having completely up-to-date data you can skip the first part ("Forming the json") and instead get the data by downloading the petitions json (petitions.json) from the petitioneering repo on GitHub, before heading straight to "Indexing the petitions".

Forming the json

The starting point for getting our json together is at https://petition.parliament.uk/petitions.json. We're going to read this file, and then read each petition it lists one by one. When we get to the end of this page, we'll move onto the page specified by links->next and repeat the process until we reach the last page of petitions, which doesn't specify a value for links->next.

Each page covers 50 petitions, and each of those has its own link (for example, petition #108072), which we can get from data->links->self. Each petition contains various pieces of data, such as the title of the petition (data->attributes->action), the government response (data->attributes->government_response) and the data that we're most interested in, which is the signatures, listed by parliamentary constituency (data->attributes->signatures_by_constituency).

Note that we're mostly just taking a subset of the fields from the petition and creating our own document from them, but we are also adding an importance field. What this does is measure the number of people in each constituency who signed a given petition relative to the overall number of signatures for that petition. A petition that gets 100 signatures in the Ipswich constituency out of a total of one thousand signatures is more important to Ipswich than one that gets 100 out of ten thousand total signatures.
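The importance calculation amounts to a single division. As a minimal sketch, using the two Ipswich figures from the paragraph above (the importance function name and the guard against a zero total are our own additions for illustration):

```javascript
// Share of a petition's signatures that came from one constituency.
function importance(constituencyCount, totalCount) {
  if (!totalCount) return 0; // assumption: treat a petition with no signatures as 0
  return constituencyCount / totalCount;
}

console.log(importance(100, 1000));  // 0.1
console.log(importance(100, 10000)); // 0.01
```

The higher the value, the larger the share of the petition's support that came from that constituency.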

We're using the petition number as the document id in our Elasticsearch index so we can update the index later with new petition data without having to start all over again. We could also check timestamps and only modify or add petitions that have been recently updated.

Save this code as petitions_get.js:

var getJSON = require('get-json');  
var fs = require('fs');  
var client = require('./connection.js')

var bulk = [];

var firstfile = "https://petition.parliament.uk/petitions.json";

var handleError = function(err) {  
  if(!err) return false;
  console.log(err);
  return true;
};

// Fetch every petition listed on one page and push it onto bulk.
// The callback fires only once all of the page's requests have finished.
var processMany = function(petitions,callback) {
  var pending = petitions.data.length;
  if (pending === 0) return callback(bulk);

  petitions.data.forEach(function(petition){
    getJSON('https://petition.parliament.uk/petitions/'+petition.id+'.json', function(error, response){
      if(handleError(error)) {
        // skip petitions that fail to download
      }
      else {
        var attributes = response.data.attributes;
        // guard against petitions that have no constituency breakdown
        var constituencies = attributes.signatures_by_constituency || [];

        // importance = this constituency's share of the petition's total signatures
        constituencies.forEach(function(constituency){
          constituency.importance = constituency.signature_count / attributes.signature_count;
        });

        bulk.push(
          { index: {_index: 'gov', _type: 'petitions', _id: response.data.id } },
          { 'self': response.links.self,
            'action': attributes.action,
            'state': attributes.state,
            'background': attributes.background,
            'signatures_by_constituency': constituencies,
            'signature_count': attributes.signature_count}
          );
      }
      // count this request as finished whether it succeeded or failed
      pending--;
      if (pending === 0) callback(bulk);
    })
  });
};

// Read one page of the petition list, process it, then follow links.next.
var processList = function(thisfile,callback){  
  getJSON(thisfile,function(error,response){
    if(handleError(error)) {
      // stop paging, but hand back whatever has been gathered so far
      callback(bulk);
    }
    else {
      console.log('gathering: '+response.links.self);
      // wait for the whole page to be processed before moving on
      processMany(response,function(){
        if (response.links.next){
          processList(response.links.next,callback);
        }
        else {
          callback(bulk);
        }
      });
    }
  })
};

var savejson = function(petitionlist,callback){  
  console.log('saving '+(petitionlist.length/2)+' petitions');
  fs.writeFile('petitions.json', JSON.stringify(petitionlist), function (err) {
    if (err) return console.log(err);
    callback('done');
  });

};

processList(firstfile,function(response){  
  console.log('petitions: '+(response.length/2));
  savejson(response,function(response){
    console.log(response);
  });
});

Run petitions_get.js and you should end up with a rather large json file containing around ten thousand (at the time of publication) petitions.

Indexing the petitions

The next step is to add these petitions to our gov index. There's nothing new here in terms of Elasticsearch - again we're using the bulk operator to index documents, but we're processing them in batches of 250 petitions to reduce timeouts and prevent bottlenecks. (The splice instruction specifies a length of 500, but remember that each document you add using bulk consists of two objects: one for the index, type and id, and one for the content.) Save the following as petitions_add.js:

var client = require('./connection.js');

var bulk = require("./petitions.json");

var handleError = function(err) {  
  if(!err) return false;
  console.log(err);
  return true;
};

var indexall = function(petitionlist,callback){  
  console.log('items left to index: '+(petitionlist.length/2));
  // take the next 500 array entries (250 petitions) off the front of the list
  var segment = petitionlist.splice(0,500);
  if (segment.length){
    bulkindex(segment,function(response){
      // report this batch before starting the next one
      callback(response);
      indexall(petitionlist,callback);
    })
  }
  else {
    callback('No more petitions to index');
  }
}

var bulkindex = function(segment,callback){  
  client.bulk({
    index: 'gov',
    type: 'petitions',
    body: segment
  },function(err,resp){
    if (err) {
      console.log(err);
      callback(err);
    }
    else {
      console.log('items',resp.items.length);
      setTimeout(function() { callback('Indexed '+resp.items.length+' items'); }, 2000);
    }
  })
}

indexall(bulk,function(response){  
  console.log(response);
});

Run petitions_add.js to index the petitions.
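The batching arithmetic can be checked in isolation, without a cluster. The following is only a sketch: toSegments is a hypothetical helper, and the sample array is placeholder data standing in for the contents of petitions.json.

```javascript
// Split a bulk array into segments of at most `size` elements without
// modifying the original (petitions_add.js uses splice, which does modify it).
function toSegments(bulk, size) {
  var segments = [];
  for (var i = 0; i < bulk.length; i += size) {
    segments.push(bulk.slice(i, i + size));
  }
  return segments;
}

// Hypothetical bulk array for 600 petitions: an action line plus a document each.
var sample = [];
for (var id = 1; id <= 600; id++) {
  sample.push({ index: { _index: 'gov', _type: 'petitions', _id: id } });
  sample.push({ action: 'Placeholder petition ' + id });
}

var segments = toSegments(sample, 500);

// Each segment holds half as many petitions as it has array elements.
console.log(segments.map(function (s) { return s.length / 2; })); // [ 250, 250, 100 ]
```

Keeping the segment size even matters: an odd size would separate an action line from its document and the bulk request would be malformed.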

We can quickly check that the petitions have been indexed by adding a few lines to info.js and running it.

client.count({index: 'gov',type: 'petitions'},function(err,resp,status) {  
  console.log('petitions',resp);
});

We can also easily show that our indexed petitions have some content by modifying search.js from the previous article.

type: 'petitions',  
  body: {
    query: {
      match: { 'action': 'Ipswich' }
    },
  }

At the time of publication that should get you one hit (petition #126939) - "To Re-name the Sky Bet Championship the Ipswich Town Level League", which demonstrates both that our indexing has worked and that some petitions are more serious than others.

A brief introduction to queries

We can illustrate a few concepts of querying in Elasticsearch by modifying our existing search.

Let's run a simple query against our petitions, looking for any that have the word "government" in the title.

Create a new file query.js with the following:

var client = require('./connection.js');

client.search({  
  index: 'gov',
  type: 'petitions',
  fields: ['action','signature_count'],
  size: 5,
  body: {
    query: {
      match: { 'action': 'government' }
    },
  }
},function (error, response,status) {
    if (error){
      console.log('search error: '+error)
    }
    else {
      console.log('--- Response ---');
      console.log('Total hits: ',response.hits.total);
      console.log('--- Hits ---');
      response.hits.hits.forEach(function(hit){
        console.log(hit);
      })
    }
});

By default Elasticsearch returns the complete document source in query results, but here we've added the fields parameter to our query, which tells Elasticsearch to only return the action and signature_count fields. We've also limited the results to 5 hits instead of the default of 10 hits.

Run query.js and you'll see the total number of hits and the titles of the top 5 results. By top 5 here we mean the petitions with the highest Elasticsearch score, because that is its default sort for results.

For an in-depth look at scoring, read our article How scoring works in Elasticsearch. In the context of our search the petitions with the highest scores are the ones where the word "government" appeared most prominently in the action field.

Looking at our results we can see that we've got a few petitions with very low signature counts. Let's change our query so that only petitions with at least ten thousand signatures are returned.

For this we'll need to use a range query in addition to our existing match query. Now we have what's known as a compound query so we'll put both parts inside a bool query. It looks better than it sounds:

body: {  
    query: {
      bool: {
        must: [
          { match: { 'action': 'government' } },
          { range : {
                'signature_count' : {
                    'gte' : 10000
                }
            }
          }
        ]
      }
    }
  }

Update query.js with this new query and run it. You'll see there are fewer overall hits, thanks to our signature count limitation.
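If you expect to vary the search term or the signature threshold, one option is to build the body with a small function. This is only a sketch: buildBody is a hypothetical helper that isn't part of the article's code, and it simply returns the same bool structure with the two values filled in.

```javascript
// Build the body for a search that matches `term` in the action field and
// requires at least `minSignatures` signatures. buildBody is a hypothetical helper.
function buildBody(term, minSignatures) {
  return {
    query: {
      bool: {
        // every clause listed under must has to match for a petition to be returned
        must: [
          { match: { action: term } },
          { range: { signature_count: { gte: minSignatures } } }
        ]
      }
    }
  };
}

// Print the generated body so it can be compared with the hand-written query.
console.log(JSON.stringify(buildBody('government', 10000), null, 2));
```

The returned object could then be passed as the body option of client.search in query.js.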

At this point it might look like the hits have been sorted by the signature_count field, but on this occasion it's just pure coincidence that the Elasticsearch scores correlate pretty well with the number of signatures. We can show this by adding in a sort that does just that:

query: {  
   [...]
},
sort: {  
      'signature_count': {
        order: 'desc'
      }
    }

This time when you run query.js you'll see the results sorted according to the value of signature_count.
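The difference between the two orderings is easy to see with a handful of made-up hits. The titles, scores and counts below are invented for illustration and titlesBy is a hypothetical helper. Real scores will differ, and Elasticsearch does its sorting on the server, not in your script.

```javascript
// Made-up hits for illustration only.
var hits = [
  { title: 'Petition A', score: 2.1, signature_count: 12000 },
  { title: 'Petition B', score: 1.7, signature_count: 85000 },
  { title: 'Petition C', score: 1.2, signature_count: 31000 }
];

// Return the titles ordered by the given numeric key, highest first.
function titlesBy(list, key) {
  return list.slice() // copy so the original order is left untouched
    .sort(function (a, b) { return b[key] - a[key]; })
    .map(function (hit) { return hit.title; });
}

console.log(titlesBy(hits, 'score'));           // [ 'Petition A', 'Petition B', 'Petition C' ]
console.log(titlesBy(hits, 'signature_count')); // [ 'Petition B', 'Petition C', 'Petition A' ]
```

The best-scoring petition here is not the most-signed one, which is why the explicit sort is needed when signature count is what you care about.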

For a more detailed look at some of these concepts and more, take a look at our article Elasticsearch.pm - Part 4: Querying and Search Options.

Next

In the next article in the series we'll look at the nested fields in our documents, and how we can run queries on those and sort the results. Once we've done that we'll be one step closer towards turning it all into a web app.