Getting started with Elasticsearch and Node.js - Part 4

In the previous article in this series we indexed the petitions to go with the constituencies data that we worked with in the earlier articles, and took a brief look at running a few queries on the petitions.

In this article we're going to run some queries on our nested fields. Nested queries are a powerful but tricky aspect of Elasticsearch. They allow us to explore more complex datasets, but setting them up correctly requires a bit more effort than the queries we've run so far.

To make life a little easier when it comes to running different searches we're going to start passing arguments to our Node files. We can use the yargs library for this.

npm install yargs  

Yargs takes the arguments supplied on the command line and puts them in a hash called argv, which we can then use to pass values to our various functions. Create a new file, nestedQuery.js, with the following:

var client = require('./connection.js');
var argv = require('yargs').argv;

Let's set up our query. We'll use the search from query.js that we created in part 3 as the basis for our new nested query. Instead of matching a keyword as we did in part 3, however, we'll limit our search to open petitions with at least ten thousand signatures, using match: { 'state': 'open' } together with a range clause on signature_count.

client.search({
  index: 'gov',
  type: 'petitions',
  fields: ['action','signature_count'],
  body: {
    query: {
      bool: {
        must: [
          { match: { 'state': 'open' } },
          { range : {
              'signature_count' : {
                'gte' : 10000
              }
            }
          }
        ]
      }
    }
  }
})

Now we need to add the nested query itself. We want to return results that correspond to a constituency name, which we'll pass as a value when we run nestedQuery.js. The nested part of our query looks like this:

nested: {  
  path: 'signatures_by_constituency',
  query: {
    bool: {
      must: [
        { 'match': { 'signatures_by_constituency.name': constitLookup }}
      ]
    }
  }
}

First we have to specify the path to the nested field, then inside that we set up another query like the one we already have. We use dot notation to provide the path to the nested field we want to query. In our case that's signatures_by_constituency.name. Finally we pass in the match term constitLookup, which we'll provide as an argument when we run nestedQuery.js.
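One way to keep this clause reusable is to build it with a small helper. The following is only a sketch: the function name nestedMatch is our own invention rather than part of the Elasticsearch client, and it simply returns the same object shown above for a given path, field and value.

```javascript
// Hypothetical helper (not part of the Elasticsearch client):
// builds a nested clause that matches one field under a nested path.
function nestedMatch(path, field, value) {
  var match = {};
  // Dot notation: the full field name is '<path>.<field>'.
  match[path + '.' + field] = value;
  return {
    nested: {
      path: path,
      query: {
        bool: {
          must: [ { match: match } ]
        }
      }
    }
  };
}

// Build the clause used in this article and print it for inspection.
var clause = nestedMatch('signatures_by_constituency', 'name', 'Ipswich');
console.log(JSON.stringify(clause, null, 2));
```

The returned object can be dropped straight into the must array of the outer bool query.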

Putting the two parts together, our full query looks like this:

var results = function(constitLookup) {  
  client.search({
    index: 'gov',
    type: 'petitions',
    fields: ['action','signature_count'],
    body: {
      query: {
        bool: {
          must: [
            { match: { 'state': 'open' }},
            { range : {
                  'signature_count' : {
                      'gte' : 10000
                  }
              }
            },
            { nested: {
              path: 'signatures_by_constituency',
              query: {
                bool: {
                  must: [
                    { 'match': { 'signatures_by_constituency.name': constitLookup }}
                  ]
                }
              }
            }}
          ]
        }
      }
    }
  },function (error, response,status) {
      if (error){
        console.log("search error: "+error)
      }
      else {
        response.hits.hits.forEach(function(hit){
          console.log(hit);
        })
      }
    });
}

Copy that into nestedQuery.js. We'll need something to pass our search term in and display the results, so add this to the end of nestedQuery.js:

if (argv.search) {  
  var constitLookup=argv.search;
  console.log("Search term: "+constitLookup);
  results(constitLookup);
}
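The check above does nothing at all if --search is left out, which can be confusing. A slightly more defensive version might look like the sketch below. The helper name readSearchTerm is hypothetical, and the plain objects stand in for the hash that yargs would build. We're also assuming that a flag given without a usable value will not arrive as a string, so anything that isn't a non-empty string is rejected.

```javascript
// Hypothetical helper: pulls a usable search term out of a yargs-style argv hash.
// Returns null when --search is missing, empty, or not a string.
function readSearchTerm(argv) {
  var term = argv && argv.search;
  if (typeof term !== 'string') {
    return null;
  }
  term = term.trim();
  return term.length > 0 ? term : null;
}

// Plain objects stand in for the hash that yargs would build.
var term = readSearchTerm({ search: 'Ipswich' });
if (term === null) {
  console.log('Usage: node nestedQuery --search="Constituency name"');
} else {
  console.log('Search term: ' + term);
}
```

In nestedQuery.js you would pass the real argv to readSearchTerm and call results(term) in the else branch.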

To test our query we can run it as follows:

node nestedQuery --search="Ipswich"  

You'll get a list of results including the index, type, id and score for each document in the results, together with the contents of the action and signature_count fields that we specified:

{ _index: 'gov',
  _type: 'petitions',
  _id: '120753',
  _score: 10.085289,
  fields:
   { action: [ 'Parliament to sit on Saturdays which should be a "normal working day" for MPs.' ],
     signature_count: [ 99802 ] } }
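Logging the whole hit is fine for debugging, but a one-line summary is easier to scan. Here's a minimal sketch, assuming each hit has the shape shown above, where every requested field comes back as an array of values. The helper name formatHit and its fallback values are our own.

```javascript
// Hypothetical helper: turns one search hit into a single readable line.
// Assumes each requested field is returned as an array of values.
function formatHit(hit) {
  var fields = hit.fields || {};
  var action = (fields.action || ['(no action)'])[0];
  var count = (fields.signature_count || [0])[0];
  return hit._id + ': ' + action + ' (' + count + ' signatures)';
}

// The sample result from above, used as test input.
var sampleHit = {
  _index: 'gov',
  _type: 'petitions',
  _id: '120753',
  _score: 10.085289,
  fields: {
    action: [ 'Parliament to sit on Saturdays which should be a "normal working day" for MPs.' ],
    signature_count: [ 99802 ]
  }
};

console.log(formatHit(sampleHit));
```

To use it, swap console.log(hit) for console.log(formatHit(hit)) inside the forEach.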

So far, so good, but we haven't yet told Elasticsearch how we want the results ordered. The query currently just finds open petitions with at least ten thousand signatures in total that were signed in the constituency passed as an argument to nestedQuery.js - in our case 'Ipswich'. What we really want is to sort our results according to the number of signatures in the constituency specified by constitLookup.

Sorting the nested query

If we want to, we can sort our results just as we did when we were searching on non-nested fields. To sort our results in descending order of the number of signatures on the petition we could add this sort to the body of our search, alongside query, as we did in part 3:

query: {  
  ...
},
sort: {  
  'signature_count': {
    order: 'desc'
  }
}

However, at this point it's probable that we'd get a very similar set of results if we didn't specify a constituency name to search for, because once a petition has ten thousand signatures it's overwhelmingly likely that it has been signed at least once in every one of the 650 constituencies in our dataset.

We can make our nested search work a bit harder for us by removing the requirement that signature_count must be at least ten thousand and changing the sort order from descending (order: 'desc') to ascending (order: 'asc').

query: {  
  bool: {
    must: [
      { match: { 'state': 'open' }},
      { nested: {
        path: 'signatures_by_constituency',
        query: {
          bool: {
            must: [
              { 'match': { 'signatures_by_constituency.name': constitLookup }}
            ]
          }
        }
      }}
    ]
  }
},
sort: {  
  'signature_count': {
    order: 'asc'
  }
}
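Since we're now switching the signature threshold and the sort order on and off, it may be easier to build the search body from a function. The sketch below is one way to do that. The name buildBody and its options (minSignatures and order) are our own assumptions, not anything the client library provides.

```javascript
// Hypothetical helper: builds the search body for a constituency.
// options.minSignatures (optional) adds the range clause back in;
// options.order is 'asc' or 'desc' and defaults to 'asc'.
function buildBody(constitLookup, options) {
  options = options || {};
  var must = [
    { match: { 'state': 'open' } },
    { nested: {
      path: 'signatures_by_constituency',
      query: {
        bool: {
          must: [
            { match: { 'signatures_by_constituency.name': constitLookup } }
          ]
        }
      }
    }}
  ];
  // Only require a minimum number of signatures when asked to.
  if (typeof options.minSignatures === 'number') {
    must.push({ range: { 'signature_count': { gte: options.minSignatures } } });
  }
  return {
    query: { bool: { must: must } },
    sort: { 'signature_count': { order: options.order || 'asc' } }
  };
}

// The ascending, no-threshold version described above.
console.log(JSON.stringify(buildBody('Ipswich'), null, 2));
```

Calling buildBody('Ipswich', { minSignatures: 10000, order: 'desc' }) would give the earlier version of the query, and the result can be passed as the body of client.search.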

This gets us a list of the least signed open petitions that have been signed at least once in the constituency we're interested in. It's not as useful to us as a descending sort, although it may throw up one or two local issues. An ascending sort makes more sense on other kinds of data. If instead we had a list of users, with name mapped as a nested field, and our data was structured like this:

{
  "name": {
    "first": "Neil",
    "last": "Dewhurst"
  },
  "age": 21
},
{
  "name": {
    "first": "Neil",
    "last": "Smith"
  },
  "age": 55
}

We could use this query to order our 'Neils' by ascending age:

query: {
  nested: {
    path: 'name',
    query: {
      bool: {
        must: [
          { 'match': { 'name.first': 'Neil' }}
        ]
      }
    }
  }
},
sort: {
  'age': {
    order: 'asc'
  }
}

Sorting by nested field

What we really want to do with our constituencies is sort on a nested field so that we are sorting on data that relates to the constituency given by constitLookup. For this we need to specify a nested_path and a nested_filter. First let's sort our results in descending order of the number of signatures in the constituency we're querying on.

sort: {  
  'signatures_by_constituency.signature_count' : {
    order: 'desc',
    nested_path: 'signatures_by_constituency',
    nested_filter: {
      query: {
        bool: {
          must: [
            { 'match': { 'signatures_by_constituency.name': constitLookup }}
          ]
        }
      }
    }
  }
}

Note that our nested_path here matches the path specified in the nested element of our query, and the nested_filter matches our query of that same nested element.

Replace the plain signature_count sort with this new nested sort and run nestedQuery.js again, keeping 'Ipswich' as your search term.
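Because the nested_path and nested_filter have to stay in step with the nested part of the query, it can help to generate the sort clause in one place. The sketch below shows one way to do that. The helper name constituencySort is hypothetical, and it assumes the same signatures_by_constituency path and name field used throughout this article.

```javascript
// Hypothetical helper: builds a nested sort clause for one field of
// signatures_by_constituency, filtered to the constituency we're querying.
function constituencySort(field, constitLookup, order) {
  var path = 'signatures_by_constituency';
  var sort = {};
  sort[path + '.' + field] = {
    order: order || 'desc',
    // Must match the path used in the nested part of the query.
    nested_path: path,
    // Must match the query used in the nested part of the query.
    nested_filter: {
      query: {
        bool: {
          must: [
            { match: { 'signatures_by_constituency.name': constitLookup } }
          ]
        }
      }
    }
  };
  return sort;
}

console.log(JSON.stringify(constituencySort('signature_count', 'Ipswich'), null, 2));
```

The same helper could be called once per field if several nested sorts are needed, since Elasticsearch also accepts sort as an array of such objects.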

Finally, let's make use of the importance field we created when we indexed the petitions. Elasticsearch can sort on multiple fields, so we can extend our existing sort to order the results by importance first and then by signature_count:

sort: {  
  'signatures_by_constituency.importance' : {
    order: 'desc',
    nested_path: 'signatures_by_constituency',
    nested_filter: {
      query: {
        bool: {
          must: [
            { 'match': { 'signatures_by_constituency.name': constitLookup }}
          ]
        }
      }
    }
  },
  'signatures_by_constituency.signature_count' : {
    order: 'desc',
    nested_path: 'signatures_by_constituency',
    nested_filter: {
      query: {
        bool: {
          must: [
            { 'match': { 'signatures_by_constituency.name': constitLookup }}
          ]
        }
      }
    }
  }
}

Now when you run nestedQuery.js you'll get back ten petitions (the default number of hits Elasticsearch returns), listed in descending order of the value of the importance field for whatever constituency you supply as the search argument. In other words, given a constituency as an input, our output is now a list of the petitions people in that constituency are most interested in.

Now that we have two sort criteria you'll also notice that Elasticsearch returns the sort values as an array, which is handy for when you want to use those values in your output. And that's exactly what we'll do in the next and final article.
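As a rough preview of how those values might be read, each hit carries them in a sort array, in the same order as the sort criteria. The sketch below assumes the two criteria above (importance first, then the constituency's signature_count). The helper name labelSortValues and the example hit are made up purely for illustration.

```javascript
// Hypothetical helper: gives names to the sort values on a hit.
// Assumes two sort criteria, in this order: importance, then the
// constituency's signature_count.
function labelSortValues(hit) {
  var values = hit.sort || [];
  return {
    importance: values[0],
    constituencySignatures: values[1]
  };
}

// A made-up hit for illustration only; the numbers are not real results.
var exampleHit = { _id: 'example', sort: [ 3, 42 ] };
console.log(labelSortValues(exampleHit));
```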

Next

In the final article in the series we'll use both the constituency and petitions data sets together and add in a postcode lookup. We'll finish by turning our existing code into a web app, which we'll deploy using IBM Bluemix.