Elasticsearch.pm - Part 4: Querying and Search Options

In this final article of our Elasticsearch.pm series, we'll finish up by looking at querying and some of the search options. The previous articles in this series led us through installing the Search::Elasticsearch Perl module, connecting to and checking our Elasticsearch instance and server cluster, indexing basics, and some more advanced index options and considerations.

The examples in this article use the "$es" client object and the "top_films" index we created in the earlier articles in the series. As always, we recommend reviewing the official documentation to get a deeper understanding of the methods and options we'll be demonstrating. In this article, we'll focus on the Elasticsearch Search APIs. These are accessed using methods in the Search::Elasticsearch::Client::2_0::Direct sub-module.

Get Document

Probably the simplest retrieval to run is a get() on a specific document. To do this, you'll need the name of the index (in our case "top_films"), the document type ("film" for us), and the document ID. We've been using "The Matrix" in many of our examples, which was indexed as document ID 17, so let's run get() on that:

# Get a specific document
my $doc = $es->get(  
    index   => 'top_films',
    type    => 'film',
    id      => 17
);

As you can see, the get() method takes the same arguments we used in Part 2 of this series to index documents and in Part 3 to update an existing document. In Part 1 of this series, we set up logging for our server interactions, so let's check our log file for the document that gets returned to us:

[Mon Jan 11 20:20:17 2016] # Request to: https://aws-us-east-1-portal10.dblayer.com:10019
curl -XGET 'http://localhost:9200/top_films/film/17?pretty=1'

[Mon Jan 11 20:20:17 2016] # Response: 200, Took: 419 ms
# {
#    "_version" : 2,
#    "_type" : "film",
#    "_source" : {
#       "year" : "1999",
#       "type" : "feature",
#       "image" : {
#          "height" : "500",
#          "width" : "334",
#          "url" : "http://ia.media-imdb.com/images/M/MV5BMTkxNDYxOTA4M15BMl5BanBnXkFtZTgwNTk0NzQxMTE@._V1_.jpg"
#       },
#       "can_rate" : "true",
#       "rating" : "8.7",
#       "description" : "A computer hacker learns from mysterious rebels about the true nature of his reality and his role in the war against its controllers.",
#       "num_votes" : "1125526",
#       "tconst" : "tt0133093",
#       "title" : "The Matrix"
#    },
#    "found" : true,
#    "_index" : "top_films",
#    "_id" : "17"
# }
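
Because get() hands back the decoded JSON as a Perl hash reference, individual fields can be read straight out of "_source". Here is a minimal sketch of that; it uses a hand-built $doc that mirrors part of the logged response above in place of a live call, so the structure shown is an assumption based on that log:

```perl
use strict;
use warnings;

# Stand-in for the hash reference returned by $es->get(...); the values
# are copied from the logged response above.
my $doc = {
    _index  => 'top_films',
    _type   => 'film',
    _id     => '17',
    _source => {
        title  => 'The Matrix',
        year   => '1999',
        rating => '8.7',
    },
};

# Build a one-line summary from the document source
my $src     = $doc->{_source};
my $summary = sprintf '%s (%s), rated %s', $src->{title}, $src->{year}, $src->{rating};
print "$summary\n";
```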

But what if you don't know the ID of the document you're looking for? Here's where we get into the methods for the Search APIs.

Search

Searches in Elasticsearch are run against all indexes and all document types in those indexes by default. By specifying one or more particular indexes or types, you can limit the search accordingly. When performing a search using the search() method, you can use query to match against one or more of the document fields in various ways, and you can use aggs (for aggregations) to summarize documents using common calculations. Let's start by looking at queries.

Queries

Before we start in on some examples, let's cover some basics. The query search function uses the Elasticsearch Query DSL. Check out the official Elasticsearch documentation to get a better understanding of some of the options we're mentioning here.

Queries come in two flavors: leaf and compound. Leaf queries look for a value in a specified field, using clauses such as match, term, and range. Compound queries combine leaf queries or other compound queries, using bool, dis_max, not, and some additional clauses for making combinations.

It is also possible to get two different kinds of results by running query. The first kind is what you'd typically expect in a search result: the documents that match and their relative scores. This kind of result is returned when you are in what's called "query context". The second kind of result returns the matched documents, but the score is a constant of 1 for those that match. This is called "filter context". It can be used when you just want to see which documents match, but you aren't concerned with relative scores. So, depending on your purposes, your queries can use one or the other context for the result type you need.

You can also use filtering with "query context". In that use case, matches in the filter are used only to narrow the search results, but the filter match is not used in the document scoring for the query. Using filters with queries where the matches are not relevant to the scoring can increase the performance of the queries.
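
One way to express a scored query narrowed by a non-scoring filter is the filter clause of the bool query, which Elasticsearch 2.0 supports. The sketch below only builds and prints the request body, so it runs without a cluster. The field names come from our "top_films" documents, and the search() call is left as a comment because it assumes a connected "$es" client:

```perl
use strict;
use warnings;
use JSON::PP;

# The match clause runs in query context and contributes to the score;
# the term clause under "filter" only narrows the result set.
my $body = {
    query => {
        bool => {
            must   => { match => { title => 'godfather' } },
            filter => { term  => { year  => '1974' } },
        },
    },
};

# Print the JSON that would be sent to Elasticsearch
print JSON::PP->new->canonical->pretty->encode($body);

# With a connected client this could be run as:
# my $results = $es->search(index => 'top_films', type => 'film', body => $body);
```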

Let's start by checking out leaf queries.

Leaf queries

Here is a simple leaf query using the match clause against the title field:

# Search for a document by matching in the title field
my $results = $es->search(  
    index => 'top_films',
    type  => 'film',
    body  => {
        query => {
            match => {title => 'godfather'}
        }
    }
);

In the example above, we're using the search() method for documents. We're limiting our search to our "top_films" index and specifically to the "film" document type within that index. The body of our search is a query using the match clause to try to find matches in the title for "godfather". Our results are exactly what we'd expect given our data set of the top 250 films:

[Mon Jan 11 21:58:21 2016] # Request to: https://aws-us-east-1-portal7.dblayer.com:10304
curl -XGET 'http://localhost:9200/top_films/film/_search?pretty=1' -d '  
{
   "query" : {
      "match" : {
         "title" : "godfather"
      }
   }
}
'

[Mon Jan 11 21:58:21 2016] # Response: 200, Took: 862 ms
# {
#    "took" : 2,
#    "timed_out" : false,
#    "hits" : {
#       "total" : 2,
#       "hits" : [
#          {
#             "_id" : "2",
#             "_type" : "film",
#             "_source" : {
#                "year" : "1972",
#                "image" : {
#                   "width" : "333",
#                   "url" : "http://ia.media-imdb.com/images/M/MV5BMjEyMjcyNDI4MF5BMl5BanBnXkFtZTcwMDA5Mzg3OA@@._V1_.jpg",
#                   "height" : "500"
#                },
#                "tconst" : "tt0068646",
#                "title" : "The Godfather",
#                "can_rate" : "true",
#                "rating" : "9.2",
#                "type" : "feature",
#                "num_votes" : "1072605"
#             },
#             "_score" : 2.6367974,
#             "_index" : "top_films"
#          },
#          {
#             "_id" : "3",
#             "_source" : {
#                "num_votes" : "727439",
#                "type" : "feature",
#                "rating" : "9",
#                "title" : "The Godfather: Part II",
#                "can_rate" : "true",
#                "year" : "1974",
#                "image" : {
#                   "url" : "http://ia.media-imdb.com/images/M/MV5BNDc2NTM3MzU1Nl5BMl5BanBnXkFtZTcwMTA5Mzg3OA@@._V1_.jpg",
#                   "height" : "500",
#                   "width" : "333"
#                },
#                "tconst" : "tt0071562"
#             },
#             "_score" : 2.1193392,
#             "_index" : "top_films",
#             "_type" : "film"
#          }
#       ],
#       "max_score" : 2.6367974
#    },
#    "_shards" : {
#       "total" : 5,
#       "failed" : 0,
#       "successful" : 5
#    }
# }

We got two "hits" (search results) for our query: "The Godfather" and "The Godfather: Part II" (note that the third Godfather film was not part of the top films data set; if it were, we would have gotten three hits based on our query). Both the documents and their relative scores were retrieved, so you can tell we're in "query context". "The Godfather: Part II" scored lower than "The Godfather" because it has more words in the title, so it's less of an exact match.
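
In Perl, the hits arrive as an array reference nested inside the returned hash, so walking them is a matter of dereferencing "hits" twice. The following sketch assumes a $results structure shaped like the logged response above (trimmed to the fields it uses) rather than making a live search() call:

```perl
use strict;
use warnings;

# Stand-in for the hash reference returned by $es->search(...), trimmed
# to the fields used below; values are copied from the logged response.
my $results = {
    hits => {
        total     => 2,
        max_score => 2.6367974,
        hits      => [
            { _id => '2', _score => 2.6367974, _source => { title => 'The Godfather' } },
            { _id => '3', _score => 2.1193392, _source => { title => 'The Godfather: Part II' } },
        ],
    },
};

# Collect "title (score)" for each hit, in the order Elasticsearch returned them
my @lines = map { sprintf '%s (%.2f)', $_->{_source}{title}, $_->{_score} }
            @{ $results->{hits}{hits} };

print "Found $results->{hits}{total} hits\n";
print "$_\n" for @lines;
```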

We'll save a more in-depth discussion of how the score is calculated for a future article since it can be quite complex. The other thing we can see here is that all 5 of our shards were queried and responded successfully (the "_shards" section counts shards that executed the search without error, not shards that contained a match). By default all the shards are queried, but you can specify a routing parameter to target specific shards if that is preferable for the kind of query you're running.

Let's look now at a compound query example.

Compound queries

Let's say we want to do the same match query for "godfather" in the title, but we also want to make sure we get the one that was released in 1974:

# Search for a document by matching year and title
my $results = $es->search(  
    index => 'top_films',
    type  => 'film',
    body  => {
        query => {
            bool => {
                must => [
                    {term => {year => '1974'}},
                    {match => {title => 'godfather'}}
                ]
            }
        }
    }
);

Here you can see that we're using the bool clause and, within that, specifying two conditions that must have a match (bool also accepts must_not, should, and filter clauses, along with parameters such as minimum_should_match). Our two conditions are a title match on "godfather" (like we saw above) and now also a term match for 1974 in the year field. A term match requires an exact match of the field value.

And, now we only get one hit back as we'd expect - "The Godfather: Part II":

[Mon Jan 11 21:58:22 2016] # Request to: https://aws-us-east-1-portal10.dblayer.com:10019
curl -XGET 'http://localhost:9200/top_films/film/_search?pretty=1' -d '  
{
   "query" : {
      "bool" : {
         "must" : [
            {
               "term" : {
                  "year" : "1974"
               }
            },
            {
               "match" : {
                  "title" : "godfather"
               }
            }
         ]
      }
   }
}
'

[Mon Jan 11 21:58:22 2016] # Response: 200, Took: 581 ms
# {
#    "took" : 9,
#    "hits" : {
#       "total" : 1,
#       "max_score" : 4.4957976,
#       "hits" : [
#          {
#             "_type" : "film",
#             "_source" : {
#                "num_votes" : "727439",
#                "type" : "feature",
#                "rating" : "9",
#                "can_rate" : "true",
#                "title" : "The Godfather: Part II",
#                "tconst" : "tt0071562",
#                "image" : {
#                   "width" : "333",
#                   "url" : "http://ia.media-imdb.com/images/M/MV5BNDc2NTM3MzU1Nl5BMl5BanBnXkFtZTcwMTA5Mzg3OA@@._V1_.jpg",
#                   "height" : "500"
#                },
#                "year" : "1974"
#             },
#             "_score" : 4.4957976,
#             "_index" : "top_films",
#             "_id" : "3"
#          }
#       ]
#    },
#    "timed_out" : false,
#    "_shards" : {
#       "successful" : 5,
#       "total" : 5,
#       "failed" : 0
#    }
# }

Filters

If we change up our syntax a bit, we can use "filter context" instead. In this case, we are using the query inside the filter so that the whole request is using "filter context":

# Search for a document by matching title and year using filters
my $results = $es->search(  
    index => 'top_films',
    type  => 'film',
    body  => {
        filter => {
            and => [
            {term => {year => '1974'}},
            {query => {
                match => {title => 'godfather'}
            }}
            ]
        }
    }
);

In the above example, we're using the and condition to join our term filter for the year together with our query match on title. Note that because the match clause is not exact, we have to wrap it in query for a filter. The result we get in "filter context" is "The Godfather: Part II" with a score of 1 since score is not applied in filters:

[Tue Jan 12 10:52:24 2016] # Request to: https://aws-us-east-1-portal10.dblayer.com:10019
curl -XGET 'http://localhost:9200/top_films/film/_search?pretty=1' -d '  
{
   "filter" : {
      "and" : [
         {
            "term" : {
               "year" : "1974"
            }
         },
         {
            "query" : {
               "match" : {
                  "title" : "godfather"
               }
            }
         }
      ]
   }
}
'

[Tue Jan 12 10:52:24 2016] # Response: 200, Took: 210 ms
# {
#    "_shards" : {
#       "total" : 5,
#       "successful" : 5,
#       "failed" : 0
#    },
#    "timed_out" : false,
#    "took" : 21,
#    "hits" : {
#       "max_score" : 1,
#       "total" : 1,
#       "hits" : [
#          {
#             "_id" : "3",
#             "_source" : {
#                "image" : {
#                   "height" : "500",
#                   "width" : "333",
#                   "url" : "http://ia.media-imdb.com/images/M/MV5BNDc2NTM3MzU1Nl5BMl5BanBnXkFtZTcwMTA5Mzg3OA@@._V1_.jpg"
#                },
#                "tconst" : "tt0071562",
#                "type" : "feature",
#                "year" : "1974",
#                "title" : "The Godfather: Part II",
#                "can_rate" : "true",
#                "num_votes" : "727439",
#                "rating" : "9"
#             },
#             "_score" : 1,
#             "_type" : "film",
#             "_index" : "top_films"
#          }
#       ]
#    }
# }

We've only been able to review some of the more common query examples here. There are loads of options for querying, including those for full text and term matches (some of which we saw in our examples above) as well as nested and hierarchical matches. You can boost score results, use wildcards, and perform fuzzy matches, among other things.

There are also Geo-focused options, Specialized options including "more like" functionality and scripting, and Span options (positional queries for specific ordering and proximity). And, once you have your matches, you can apply various parameters to the results like sorting, re-scoring, highlighting, filtering by a minimum score requirement, post-filtering for faceted searching, index boosting, and more!

There are also some useful query utilities. For example, you can use search_exists() simply to see if you'll get any results back at all or you can run count() to check how many results your search will retrieve. If you want to learn more about the results you retrieved, explain() can help you understand why a document matched your query and got the score it did.
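
As an illustration of count(), here is a small sketch of a helper that returns only the number of matching films. The count_films name is our own, and StubClient is a hypothetical stand-in that returns a canned response so the example runs without a cluster; with a real connection you would pass "$es" instead:

```perl
use strict;
use warnings;

# Hypothetical stand-in for the real client so the sketch runs without a
# cluster; it returns a canned response with the "count" key we read below.
{
    package StubClient;
    sub new   { return bless {}, shift }
    sub count { return { count => 2 } }
}

# Return the number of films matching a query. $es is expected to be a
# Search::Elasticsearch client (or anything with a compatible count()).
sub count_films {
    my ($es, $query) = @_;
    my $response = $es->count(
        index => 'top_films',
        type  => 'film',
        body  => { query => $query },
    );
    return $response->{count};
}

my $n = count_films(StubClient->new, { match => { title => 'godfather' } });
print "matching films: $n\n";
```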

Now that we've got some basics under our belts for querying, let's look at aggregations.

Aggregations

Aggregations allow us to group documents and run summary calculations on them. The official Elasticsearch documentation covers aggregations in detail. Here we'll have a brief look at two kinds of aggregations that Elasticsearch provides: metrics and bucket.

Metrics

One of the most common uses for aggregations is computing metrics from the values in one or more fields. If you've used SQL, then you're probably familiar with some of the aggregate functions used in that language such as SUM and AVG. Elasticsearch allows you to perform similar aggregations to return metrics from document result sets.

Let's use the avg aggregation to find out the average rating for the top films in 1974:

# Search for average rating
my $results = $es->search(  
    index => 'top_films',
    type  => 'film',
    body  => {
        query => {
            term => {year => 1974}
        },
        aggs => {
            avg_rating => {avg => {field => 'rating'}}
        }
    }
);

This example does a term query for the top films of 1974 and then performs an avg aggregation on the rating field for the returned documents. The aggregation is named "avg_rating". As it turns out there were two (excellent, if I must say) top-rated films in 1974. Here are our results:

[Tue Jan 12 15:04:44 2016] # Request to: https://aws-us-east-1-portal10.dblayer.com:10019
curl -XGET 'http://localhost:9200/top_films/film/_search?pretty=1' -d '  
{
   "aggs" : {
      "avg_rating" : {
         "avg" : {
            "field" : "rating"
         }
      }
   },
   "query" : {
      "term" : {
         "year" : 1974
      }
   }
}
'

[Tue Jan 12 15:04:44 2016] # Response: 200, Took: 561 ms
# {
#    "aggregations" : {
#       "avg_rating" : {
#          "value" : 8.65000009536743
#       }
#    },
#    "_shards" : {
#       "successful" : 5,
#       "failed" : 0,
#       "total" : 5
#    },
#    "took" : 49,
#    "timed_out" : false,
#    "hits" : {
#       "hits" : [
#          {
#             "_source" : {
#                "can_rate" : "true",
#                "image" : {
#                   "height" : "500",
#                   "url" : "http://ia.media-imdb.com/images/M/MV5BMTUyMTQ1NjA2OV5BMl5BanBnXkFtZTcwODQ1Njg3OA@@._V1_.jpg",
#                   "width" : "333"
#                },
#                "year" : "1974",
#                "rating" : "8.3",
#                "type" : "feature",
#                "tconst" : "tt0071315",
#                "num_votes" : "201317",
#                "title" : "Chinatown"
#             },
#             "_score" : 1,
#             "_type" : "film",
#             "_id" : "122",
#             "_index" : "top_films"
#          },
#          {
#             "_source" : {
#                "title" : "The Godfather: Part II",
#                "num_votes" : "727439",
#                "rating" : "9",
#                "type" : "feature",
#                "tconst" : "tt0071562",
#                "year" : "1974",
#                "image" : {
#                   "height" : "500",
#                   "width" : "333",
#                   "url" : "http://ia.media-imdb.com/images/M/MV5BNDc2NTM3MzU1Nl5BMl5BanBnXkFtZTcwMTA5Mzg3OA@@._V1_.jpg"
#                },
#                "can_rate" : "true"
#             },
#             "_score" : 1,
#             "_type" : "film",
#             "_id" : "3",
#             "_index" : "top_films"
#          }
#       ],
#       "max_score" : 1,
#       "total" : 2
#    }
# }

In this case, because our query was an exact match for both films (year = 1974), the score for both films is a 1. The average rating returned for these two films was 8.65: "The Godfather: Part II" had a rating of 9 and "Chinatown" had a rating of 8.3.
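
The aggregation value sits under the name we gave it in the request, alongside the usual hits. As a rough sketch, assuming a $results structure shaped like the logged response above, we can read the average and cross-check it against the ratings of the returned hits:

```perl
use strict;
use warnings;

# Stand-in for the search() response, trimmed to the parts used here;
# values are copied from the logged response above.
my $results = {
    aggregations => { avg_rating => { value => 8.65000009536743 } },
    hits         => {
        hits => [
            { _source => { title => 'Chinatown',              rating => '8.3' } },
            { _source => { title => 'The Godfather: Part II', rating => '9'   } },
        ],
    },
};

# The aggregation result is keyed by the name we gave it in the request
my $avg = $results->{aggregations}{avg_rating}{value};

# Cross-check by averaging the ratings of the returned hits ourselves
my @ratings = map { $_->{_source}{rating} } @{ $results->{hits}{hits} };
my $sum = 0;
$sum += $_ for @ratings;
my $manual = $sum / @ratings;

printf "aggregation: %.2f, computed from hits: %.2f\n", $avg, $manual;
```

Keep in mind that the cross-check only works here because both matching films are among the returned hits. A search returns only the first 10 hits by default, while the aggregation is computed over every matching document.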

With metrics aggregations, avg is only the beginning, and one that many people are already familiar with. Elasticsearch also provides geo-based metrics, percentiles, stats, and several others that can be used for your analytics and reporting.

Metrics aggregations aren't the only game in town, though. Let's move on to bucket aggregations.

Bucket

A bucket aggregation is pretty much what it sounds like. If a document matches the criteria, it will be bucketed. A document can belong to more than one bucket, in which case our buckets could be visualized as a Venn diagram. An interesting aspect of buckets is that they can also have sub-aggregations that are hierarchical: buckets within buckets, Venn diagrams within Venn diagrams. Hang on to your seat!

For our bucket example, we'll not get too crazy. We'll stick with 1974 just to keep our result set small (if you wanted to do the entire set, you could use match_all for your query). We're going to do an aggregation on the num_votes field where 0-500,000 votes is one bucket and greater than 500,000 votes is another bucket:

# Search for votes range grouping
my $results = $es->search(  
    index => 'top_films',
    type => 'film',
    body => {
        query => {
            term => {year => 1974}
        },
        aggs => {
            votes => {
                range => {
                    field => 'num_votes',
                    ranges => [
                        {to => 500000},
                        {from => 500001}
                    ]
                }
            }
        }
    }
);

As you can see, we're using the same query (year = 1974), but our aggregation in this case is a series of ranges applied against the num_votes field. We're calling this aggregation "votes". In our case, we have two buckets represented by these ranges. In a range aggregation, from is inclusive and to is exclusive, so the first bucket holds any value below 500,000 and the second holds any value of 500,001 or more. A film with exactly 500,000 votes would land in neither bucket; using from => 500000 for the second range would close that gap. Let's have a look at our results:

[Tue Jan 12 16:59:16 2016] # Request to: https://aws-us-east-1-portal10.dblayer.com:10019
curl -XGET 'http://localhost:9200/top_films/film/_search?pretty=1' -d '  
{
   "aggs" : {
      "votes" : {
         "range" : {
            "field" : "num_votes",
            "ranges" : [
               {
                  "to" : 500000
               },
               {
                  "from" : 500001
               }
            ]
         }
      }
   },
   "query" : {
      "term" : {
         "year" : 1974
      }
   }
}
'

[Tue Jan 12 16:59:16 2016] # Response: 200, Took: 508 ms
# {
#    "hits" : {
#       "total" : 2,
#       "max_score" : 1,
#       "hits" : [
#          {
#             "_score" : 1,
#             "_source" : {
#                "image" : {
#                   "width" : "333",
#                   "url" : "http://ia.media-imdb.com/images/M/MV5BMTUyMTQ1NjA2OV5BMl5BanBnXkFtZTcwODQ1Njg3OA@@._V1_.jpg",
#                   "height" : "500"
#                },
#                "title" : "Chinatown",
#                "type" : "feature",
#                "num_votes" : "201317",
#                "year" : "1974",
#                "tconst" : "tt0071315",
#                "rating" : "8.3",
#                "can_rate" : "true"
#             },
#             "_index" : "top_films",
#             "_type" : "film",
#             "_id" : "122"
#          },
#          {
#             "_score" : 1,
#             "_source" : {
#                "tconst" : "tt0071562",
#                "can_rate" : "true",
#                "rating" : "9",
#                "year" : "1974",
#                "title" : "The Godfather: Part II",
#                "type" : "feature",
#                "num_votes" : "727439",
#                "image" : {
#                   "width" : "333",
#                   "url" : "http://ia.media-imdb.com/images/M/MV5BNDc2NTM3MzU1Nl5BMl5BanBnXkFtZTcwMTA5Mzg3OA@@._V1_.jpg",
#                   "height" : "500"
#                }
#             },
#             "_id" : "3",
#             "_type" : "film",
#             "_index" : "top_films"
#          }
#       ]
#    },
#    "timed_out" : false,
#    "_shards" : {
#       "successful" : 5,
#       "total" : 5,
#       "failed" : 0
#    },
#    "took" : 17,
#    "aggregations" : {
#       "votes" : {
#          "buckets" : [
#             {
#                "to" : 500000,
#                "to_as_string" : "500000.0",
#                "doc_count" : 1,
#                "key" : "*-500000.0"
#             },
#             {
#                "from_as_string" : "500001.0",
#                "from" : 500001,
#                "doc_count" : 1,
#                "key" : "500001.0-*"
#             }
#          ]
#       }
#    }
# }

We see our two films returned. We also see the aggregation buckets. We have one film (notice the "doc_count") in the 0-500,000 range and we have one in the greater than 500,000 range.
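
Each bucket comes back as a hash with a "key" and a "doc_count", so turning the aggregation into a simple lookup table takes one map. This is a sketch that assumes a $results structure shaped like the "aggregations" section of the logged response above:

```perl
use strict;
use warnings;

# Stand-in for the "aggregations" part of the search() response; values
# are copied from the logged response above.
my $results = {
    aggregations => {
        votes => {
            buckets => [
                { key => '*-500000.0', to   => 500000, doc_count => 1 },
                { key => '500001.0-*', from => 500001, doc_count => 1 },
            ],
        },
    },
};

# Map each bucket key to its document count
my %count_for = map { $_->{key} => $_->{doc_count} }
                @{ $results->{aggregations}{votes}{buckets} };

# Print one line per bucket, sorted by key
printf "%-12s %d\n", $_, $count_for{$_} for sort keys %count_for;
```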

Besides range-based buckets, there are histograms, date intervals, parent-child hierarchies, term groupings and more.

Wrapping Up

You may have started to realize just how rich Elasticsearch actually is from this series of articles. There is so much under the hood that we have not even begun to cover here, but hopefully you've got a solid grasp on how to use the Search::Elasticsearch Perl module to get your Elasticsearch project going. In this article we covered querying via the Search APIs, specifically looking at some query and aggregation examples. Previous articles covered installing the module, connecting to our Elasticsearch instance, monitoring our cluster, and indexing our document set (including using some more advanced index options).

Happy Elastic-searching!
