Elasticsearch.pm - Part 3: Index Options

In this third article of our Elasticsearch.pm series, we're going to get into index options for our Elasticsearch instance. If you're just joining us, you may want to look back at the first article in the series, where we reviewed how to connect and perform cluster monitoring with the Search::Elasticsearch Perl module. We'll be building on the client object we created in that article, as well as on the document and index methods we reviewed in Part 2 of this series. Each of these previous articles helps set the stage for the more advanced indexing options we'll cover in this article.

We won't be able to cover all of the Elasticsearch indexing options in this article, since there are many advanced settings and every method has optional parameters that can be applied, but we'll look closely at some key areas you should have a grasp of for your Elasticsearch project.

Aliases

Aliases are kind of like nicknames you can give to indexes, groups of indexes, or even just parts of indexes. All of the index methods can take the name of an alias instead of an index name.

Aliases can be added at index creation or later, and they can also be updated or deleted. Before performing any of these operations, though, it's a good idea to run the get_aliases() method to see what aliases already exist across all of your indexes. That way you can make sure you're not inadvertently associating an index with an existing alias, or you can confirm that an alias does exist and verify which indexes are associated with it.

Since we know we don't have any aliases (we haven't created any in the previous articles or in this one yet), we can give our "top_films" index an alias by using the put_alias() method:

# Give existing index an alias
my $response = $es->indices->put_alias(  
    index => 'top_films',
    name  => 'imdb_data'
);

In the example above, we're using the put_alias() method of indices in the "$es" client object we created in Part 1 of this series. We're setting our "top_films" index to have an alias named "imdb_data". Because we're logging all the server interactions and responses to our "$es" client object, we can check the log file to verify that our call was successful:

[Thu Dec 31 17:21:53 2015] # Request to: https://aws-us-east-1-portal10.dblayer.com:10019
curl -XPUT 'http://localhost:9200/top_films/_alias/imdb_data?pretty=1'

[Thu Dec 31 17:21:53 2015] # Response: 200, Took: 166 ms
# {
#    "acknowledged" : true
# }

Now let's use get_aliases() to confirm:

# Get aliases
my $response = $es->indices->get_aliases();  

And the response:

[Thu Dec 31 17:25:18 2015] # Request to: https://aws-us-east-1-portal10.dblayer.com:10019
curl -XGET 'http://localhost:9200/_aliases?pretty=1'

[Thu Dec 31 17:25:18 2015] # Response: 200, Took: 89 ms
# {
#    "top_films" : {
#       "aliases" : {
#          "imdb_data" : {}
#       }
#    }
# }
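
If we ever want to remove that association, there's a delete_alias() method for that. We won't run it here, since we want to keep our alias for the rest of this article, but a minimal sketch of removing the alias we just created would look like this:

# Remove an alias (sketch - not run here)
my $response = $es->indices->delete_alias(
    index => 'top_films',
    name  => 'imdb_data'
);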

Our example here added an alias to one index. You can also use an alias across multiple indexes by assigning it to each of them. You can also create what's called a filtered alias, where the alias refers to only part of an index by applying a filter to it. For example, you might want to alias only the documents that have a specific field value: we could apply a filtered alias on our "top_films" index to refer only to films that were released in 1975, or only to films with a 9.0 rating. You can learn more about aliases in the Elasticsearch documentation.
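
To give a flavor of filtered aliases, here's a sketch of what aliasing just the 1975 releases might look like. The alias name "films_1975" is our own invention for illustration, and since our "year" field was automatically mapped as a string, we filter on the string value:

# Create a filtered alias (sketch)
my $response = $es->indices->put_alias(
    index => 'top_films',
    name  => 'films_1975',
    body  => {
        filter => {
            term => { year => '1975' }
        }
    }
);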

Mappings

When we ran the get() index method in our previous article, we saw how Elasticsearch automatically inferred mappings for our "top_films" index. Now, if we want to add new fields to our mappings, such as a description field, we can use the put_mapping() method to do that:

# Update mapping with a new field
my $response = $es->indices->put_mapping(  
    index => 'top_films',
    type  => 'film',
    body  => {
        film => {
            properties => {
                description => { type => 'string' }
            }
        }
    }
);

Above, we're using the put_mapping() method of indices to add a new field mapping to the "top_films" index for the "film" document type. We are adding the "description" field with a string data type as a film property. Our log shows success:

[Thu Dec 31 18:51:04 2015] # Request to: https://aws-us-east-1-portal7.dblayer.com:10304
curl -XPUT 'http://localhost:9200/top_films/_mapping/film?pretty=1' -d '  
{
   "film" : {
      "properties" : {
         "description" : {
            "type" : "string"
         }
      }
   }
}
'

[Thu Dec 31 18:51:04 2015] # Response: 200, Took: 205 ms
# {
#    "acknowledged" : true
# }

Now that we have this new field, we can use the update() document method, briefly discussed in the Document Methods section of Part 2, to add descriptions to the documents in our index. Updating a document, in this case "The Matrix" (document id 17), would look like this:

# Update a document with a new field
my $response = $es->update(  
    index => 'top_films',
    type => 'film',
    id => 17,
    body => {
        doc => {
            description => 'A computer hacker learns from mysterious rebels about the true nature of his reality and his role in the war against its controllers.'
        }
    }
);

Using put_mapping(), we could also add a whole new document type to our index, with its own fields and data types. For example, we might decide to add a document type called "actors" with fields such as "name" and "age".
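
As a sketch, adding that hypothetical "actors" type might look like the following (the field names and data types are just illustrative):

# Add a new document type to the index (sketch)
my $response = $es->indices->put_mapping(
    index => 'top_films',
    type  => 'actors',
    body  => {
        actors => {
            properties => {
                name => { type => 'string' },
                age  => { type => 'integer' }
            }
        }
    }
);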

What we can't do with put_mapping() is change the document types, fields, and data types that are already there. So, if we wanted to make the "year" field a numeric data type, we're out of luck. To get around this, we'd need to either create the mappings we want from the start using the create() index method we learned about in the last article, or at least turn off the automatic mapping functionality by creating a new index with that setting.
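
For example, a from-scratch index with automatic mapping turned off and "year" mapped as an integer might be created with a sketch like this one (the "top_films_v2" index name is hypothetical, and we'd then need to reindex our documents into it):

# Create an index with explicit mappings and no automatic mapping (sketch)
my $response = $es->indices->create(
    index => 'top_films_v2',
    body  => {
        settings => {
            'index.mapper.dynamic' => 'false'
        },
        mappings => {
            film => {
                properties => {
                    title => { type => 'string' },
                    year  => { type => 'integer' }
                }
            }
        }
    }
);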

For many people just getting started with Elasticsearch, or for those who just want the ease of use they're accustomed to with the flexible-schema JSON model, automated mappings probably aren't much of an issue, and they actually help you kickstart an index. However, as you begin to need to dial in more relevant and precise search result sets, schema becomes very important and you're going to want to get into the guts of your mappings ahead of time.

A last gotcha on mappings: if you use the same field name across more than one document type in your index, that field has to use the same data type and analysis tools in each. For example, if we had a "film" document type and an "actor" document type that both had a "name" field, both fields would need to use the same data type (probably string in this case, so that's not such a problem) and the same analysis tools (analyzers, tokenizers, and filters, which can impact indexing and searching significantly; we'll touch on these below). The Elasticsearch documentation about mappings is required reading for anyone who wants more than just automated mappings in their index.

Index modules

Index modules hold all the settings information for your indexes. You can set things like the index refresh interval, whether or not automatic mapping kicks in, and even override the creation date of the index if that's useful for your situation (though we're not sure what a valid use case for that would be). We won't get too far into the weeds here since there are a lot of different settings, but since we've mentioned turning off automatic mapping a couple of times now, let's use that as an example for configuring index settings:

# Turn off automatic mappings
## Close index first
my $close_response = $es->indices->close(  
    index => 'top_films'
);

## Set automatic mappings to false
my $response = $es->indices->put_settings(  
    index => 'top_films',
    body => {
        "index.mapper.dynamic" => "false"
    }
);

## Open index again
my $open_response = $es->indices->open(  
    index => 'top_films'
);

Note that before applying our new setting, we first have to close the index and then open it again afterward. This is because some index settings are considered "static", meaning they can't be changed on the fly against an index that is in use. By closing the index first, we momentarily take it offline to make our change, and then we open it up for business again. If we don't do this, we'll get an error telling us that we "can't update non-dynamic settings for open indices".

Here is what our log file shows:

[Thu Dec 31 19:39:47 2015] # Request to: https://aws-us-east-1-portal10.dblayer.com:10019
curl -XPOST 'http://localhost:9200/top_films/_close?pretty=1'

[Thu Dec 31 19:39:47 2015] # Response: 200, Took: 129 ms
# {
#    "acknowledged" : true
# }

[Thu Dec 31 19:39:47 2015] # Request to: https://aws-us-east-1-portal7.dblayer.com:10304
curl -XPUT 'http://localhost:9200/top_films/_settings?pretty=1' -d '  
{
   "index.mapper.dynamic" : "false"
}
'

[Thu Dec 31 19:39:47 2015] # Response: 200, Took: 198 ms
# {
#    "acknowledged" : true
# }

[Thu Dec 31 19:39:47 2015] # Request to: https://aws-us-east-1-portal10.dblayer.com:10019
curl -XPOST 'http://localhost:9200/top_films/_open?pretty=1'

[Thu Dec 31 19:39:47 2015] # Response: 200, Took: 105 ms
# {
#    "acknowledged" : true
# }
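
As an aside, "dynamic" settings don't need the close/open dance at all. For example, changing the number of replicas, which is a dynamic setting, can be done against an open index. A minimal sketch:

# Change a dynamic setting on an open index (sketch)
my $response = $es->indices->put_settings(
    index => 'top_films',
    body  => {
        'index.number_of_replicas' => 1
    }
);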

Note that turning off automatic mapping will only apply to new document types and new fields added to our index; existing ones are already set. That's why, if you have specific requirements, you may want to create an index with this setting from the start, and use the index create() method to specify your field mappings and data types before indexing any documents.

To learn more about index modules, visit the Elasticsearch documentation.

Analysis

Analysis is really where the power of Lucene makes itself evident in Elasticsearch. Analysis consists of a set of tools (analyzers, tokenizers, and filters) you can use to manipulate your document content to improve indexing and, in turn, search results. An analyzer includes a tokenizer and may optionally have one or more filters. There are several built-in analyzers available in Elasticsearch, or you can create your own by mixing and matching the built-in tokenizers and filters.

Let's start by getting a basic understanding of tokenizers. Tokenizers determine how a field value gets broken up into tokens for indexing. In some cases, we may want each word to be a token and indexed separately, so we would use the whitespace tokenizer to make each word its own token (words being delimited by whitespace). In other cases, we might want the entire value to be a single token, which can be useful for indexing unique identifiers, so we would use the keyword tokenizer.

We may then want to apply one or more filters on top of that. The stemmer filter is a common one used in search. For example, if we had the word "tokenizer" as a token, we might want to stem it to just "tokenize" so that we could get matches on "tokenize", "tokenizes", "tokenized", and "tokenizer". Another common filter removes stopwords: words we don't want to index, typically common words that don't add much value, like "a", "the", and "and". We can even use filters to strip out or replace characters within tokens. One we may want to consider is the mapping character filter, which could catch both the American and British spellings of "tokenize" so that we also match on "tokenise".
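
To make the filter discussion concrete, here's a sketch of what an analysis configuration combining a stopword filter and a stemmer might look like. We've used the conventional layout from the Elasticsearch documentation here (an "analysis" section at the root of the settings body), and the "english_light", "my_stop", and "my_stemmer" names are our own inventions. As we'll see shortly, the index would also need to be closed before applying settings like these:

# Sketch: a custom analyzer with stopword and stemmer filters
my $response = $es->indices->put_settings(
    index => 'top_films',
    body  => {
        analysis => {
            analyzer => {
                english_light => {
                    type      => 'custom',
                    tokenizer => 'standard',
                    filter    => ['lowercase', 'my_stop', 'my_stemmer']
                }
            },
            filter => {
                my_stop => {
                    type      => 'stop',
                    stopwords => ['a', 'an', 'the']
                },
                my_stemmer => {
                    type     => 'stemmer',
                    language => 'english'
                }
            }
        }
    }
);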

If no analyzer is specified, Elasticsearch will default to the built-in standard analyzer. This analyzer uses the standard tokenizer, the lowercase token filter, and the stop token filter (whose stopword list is empty by default, so no stopwords are removed unless you configure some). It also includes the standard token filter, which is currently just an empty placeholder, there in case the Elasticsearch developers find a use for it. The short version of what this analyzer does is that it breaks up text using a grammar-based Unicode algorithm that works for most European languages to create meaningful tokens, then lowercases all of them and removes any configured stopwords.

In our case, we're going to create a simple custom analyzer to demonstrate how to add analysis tools to our index settings. We'll use the put_settings() method like this:

# Add analysis tools
## Close index first
my $close_response = $es->indices->close(  
    index => 'top_films'
);

## Add analyzer
my $analysis_response = $es->indices->put_settings(  
    index => 'top_films',
    body  => {
        film => {
            properties => {
                title => {
                    analysis => {
                        analyzer => {
                            keyword_lower => {
                                type      => 'custom',
                                tokenizer => 'keyword',
                                filter    => 'lowercase'
                            }
                        },
                        tokenizer => {
                            keyword => {
                                type => 'keyword'
                            }
                        },
                        filter => {
                            lowercase => {
                                type => 'lowercase'
                            }
                        }
                    }
                }
            }
        }
    }
);

## Open index again
my $open_response = $es->indices->open(  
    index => 'top_films'
);

In this case, we added a custom analyzer to our title field mapping, which we named "keyword_lower". It uses the keyword tokenizer and the lowercase filter. Note that before applying our new analyzer, we had to close the index first, like we did above, and then open it again afterward. That's because analysis settings are "static".

Here is what our log file shows:

[Thu Dec 31 19:56:01 2015] # Request to: https://aws-us-east-1-portal7.dblayer.com:10304
curl -XPOST 'http://localhost:9200/top_films/_close?pretty=1'

[Thu Dec 31 19:56:02 2015] # Response: 200, Took: 165 ms
# {
#    "acknowledged" : true
# }

[Thu Dec 31 19:56:02 2015] # Request to: https://aws-us-east-1-portal10.dblayer.com:10019
curl -XPUT 'http://localhost:9200/top_films/_settings?pretty=1' -d '  
{
   "film" : {
      "properties" : {
         "title" : {
            "analysis" : {
               "analyzer" : {
                  "keyword_lower" : {
                     "type" : "custom"
                     "tokenizer" : "keyword"
                     "filter" : "lowercase"
                  }
               }
               "tokenizer" : {
                  "keyword" : {
                     "type" : "keyword"
                  }
               }
               "filter" : {
                  "lowercase" : {
                     "type" : "lowercase"
                  }
               }
            }
         }
      }
   }
}
'

[Thu Dec 31 19:56:02 2015] # Response: 200, Took: 203 ms
# {
#    "acknowledged" : true
# }

[Thu Dec 31 19:56:02 2015] # Request to: https://aws-us-east-1-portal7.dblayer.com:10304
curl -XPOST 'http://localhost:9200/top_films/_open?pretty=1'

[Thu Dec 31 19:56:02 2015] # Response: 200, Took: 117 ms
# {
#    "acknowledged" : true
# }

Finding the right combination of analysis tools comes from experience with search applications accumulated over time, and from some trial and error with your given content. You may want to test out some configurations before applying analysis tools to your index. You can do that by running the analyze() method against some document content and tweaking the analysis tools until you feel the result is appropriate for your use case, all without committing anything to your index until you're ready. Here's an example of how we tested the analyzer we created above:

# Try out analysis tools
my $results = $es->indices->analyze(  
    index => 'top_films',
    tokenizer => 'keyword',
    filters   => ['lowercase'],
    body  => 'The Matrix'
);

The results we got back show how the title "The Matrix" will be indexed with these analysis tools:

[Thu Dec 31 19:24:45 2015] # Request to: https://aws-us-east-1-portal10.dblayer.com:10019
curl -XGET 'http://localhost:9200/top_films/_analyze?filters=lowercase&pretty=1&tokenizer=keyword' -d '  
The Matrix'

[Thu Dec 31 19:24:45 2015] # Response: 200, Took: 481 ms
# {
#    "tokens" : [
#       {
#          "token" : "the matrix",
#          "position" : 1,
#          "end_offset" : 10,
#          "type" : "word",
#          "start_offset" : 0
#       }
#    ]
# }

As expected, we can see that our title "The Matrix" was tokenized as a single token and lowercased. Note that if you're trying out a custom analyzer, you'll need to pass in the tokenizer and filters your analyzer will use, like we showed above. If you're using one of the built-in analyzers instead, you only need to provide the analyzer's name to the analyze() method. If you call analyze() without passing either an analyzer or a combination of a tokenizer and filters, the standard analyzer will be used.
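
For comparison, testing a built-in analyzer is even simpler. A sketch of running the standard analyzer over the same title might look like this:

# Try a built-in analyzer by name (sketch)
my $results = $es->indices->analyze(
    index    => 'top_films',
    analyzer => 'standard',
    body     => 'The Matrix'
);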

You can get a deeper understanding of the analysis tools and the analyze option in the Elasticsearch documentation.

Warmers

Warmers are used to get the index ready for searching by front-loading some heavy or common searches.

There are many options for warming, using query, aggregation, and sort operations that let you dial in the most intensive requests ahead of time. A simple warmer we can apply to our example uses the match_all query. This warms all the documents, so you'll want to use it with care if the set you're indexing is very large. We have a small document set, so it makes sense for us:

# Create a warmer
my $response = $es->indices->put_warmer(  
    index   => 'top_films',
    type    => 'film',
    name    => 'match_all',
    body    => {
        query => {
            match_all => {}
        }
    }
);

And we get a success response:

[Thu Dec 31 19:57:48 2015] # Request to: https://aws-us-east-1-portal10.dblayer.com:10019
curl -XPUT 'http://localhost:9200/top_films/film/_warmer/match_all?pretty=1' -d '  
{
   "query" : {
      "match_all" : {}
   }
}
'

[Thu Dec 31 19:57:49 2015] # Response: 200, Took: 647 ms
# {
#    "acknowledged" : true
# }

The Elasticsearch documentation on warmers can help you figure out what might be best for your situation.
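
The client also has methods for inspecting and removing warmers. As a sketch, checking on the warmer we just created, and removing it if we changed our minds, might look like this (we won't actually delete ours, since the index listing below still shows it in place):

# Inspect a warmer (sketch)
my $warmer = $es->indices->get_warmer(
    index => 'top_films',
    name  => 'match_all'
);

# Remove a warmer (sketch - not run here)
# my $response = $es->indices->delete_warmer(
#     index => 'top_films',
#     name  => 'match_all'
# );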

Wrapping Up

So now, if we run get() on our index again, like we did in the previous article, we'll see all of our new settings as well as the ones that were automatically applied previously:

[Thu Dec 31 20:08:38 2015] # Request to: https://aws-us-east-1-portal10.dblayer.com:10019
curl -XGET 'http://localhost:9200/top_films?pretty=1'

[Thu Dec 31 20:08:38 2015] # Response: 200, Took: 408 ms
# {
#    "top_films" : {
#       "settings" : {
#          "index" : {
#             "number_of_replicas" : "2",
#             "version" : {
#                "created" : "1070399"
#             },
#             "creation_date" : "1452134910252",
#             "mapper" : {
#                "dynamic" : "false"
#             },
#             "number_of_shards" : "5",
#             "film" : {
#                "properties" : {
#                   "title" : {
#                      "analysis" : {
#                         "filter" : {
#                            "lowercase" : {
#                               "type" : "lowercase"
#                            }
#                         },
#                         "tokenizer" : {
#                            "keyword" : {
#                               "type" : "keyword"
#                            }
#                         },
#                         "analyzer" : {
#                            "keyword_lower" : {
#                               "tokenizer" : "keyword",
#                               "type" : "custom",
#                               "filter" : "lowercase"
#                            }
#                         }
#                      }
#                   }
#                }
#             },
#             "uuid" : "g6vweF9yQ2K5vzVVKpd_3w"
#          }
#       },
#       "aliases" : {
#          "imdb_data" : {}
#       },
#       "warmers" : {
#          "match_all" : {
#             "types" : [
#                "film"
#             ],
#             "source" : {
#                "query" : {
#                   "match_all" : {}
#                }
#             }
#          }
#       },
#       "mappings" : {
#          "film" : {
#             "properties" : {
#                "image" : {
#                   "properties" : {
#                      "url" : {
#                         "type" : "string"
#                      },
#                      "width" : {
#                         "type" : "string"
#                      },
#                      "height" : {
#                         "type" : "string"
#                      }
#                   }
#                },
#                "year" : {
#                   "type" : "string"
#                },
#                "can_rate" : {
#                   "type" : "string"
#                },
#                "rating" : {
#                   "type" : "string"
#                },
#                "title" : {
#                   "type" : "string"
#                },
#                "description" : {
#                   "type" : "string"
#                },
#                "tconst" : {
#                   "type" : "string"
#                },
#                "type" : {
#                   "type" : "string"
#                },
#                "num_votes" : {
#                   "type" : "string"
#                }
#             }
#          }
#       }
#    }
# }

We've got our automatic mapping set to false, our "keyword_lower" analyzer for the title field, our alias set to "imdb_data", our "match_all" query warmer, and even our new description field mapping.

Next

In this article we looked at how to use the Search::Elasticsearch Perl module to apply a variety of index options to an existing index, and we considered when creating an index from scratch with the preferred settings is the better approach. In our next article, we'll get into querying and search results using the Search::Elasticsearch Perl module.