Elasticsearch.pm - Part 2: Basic Document and Index Methods

This article looks at how you can use the Search::Elasticsearch Perl module to create, update, and delete indexes, as well as how to perform some common index management operations. The previous article in this series touched on installing the Search::Elasticsearch Perl module, connecting to Elasticsearch on Compose, and monitoring the cluster using some built-in methods. We also made use of the client's trace_to and log_to parameters, which rely on Perl's Log::Any module, to track our server interactions and HTTP requests and responses. In this article, we'll build on the client object we created last time to cover some basic aspects of indexing.

For those of you who like to follow along with the official documentation, our examples below use the Elasticsearch Document APIs as well as the Indices APIs via methods exposed in the Search::Elasticsearch Perl module. The Perl module methods for documents are part of the Search::Elasticsearch::Client::2_0::Direct sub-module and the methods for indices are part of the Search::Elasticsearch::Client::2_0::Direct::Indices sub-sub-module. Got that? Great! Time to dive in!

Basic Document Methods

When indexing documents, we have the option to create, index, update, or delete a document. It can get a little confusing, particularly between creating and indexing, unless you understand what each does: create() adds a document only if no document with that id exists yet and fails otherwise, index() adds the document or replaces an existing one with the same id, update() changes an existing document, and delete() removes it. Note, too, that when you use create() or index() on a document and the index does not already exist, these methods create it automatically with default settings. Be aware of this because it may not be what you intend, but it can be a quick-and-easy way to kickstart an index.

So, depending on what you intend to do with a document, make sure to choose the right method.
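To make the create-versus-index distinction concrete, here is a minimal sketch of a helper that adds a film only when its id is still free. The helper name add_film_once is hypothetical, "$es" is assumed to be the client object from Part 1, and the sketch assumes that a failed create() surfaces as a Search::Elasticsearch error object whose is() method answers true for 'Conflict':

```perl
use strict;
use warnings;

# Hypothetical helper: add a film only if its id is not taken yet.
# $es is assumed to be the Search::Elasticsearch client from Part 1.
sub add_film_once {
    my ( $es, $id, $film ) = @_;

    # create() refuses to overwrite, so wrap it in eval to catch the error.
    my $response = eval {
        $es->create(
            index => 'top_films',
            type  => 'film',
            id    => $id,
            body  => $film,
        );
    };
    return $response if $response;

    my $error = $@;

    # Assumption: an existing id is reported as a "Conflict" error object.
    return if ref $error && eval { $error->is('Conflict') };

    die $error;    # anything else is unexpected, so re-throw it
}

# Usage with a live client (not run here):
# my $created = add_film_once( $es, 17, { title => 'The Matrix' } );
# print defined $created ? "created\n" : "already there, left untouched\n";
```

Calling index() with the same arguments would instead replace the stored document, so a helper like this only makes sense when overwriting is not what you want.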

In our previous article, we created a simple index with one document using the index() method. For this article, we used the same method to index several documents: the top 250 films according to IMDb voters. We were able to acquire a JSON file (thanks to our illustrious content curator) where each film is formatted as a JSON document:

{
  "title": "The Matrix",
  "num_votes": 1125526,
  "rating": 8.7,
  "year": "1999",
  "type": "feature",
  "can_rate": true,
  "tconst": "tt0133093",
  "image": {
    "url": "http://ia.media-imdb.com/images/M/MV5BMTkxNDYxOTA4M15BMl5BanBnXkFtZTgwNTk0NzQxMTE@._V1_.jpg",
    "width": 334,
    "height": 500
  }
}

We used standard Perl commands to parse through the file and feed each document to an index called "top_films" that we automatically created with the index() method. Here's an example of one film being indexed via our "$es" client object that we created in Part 1 of this series:

# Index a document
my $response = $es->index(  
    index => 'top_films',
    type => 'film',
    id => 17,
    body => {
        title => 'The Matrix',
        num_votes => 1125526,
        rating => '8.7',
        year => 1999,
        type => 'feature',
        can_rate => 'true',
        tconst => 'tt0133093',
        image => {
            url => 'http://ia.media-imdb.com/images/M/MV5BMTkxNDYxOTA4M15BMl5BanBnXkFtZTgwNTk0NzQxMTE@._V1_.jpg',
            height => 500,
            width => 334
        }
    }
);

In the above call, we first capture the server response from the index() method in a variable called "$response". In our case, we're logging every interaction with the server, so this is redundant for us, but you may want to capture each server response individually, as shown here, for use in your script. We'll maintain this structure through all of our examples. Next, in our index() call, we specify the name of the index ("top_films" in this case, which will be automatically created by the index() method if it does not already exist), the type of document we're indexing (for us, that's a "film"), the document id (our Perl script is set to auto-increment the id for each film), and the body of the document with the fields and values from the JSON file.
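The loop that feeds each film to index() with an auto-incremented id could look roughly like the sketch below. It is not the script we actually ran: it uses the core JSON::PP module to parse the data, the helper name build_index_requests is hypothetical, the file is assumed to hold a JSON array, and the second film is a made-up sample row. The call to the server is left as a comment so the sketch runs without a cluster:

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Stand-in for the contents of the films file (assumed to be a JSON array).
my $json_text = '[{"title":"The Matrix","year":"1999"},'
              . '{"title":"Example Film","year":"2000"}]';

# Build one index() argument list per film, numbering the ids from 1.
sub build_index_requests {
    my ($films) = @_;
    my $id = 0;
    return map {
        +{ index => 'top_films', type => 'film', id => ++$id, body => $_ }
    } @$films;
}

my @requests = build_index_requests( decode_json($json_text) );
printf "%d: %s\n", $_->{id}, $_->{body}{title} for @requests;

# With a live client, each request would be sent like this:
# my $response = $es->index(%$_) for @requests;
```

Keeping the request-building step separate from the sending step makes it easy to inspect what will be indexed before anything reaches the server.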

Our log file shows us how the document got passed to Elasticsearch with a POST request. We then see the response. In this case, our response shows the 201 "created" from the server plus acknowledgement that the document was indexed:

[Thu Dec 31 11:49:29 2015] # Request to: https://aws-us-east-1-portal7.dblayer.com:10304
curl -XPOST 'http://localhost:9200/top_films/film/17?pretty=1' -d '  
{
   "year" : "1999",
   "tconst" : "tt0133093",
   "num_votes" : "1125526",
   "rating" : "8.7",
   "can_rate" : "true",
   "type" : "feature",
   "image" : {
      "height" : "500",
      "width" : "334",
      "url" : "http://ia.media-imdb.com/images/M/MV5BMTkxNDYxOTA4M15BMl5BanBnXkFtZTgwNTk0NzQxMTE@._V1_.jpg"
   },
   "title" : "The Matrix"
}
'

[Thu Dec 31 11:49:29 2015] # Response: 201, Took: 190 ms
# {
#    "_type" : "film",
#    "_index" : "top_films",
#    "_version" : 1,
#    "created" : true,
#    "_id" : "17"
# }

An important thing to note: as the trace above shows, every value, including the numbers, was sent to Elasticsearch surrounded by double quotes. Perl JSON encoders generally emit a scalar as a JSON string unless it has been used as a number, so values pulled out of a file as text arrive quoted. This also means that if your source text is already double-quoted, as ours was, you'll need to strip those quotes before passing the values to the module methods. We used Perl regex substitution in our script to strip out the double quotes before passing the document through the module to be indexed.
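The quoting behaviour is easiest to see in isolation. The sketch below uses the core JSON::PP module on its own as a stand-in; the assumption is that the encoder used by the client follows the same rule. Adding 0 to a value forces it into a number, and a reference to 1 or 0 is encoded as a JSON boolean:

```perl
use strict;
use warnings;
use JSON::PP;

# canonical() sorts the keys so the output order is predictable.
my $json = JSON::PP->new->canonical;

# Values captured as text are serialized as JSON strings.
my %as_text = ( num_votes => '1125526', rating => '8.7', can_rate => 'true' );
print $json->encode( \%as_text ), "\n";
# prints {"can_rate":"true","num_votes":"1125526","rating":"8.7"}

# Numeric context (0 + ...) yields JSON numbers; \1 and \0 yield booleans.
my %typed = (
    num_votes => 0 + '1125526',
    rating    => 0 + '8.7',
    can_rate  => \1,
);
print $json->encode( \%typed ), "\n";
# prints {"can_rate":true,"num_votes":1125526,"rating":8.7}
```

Converting values this way before calling index() is one option for getting numeric and boolean fields instead of strings; the other is to define the mappings yourself when creating the index.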

Basic Index Methods

For indexes, we can choose to create, get, or delete an index.

Since delete() is pretty self-explanatory and since we already created an index called "top_films" with default settings using the document index() method above, let's use get() on the index to see what we get back.

# Get index info
my $index_info = $es->indices->get(  
    index   => 'top_films'
);

In this example, we're using the get() method of indices for the "top_films" index. Our response looks like this:

[Thu Dec 31 16:24:49 2015] # Request to: https://aws-us-east-1-portal7.dblayer.com:10304
curl -XGET 'http://localhost:9200/top_films?pretty=1'

[Thu Dec 31 16:24:49 2015] # Response: 200, Took: 84 ms
# {
#    "top_films" : {
#       "aliases" : {},
#       "warmers" : {},
#       "settings" : {
#          "index" : {
#             "creation_date" : "1451591368501",
#             "version" : {
#                "created" : "1070399"
#             },
#             "number_of_shards" : "5",
#             "uuid" : "xiC6h9M0R7aDWxciAOHICQ",
#             "number_of_replicas" : "2"
#          }
#       },
#       "mappings" : {
#          "film" : {
#             "properties" : {
#                "rating" : {
#                   "type" : "string"
#                },
#                "can_rate" : {
#                   "type" : "string"
#                },
#                "num_votes" : {
#                   "type" : "string"
#                },
#                "year" : {
#                   "type" : "string"
#                },
#                "image" : {
#                   "properties" : {
#                      "url" : {
#                         "type" : "string"
#                      },
#                      "height" : {
#                         "type" : "string"
#                      },
#                      "width" : {
#                         "type" : "string"
#                      }
#                   }
#                },
#                "title" : {
#                   "type" : "string"
#                },
#                "type" : {
#                   "type" : "string"
#                },
#                "tconst" : {
#                   "type" : "string"
#                }
#             }
#          }
#       }
#    }
# }

In the above, we can see that there are four sections returned for our index: aliases, warmers, settings, and mappings. We'll get into each of these over the next couple of articles, but for now, notice that our "top_films" index has no aliases or warmers and that our settings look pretty basic for the 3-node cluster setup you get with Compose. The final section, mappings, tells us the properties (the fields and their data types) for the "film" document type.

One of the benefits of Elasticsearch (and one of the drawbacks, depending on your intention) is that it can determine mappings automatically based on the document fields that are loaded for a given index and document type. If you see automatic mappings as a drawback rather than a benefit for your situation, you can turn off automatic mapping via the index settings (we'll cover that in our next article). We also see that all of our field data types have been set to string, because every value reached Elasticsearch as a double-quoted JSON string.
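When an index has many fields, scanning the mappings by eye gets tedious. As a rough sketch, the hypothetical helper field_types below flattens the "properties" section of a get() response into dotted field names and their types. The sample hash is a trimmed copy of the response shown above; with a live client it would be the value returned by $es->indices->get():

```perl
use strict;
use warnings;

# Flatten a "properties" hash into dotted field names and their types.
sub field_types {
    my ( $properties, $prefix ) = @_;
    $prefix //= '';
    my %types;
    for my $field ( sort keys %$properties ) {
        my $spec = $properties->{$field};
        if ( $spec->{properties} ) {
            # Nested object: recurse and prefix the child field names.
            %types = ( %types,
                field_types( $spec->{properties}, "$prefix$field." ) );
        }
        else {
            $types{"$prefix$field"} = $spec->{type};
        }
    }
    return %types;
}

# Trimmed copy of the get() response for "top_films".
my $index_info = {
    top_films => {
        mappings => {
            film => {
                properties => {
                    title => { type => 'string' },
                    image => {
                        properties => {
                            url   => { type => 'string' },
                            width => { type => 'string' },
                        },
                    },
                },
            },
        },
    },
};

my %types = field_types( $index_info->{top_films}{mappings}{film}{properties} );
print "$_ => $types{$_}\n" for sort keys %types;
```

For the trimmed sample this lists image.url, image.width, and title, all as string, which makes an unexpected type easy to spot.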

So, if you're thinking "Oh... that's not quite what I expected", then you might want to use the create() method of indices to create the index with the configurations you want it to have rather than let it be created automatically with default settings like we did. This is more of an advanced undertaking and requires familiarity with Elasticsearch indexing methods and settings as well as with the document set you're indexing. We'll be covering indexing options in more detail in the next article in this series to help you decide. For many situations, the default settings work just fine, but if you want to use create(), here's a pseudo-code outline for how to do it (we've only populated a single field and data type mapping to give you the gist of how it works):

# Create index (basic outline in pseudo-code)
my $response = $es->indices->create(  
    index   => 'top_films',
    body    => {
        settings => {
            #index settings go here
        },
        aliases => {
            #aliases go here
        },
        warmers => {
            #warmers go here
        },
        mappings => {
            "film" => {
                properties => {
                    "title" => { type => 'string' },
                    #other properties go here
                }
            },
            #other document type mappings go here
        }
    }   
);

Now, rather than using create(), you might be wondering if we can use some kind of update() method to adjust our existing index. There are actually several ways to update an existing index, but there is not a specific update() method. In the next article that gets into index options and the final article that dives into querying, we'll look at what some of the index configurations do and consider whether we can (or want to) update our existing index. In some cases, the create() index method is the only way to get the index to be exactly how you might want it. Just something to keep in mind as you're planning your Elasticsearch project.

Exists

If you're not one of those people who keeps a flawless running tally in your mind of all the documents in all the indexes, then the various exists() methods will be your friends. All of these methods return a boolean value: 1 for "exists" and a false value for "does not exist". Let's start with finding out whether an index by a particular name already exists:

# Check if index exists
my $index_exists = $es->indices->exists(  
    index => 'top_films'
);

Here we're using the exists() method of indices. We're checking to see whether an index called "top_films" exists (what do you wager?). Our "$index_exists" variable will hold either 1 or a false value as a result, but since we're tracing and logging to a file, we can check the response there:

[Thu Dec 31 14:32:29 2015] # Request to: https://aws-us-east-1-portal10.dblayer.com:10019
curl -XHEAD 'http://localhost:9200/top_films?pretty=1'

[Thu Dec 31 14:32:29 2015] # Response: 200, Took: 420 ms
# 1

As you can see, because the index exists, we get a 200 "ok" response and also a "1" returned.

By tweaking the above slightly and running the exists_type() method instead, we can also find out if a particular document type already exists in the index (by default, the index itself must exist for that to happen):

# Check if document type in the index exists
my $type_exists = $es->indices->exists_type(  
    index => 'top_films',
    type => 'films'
);

Here's what we get in the log file if either the index or the document type does not exist (a 404 response and an empty, false return value), which is what happens here because of a typo in "type" ("films" instead of "film"):

[Thu Dec 31 14:29:06 2015] # Request to: https://aws-us-east-1-portal7.dblayer.com:10304
curl -XHEAD 'http://localhost:9200/top_films/films?pretty=1'

[Thu Dec 31 14:29:06 2015] # Response: 404, Took: 400 ms
# 

And finally, you can check if a particular document exists in the given index:

my $doc_exists = $es->exists(  
    index   => 'top_films',
    type    => 'film',
    id      => '17'
);

Notice that, because the above request is for a document rather than an index, we do not use indices, but we do have to provide a document id. We also have to specify what type of document it is.

Knowing whether or not the index, document type, or document exists will help us decide what to do next. Do we update or overwrite the document, add new documents, delete and recreate the index, or something else? We have all the methods described above at our disposal.
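As one example of acting on an exists() check, the sketch below creates an index only when it is missing. The helper name ensure_index is hypothetical, "$es" is again assumed to be the client from Part 1, and the body passed in stands for whatever settings and mappings you have chosen:

```perl
use strict;
use warnings;

# Hypothetical helper: create the index only if it is not there yet.
# Returns a short status string so the caller can log what happened.
sub ensure_index {
    my ( $es, $name, $body ) = @_;

    # exists() returns a true value when the index is present.
    return 'exists' if $es->indices->exists( index => $name );

    $es->indices->create( index => $name, body => $body );
    return 'created';
}

# Usage with a live client (not run here):
# my $status = ensure_index( $es, 'top_films', { settings => {} } );
# print "top_films: $status\n";
```

Because another process could create the index between the check and the create() call, a production script would probably also want to handle the error that create() raises in that case.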

Common Index Management Operations

To wrap up this article on basic document and index methods, we'll touch on common index management operations. As with any index, the common operational functions you'd expect are available through the Search::Elasticsearch Perl module. These include clearing the cache with clear_cache(), flushing the index with flush(), refreshing the index with refresh(), and merging index segments to optimize performance with forcemerge(). These methods follow the same syntax we've seen in this article for the other index methods. Here's the syntax for an index refresh:

# Refresh existing index
my $response = $es->indices->refresh(  
    index => 'top_films'
);

Our log file shows us the result of the refresh:

[Thu Dec 31 16:39:50 2015] # Request to: https://aws-us-east-1-portal10.dblayer.com:10019
curl -XPOST 'http://localhost:9200/top_films/_refresh?pretty=1'

[Thu Dec 31 16:39:50 2015] # Response: 200, Took: 106 ms
# {
#    "_shards" : {
#       "successful" : 11,
#       "failed" : 0,
#       "total" : 15
#    }
# }
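If you capture the response in "$response" as shown, you can report the shard counts from your script rather than reading them out of the log. The helper name shards_summary below is hypothetical, and the sample hash simply copies the "_shards" section from the trace above:

```perl
use strict;
use warnings;

# Hypothetical helper: summarize the "_shards" section of a response.
sub shards_summary {
    my ($response) = @_;
    my $shards = $response->{_shards} || {};
    return sprintf '%d of %d shards refreshed, %d failed',
        $shards->{successful} // 0,
        $shards->{total}      // 0,
        $shards->{failed}     // 0;
}

# Copy of the refresh response shown in the trace above.
my $response = { _shards => { successful => 11, failed => 0, total => 15 } };
print shards_summary($response), "\n";
# prints 11 of 15 shards refreshed, 0 failed
```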

You can learn more about the management operations functions in the Elasticsearch documentation on index status management.

Next

We hope this article has helped bring some clarity to basic indexing functionality in Elasticsearch and, more specifically, given you the tools to perform indexing tasks through the Search::Elasticsearch perl module for your Elasticsearch project. In the next article, we'll look at more advanced index options using Search::Elasticsearch. After that, in our final article of the series, we'll get into querying our index and looking at some search options.

If you're using Elasticsearch.pm or just have an example to share about how you've configured your Elasticsearch indexes, we'd love to hear about it via our Write Stuff program. Do tell!