Using Query String Queries in Elasticsearch
In Elasticsearch, query string queries are their own breed of query - loads of functionality for full text search rolled into one sweet little package. In this article, we'll take a closer look at why query string queries are special and how you can make use of them.
Search Lite
Elasticsearch: The Definitive Guide explains that the query string query type uses what they call "Search Lite", where all the query parameters are passed in the query string. Because of this, query string queries use a different syntax than the standard request body we've covered in previous articles, such as Elasticsearch Query-Time Strategies and Techniques for Relevance: Part I and Part II.
Note that the request body format for querying is the recommended approach from Elasticsearch since it is considered robust and provides extensive functionality. However, if you just need to do a quick-and-dirty full text search that has some power behind it, then using the "q" parameter in search (the query string query shortcut) is the way to go. Generally, query string queries, and their cousins (simple query string queries), will be most effective when used in development or QA testing, or when made available to power users who know the syntax like the back of their hand.
Let's take a closer look at this query type to understand what it can do for us by searching against the IMDB Top 250 Films, which we have loaded into our Elasticsearch instance as an index called "top_films" with a document type named "film".
The query string "query"
If you look at the "Search" page in the Search APIs section of the Elasticsearch documentation, you'll notice all the examples there use the "q" parameter for search. This is a shortcut way of accessing query string queries. Using the "q" parameter for search is equivalent to the "query" option in JSON-formatted query string queries (which we'll cover in more detail later in the article when we look at the setting options).
We're going to start by exploring just the query and its syntax since that's the bare bones needed for query string queries, though you'll see that it's chock full of features just on its own.
Let's run through some basic default settings so we know where we stand when we construct a query using the query string query type:
- If no field is specified in the query, then the _all field is searched automatically. The _all field is a special field that is constructed by concatenating the values of all the other fields in your document so it's got all the terms found elsewhere, making it ideal for full text searching. The _all field is generated in the background when your document is indexed unless you've explicitly disabled it via the index metadata.
- If multiple fields are specified for search, then bool is automatically applied.
- Multiple query terms will be "OR"d together by default, unless you indicate that they make up a phrase by encapsulating them in double-quotes.
- All expanded terms are lowercased. Expanded terms will include those from stemming and fuzzy matches, for example.
There are some other more advanced usage defaults, but now that we know the basic ones, we'll use the "q" parameter -- the query string queries shortcut -- for some search examples. To follow along, just copy the HTTP connection string for your Elasticsearch deployment from the "Overview" page in the Compose administrative web console, supplying the username and password appropriate for your instance. In this first example, we'll show the full URL path so you can see how it's formed, but for the rest of the examples, we won't show the whole connection string, just the part starting with our "top_films" index through the "q" parameter.
Getting started, then... the simplest query we can construct is a single term query without any additional specification, for example:
https://admin:[password]@aws-us-east-1-portal10.dblayer.com:10019/top_films/film/_search?q=godfather
Here, we're just searching for "godfather". Since we didn't specify a particular field, the _all field will be used. As our result, we get 2 hits: The Godfather and The Godfather: Part II, since those are the only ones from that series which have made it into the top 250 films:
{
  "hits" : {
    "total" : 2,
    "max_score" : 1.1862482,
    "hits" : [
      {
        "_index" : "top_films",
        "_score" : 1.1862482,
        "_source" : {
          "title" : "The Godfather",
          "year" : "1972"
        },
        "_type" : "film",
        "_id" : "2"
      },
      {
        "_source" : {
          "title" : "The Godfather: Part II",
          "year" : "1974"
        },
        "_score" : 1.1862482,
        "_type" : "film",
        "_id" : "3",
        "_index" : "top_films"
      }
    ]
  }
}
If you'd like to know more about how the scores are arrived at for each of the hits, then check out our article How Scoring Works in Elasticsearch.
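If you're running searches from a script rather than a browser, pulling the titles and scores out of a response like the one above only takes a few lines. Here's a minimal Python sketch that assumes the response shape shown in this article. The sample response is pasted in as a string instead of being fetched from a live instance, and summarize_hits is just a name we've made up for illustration.

```python
import json

# Sample response, trimmed from the "godfather" search shown in the article.
RESPONSE_TEXT = """
{
  "hits": {
    "total": 2,
    "max_score": 1.1862482,
    "hits": [
      {"_index": "top_films", "_type": "film", "_id": "2", "_score": 1.1862482,
       "_source": {"title": "The Godfather", "year": "1972"}},
      {"_index": "top_films", "_type": "film", "_id": "3", "_score": 1.1862482,
       "_source": {"title": "The Godfather: Part II", "year": "1974"}}
    ]
  }
}
"""


def summarize_hits(response_text):
    """Return (total, [(title, year, score), ...]) from a search response body."""
    response = json.loads(response_text)
    hits = response["hits"]
    rows = [
        (hit["_source"]["title"], hit["_source"]["year"], hit["_score"])
        for hit in hits["hits"]
    ]
    return hits["total"], rows


total, rows = summarize_hits(RESPONSE_TEXT)
print(f"{total} hits")
for title, year, score in rows:
    print(f"{score:.4f}  {title} ({year})")
```

The same function should work for any of the responses in this article, since they all share the hits.total and hits.hits layout.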
Multiple terms
Now that we have our baseline for a "godfather" query, let's add to our original query a bit to create a multi-term query using the terms "godfather" and "part". We'll join the two terms with the + sign, which is the URL encoding for the space between them:
/top_films/film/_search?q=godfather+part
In this case, because multiple terms are "OR"d together by default, we get 3 hits - the 3 films that have either the term "godfather" or the term "part" in them: The Godfather and The Godfather: Part II which we saw before, plus Harry Potter and the Deathly Hallows: Part 2:
{
  "hits" : {
    "total" : 3,
    "hits" : [
      {
        "_index" : "top_films",
        "_type" : "film",
        "_id" : "3",
        "_score" : 1.6776081,
        "_source" : {
          "title" : "The Godfather: Part II",
          "year" : "1974"
        }
      },
      {
        "_index" : "top_films",
        "_source" : {
          "year" : "1972",
          "title" : "The Godfather"
        },
        "_score" : 0.41940203,
        "_type" : "film",
        "_id" : "2"
      },
      {
        "_index" : "top_films",
        "_source" : {
          "title" : "Harry Potter and the Deathly Hallows: Part 2",
          "year" : "2011"
        },
        "_score" : 0.35948747,
        "_id" : "216",
        "_type" : "film"
      }
    ],
    "max_score" : 1.6776081
  }
}
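To make the "OR" behavior concrete, here's a small local illustration in Python. It doesn't talk to Elasticsearch and it ignores analysis and scoring entirely. It just checks whether any (or all) of the query terms appear among a title's lowercased words. The title list and the match_titles helper are our own stand-ins for illustration.

```python
import re

# A handful of titles standing in for the indexed documents.
TITLES = [
    "The Godfather",
    "The Godfather: Part II",
    "Harry Potter and the Deathly Hallows: Part 2",
    "Chinatown",
]


def tokens(text):
    """Lowercase a string and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def match_titles(query, titles, operator="OR"):
    """Return titles containing any (OR) or all (AND) of the query terms."""
    terms = tokens(query)
    combine = any if operator == "OR" else all
    return [
        title
        for title in titles
        if combine(term in tokens(title) for term in terms)
    ]


or_hits = match_titles("godfather part", TITLES)
and_hits = match_titles("godfather part", TITLES, operator="AND")
print(or_hits)
print(and_hits)
```

With "OR", the three films from the result above come back. Requiring both terms narrows the list to The Godfather: Part II.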
Phrases
Let's try that same search again, but this time we'll use double quotes to indicate that the two words form a phrase (we're using URL encoding "%22" for double quotes and "%20" for a whitespace):
/top_films/film/_search?q=%22godfather%20part%22
Now that we've indicated to treat the two words as a phrase, we only get 1 hit back: The Godfather: Part II:
{
  "hits" : {
    "total" : 1,
    "max_score" : 2.3724964,
    "hits" : [
      {
        "_score" : 2.3724964,
        "_source" : {
          "title" : "The Godfather: Part II",
          "year" : "1974"
        },
        "_id" : "3",
        "_index" : "top_films",
        "_type" : "film"
      }
    ]
  }
}
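The difference between two loose terms and a phrase is that a phrase requires the terms to sit next to each other, in order. Here's a rough Python sketch of that idea, run locally against a few titles. It's a simplification for illustration (contains_phrase is a name we've invented), not how Lucene actually evaluates phrase queries.

```python
import re


def tokens(text):
    """Lowercase a string and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def contains_phrase(phrase, text):
    """True if the phrase's tokens appear consecutively and in order in text."""
    wanted = tokens(phrase)
    found = tokens(text)
    width = len(wanted)
    return any(
        found[i:i + width] == wanted
        for i in range(len(found) - width + 1)
    )


TITLES = [
    "The Godfather",
    "The Godfather: Part II",
    "Harry Potter and the Deathly Hallows: Part 2",
]

phrase_hits = [t for t in TITLES if contains_phrase("godfather part", t)]
print(phrase_hits)
```

Only The Godfather: Part II has "godfather" immediately followed by "part", which lines up with the single hit above.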
Fields
If we want to search in a specified field (or fields) for terms, we can indicate that with our query syntax. For example, we can search the title field for "godfather" and the year field for 1974 (the year that The Godfather: Part II was released). Notice the "%3A" URL encoding for a colon between the field name and the term:
/top_films/film/_search?q=title%3Agodfather+year%3A1974
In this case, we'll get 3 results as well since each part of the query is "OR"d together: our two Godfather films because they match the term "godfather" in the title and Part II also matches the term 1974 for the year, and the film Chinatown because it is the only other film in the top 250 films that was released in 1974:
{
  "hits" : {
    "total" : 3,
    "hits" : [
      {
        "_id" : "3",
        "_score" : 2.8478138,
        "_index" : "top_films",
        "_type" : "film",
        "_source" : {
          "year" : "1974",
          "title" : "The Godfather: Part II"
        }
      },
      {
        "_source" : {
          "year" : "1972",
          "title" : "The Godfather"
        },
        "_type" : "film",
        "_id" : "2",
        "_score" : 1.6665416,
        "_index" : "top_films"
      },
      {
        "_source" : {
          "year" : "1974",
          "title" : "Chinatown"
        },
        "_type" : "film",
        "_score" : 0.0906736600000001,
        "_id" : "122",
        "_index" : "top_films"
      }
    ],
    "max_score" : 2.8478138
  }
}
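Hand-encoding colons, quotes and spaces gets tedious quickly, so it can help to let a library do it. Here's a minimal Python sketch using the standard urllib.parse module to build the field query used above. The field_clause and search_path helpers are hypothetical names of our own, and the index and type names are the ones from this article.

```python
from urllib.parse import quote, quote_plus


def field_clause(field, term):
    """Build a field:term clause in query string syntax."""
    return f"{field}:{term}"


def search_path(index, doc_type, clauses):
    """Join clauses with spaces and URL-encode them into the q parameter."""
    query = " ".join(clauses)
    # quote_plus encodes ':' as %3A and spaces as '+'.
    return f"/{index}/{doc_type}/_search?q={quote_plus(query)}"


path = search_path(
    "top_films",
    "film",
    [field_clause("title", "godfather"), field_clause("year", "1974")],
)
print(path)

# quote() with no "safe" characters encodes spaces as %20 and quotes as %22,
# which is the style used for the phrase query earlier in the article.
phrase = quote('"godfather part"', safe="")
print(phrase)
```

Both "+" and "%20" stand for a space in a URL query string, so either style of encoding should reach Elasticsearch as the same query.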
Wildcards
Let's turn part of our query into a wildcard search. In this case we'll prefix "father" with the * special character so that any leading characters will match:
/top_films/film/_search?q=title%3A*father+year%3A1974
Now we get back 4 hits: the 3 we saw above, plus In the Name of the Father because it matched to our wildcard query for "*father" in the title:
{
  "hits" : {
    "max_score" : 1.4142135,
    "total" : 4,
    "hits" : [
      {
        "_index" : "top_films",
        "_id" : "3",
        "_score" : 1.4142135,
        "_type" : "film",
        "_source" : {
          "title" : "The Godfather: Part II",
          "year" : "1974"
        }
      },
      {
        "_id" : "2",
        "_index" : "top_films",
        "_score" : 0.35355338,
        "_type" : "film",
        "_source" : {
          "year" : "1972",
          "title" : "The Godfather"
        }
      },
      {
        "_score" : 0.35355338,
        "_index" : "top_films",
        "_id" : "122",
        "_source" : {
          "title" : "Chinatown",
          "year" : "1974"
        },
        "_type" : "film"
      },
      {
        "_index" : "top_films",
        "_id" : "185",
        "_score" : 0.35355338,
        "_type" : "film",
        "_source" : {
          "title" : "In the Name of the Father",
          "year" : "1993"
        }
      }
    ]
  }
}
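To see why the wildcard picked up an extra film, we can mimic the pattern match locally. This Python sketch uses the standard fnmatch module as a stand-in for the * wildcard and tests it against each word in a title. It's only an approximation for illustration (wildcard_hits is our own helper, and Elasticsearch matches against indexed terms, not raw titles).

```python
import re
from fnmatch import fnmatchcase

TITLES = [
    "The Godfather",
    "The Godfather: Part II",
    "Chinatown",
    "In the Name of the Father",
    "Mad Max: Fury Road",
]


def wildcard_hits(pattern, titles):
    """Return titles where at least one lowercased word matches the pattern."""
    pattern = pattern.lower()
    return [
        title
        for title in titles
        if any(
            fnmatchcase(word, pattern)
            for word in re.findall(r"\w+", title.lower())
        )
    ]


title_hits = wildcard_hits("*father", TITLES)
print(title_hits)
```

Since * also matches zero characters, the plain word "father" qualifies alongside "godfather". Chinatown isn't in this list because it only matched the year:1974 half of the real query.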
We can just keep building more and more complex queries in this way. Some options will increase the number of hits and some will decrease them. We won't go through examples of all the options available in the query syntax here, but we'll do a quick review of other special characters and syntax for more advanced functionality.
Other query options
- Boosting can be added to any term or field specified in the query when there is more than one. For example, in our query above where we searched for "godfather" in the title field and 1974 in the year field, we could add boosting to the title field, indicated by the caret ^ special character, to increase the scores of documents found with that match so they would be ranked higher than matches with the year field. Here's how that example could look, where we're doubling the significance of matches to "godfather" in the title field by boosting by 2:
  q=title%3Agodfather^2+year%3A1974
- Fuzziness is set to "AUTO" by default, which means that up to a maximum of 2 characters in a term may be replaced, removed or added, but the behavior is based on the length of the term specified in the query. For example, for a term with 0-2 characters, an exact match is required. For a term with 3-5 characters, only 1 character may be replaced, removed or added. For longer terms, up to 2 characters may change. You can indicate fuzziness for a term by using the ~ special character next to it and optionally setting the number of characters you'd like to try changing, which will override the default setting. An example would look like this:
  q=man~1
  Here we're searching for the term "man", but we're allowing for fuzziness of up to a 1 character change, so matches to terms like the following are possible: man, max, min, can, men, mad. By default up to 50 alternates will be generated for fuzzy matching.
- Proximity for a phrase can be indicated by using the ~ special character with a number indicating the allowable slop. For example, we used the phrase "godfather part" in one of our examples above to search for an exact match on the phrase. We can use slop to set the number of words that are allowed between the terms in our phrase. By doing so, the proximity of the terms becomes a mechanism for matching and scoring. We can set the slop like this:
  q=%22godfather%20part%22~5
  Notice the ~5 at the end of the phrase query. This will allow us to find any documents where the terms "godfather" and "part" are within 5 words of each other. Also, because we haven't specified a field here for our query, the _all field will be used. Due to the concatenation of field values in the _all field, which we mentioned above, we could potentially find a match to "godfather" in one field and "part" in another field, which just happen to be concatenated within 5 words of each other in the _all field. That's something to take into consideration when deciding whether to specify fields in your query or not.
- Ranges can be indicated by using square brackets [ and ] for inclusive searches and curly braces { and } for exclusive searches, written as "Minimum TO Maximum". As an example, we could do a range search in our year field. That query could look like this for an inclusive search for films released in the years 1970-1975:
  q=year%3A[1970%20TO%201975]
  Again, the "%3A" URL encoding is used for the colon between our field and value, and "%20" for the whitespace surrounding the "TO" part of our range. This query would return all films released in 1970, 1971, 1972, 1973, 1974, or 1975. By exchanging the square brackets for curly braces, we'll make the query exclusive:
  q=year%3A{1970%20TO%201975}
  Now only films released in 1971, 1972, 1973, or 1974 will be returned. Ranges can be used for integers, dates, and strings, and can also make use of the greater than >, less than <, greater than or equal to >= and less than or equal to <= operators.
- Negation of a term can also be used in a query. Using the example we just did above for ranges, let's say we wanted all the films released between 1970 and 1975, inclusive, except for the year 1974. We can negate that year by prefacing it as a term with the - special character like this:
  q=year%3A[1970%20TO%201975]-1974
  This query will yield all the films released in the years 1970, 1971, 1972, 1973, and 1975. 1974 has been negated.
- Order of operations, just like in mathematical formulas, can be indicated by using nested parentheses to group parts of a query. For example, we know there was a movie about old men or an old man, but we can't remember the title. We could search for "man" or "men" and "old" using parentheses to group each part of the query like this:
  q=(man+men)AND(old)
  Our result from the top 250 films will be No Country for Old Men.
- Regular expressions can also be used in these queries by encapsulating the pattern inside forward slashes /. Here's an example where we know the first word begins with an "m" and is followed by the word "max":
  q=\/m.*%20max\/
  Notice how we've escaped the forward slashes for the regular expression with backslashes and also that the whitespace is URL encoded as %20. In the top 250 films, this query will yield us Mad Max: Fury Road as well as a film entitled Mary and Max. Be aware that the .* wildcard is greedy so it will include all characters up until a match with " max" is found, which is why the " and" was matched in Mary and Max. We're showing this example because it can be quite an intensive query when field values are large, so use wildcards in regular expressions with caution.
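The inclusive and exclusive range behavior, along with negation, is easy to check with a few lines of Python. This is just a local sketch of the semantics described in the list above, run over a made-up list of years. The in_range helper is our own and has nothing to do with how Elasticsearch executes range queries internally.

```python
def in_range(value, low, high, inclusive=True):
    """Mimic [low TO high] (inclusive) or {low TO high} (exclusive)."""
    if inclusive:
        return low <= value <= high
    return low < value < high


# Hypothetical release years, chosen to sit on and around the boundaries.
YEARS = [1969, 1970, 1972, 1974, 1975, 1976]

inclusive_hits = [y for y in YEARS if in_range(y, 1970, 1975)]
exclusive_hits = [y for y in YEARS if in_range(y, 1970, 1975, inclusive=False)]

# Negation: keep the inclusive range but drop anything matching 1974.
negated_hits = [y for y in inclusive_hits if y != 1974]

print(inclusive_hits)
print(exclusive_hits)
print(negated_hits)
```

The boundary years 1970 and 1975 appear only in the inclusive result, and 1974 drops out once it's negated.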
As you can see, there's a lot of functionality available to you in just formulating query string queries and using the "q" parameter in search makes it pretty simple to do.
But what if you want to change some of those defaults we discussed? If, instead of using the "q" parameter in search as we've shown in all the examples above, you choose to construct a full-blown query string query, then you have many more options available to you. We'll look at these next.
Query string query settings
There are many different settings available in query string queries and you can set them as options during query construction. Let's have a closer look at some of them.
First, as we mentioned above, the default operator is "OR". We showed an example above where we specifically used "AND", but you can change the default operator to "AND". This is one of the settings that can also be changed for the "q" parameter so we'll show both methods.
Let's take the query we did above where we searched for the individual terms "godfather" and "part". If you remember, we got 3 hits: the 2 Godfather films and then also Harry Potter and the Deathly Hallows: Part 2 because it contains the term "part" and our query "OR"d the two terms. We can change that behavior by adding the default_operator setting as an additional parameter in the URL alongside our "q" parameter. It'd look like this:
/top_films/film/_search?default_operator=AND&q=godfather+part
Now, because we're "AND"ing together our two search terms by default, we'll only get The Godfather: Part II back since it's the only film that contains both terms.
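If you're building these URLs in code, the standard library can assemble and encode several parameters at once. Here's a short Python sketch using urllib.parse.urlencode. The search_path_with_params helper is a hypothetical name of our own, and it simply puts any extra parameters ahead of "q" to mirror the URL shown above.

```python
from urllib.parse import urlencode


def search_path_with_params(index, doc_type, query, **params):
    """Build a _search path with extra URL parameters followed by q."""
    pairs = list(params.items()) + [("q", query)]
    # urlencode escapes each value and joins the pairs with '&'.
    return f"/{index}/{doc_type}/_search?{urlencode(pairs)}"


and_path = search_path_with_params(
    "top_films", "film", "godfather part", default_operator="AND"
)
print(and_path)
```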
The default_operator setting is one of the only ones from the query string query settings that can be set as an additional parameter in the URL with the "q" parameter. All the other settings we'll cover below can only be used within the query string query construction, which is formatted as JSON. So, here's what our query would look like using that format:
{
  "query_string" : {
    "query" : "godfather part",
    "default_operator" : "AND"
  }
}
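To actually send a query string query like this, it goes in the body of a search request under a top-level "query" key. Here's a hedged Python sketch that builds such a request with the standard library. The base URL is a placeholder for your own deployment, build_search_request is a name we've made up, and the request is only constructed here, not sent.

```python
import json
from urllib.request import Request

# Placeholder address: substitute the connection string for your own instance.
BASE_URL = "http://localhost:9200"


def build_search_request(base_url, index, doc_type, query_string_options):
    """Wrap query_string options in a search body and prepare a POST request."""
    body = {"query": {"query_string": query_string_options}}
    return Request(
        f"{base_url}/{index}/{doc_type}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_search_request(
    BASE_URL,
    "top_films",
    "film",
    {"query": "godfather part", "default_operator": "AND"},
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the prepared request to urllib.request.urlopen would send it. We've left that step out so the sketch runs without a live instance.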
Now that you can see how the query string query construction is formatted, we won't go through all the settings here, but we'll look at a couple more of them below so you can get an idea of some different uses.
As we mentioned above, when multiple fields are specified in the query, bool is used by default. We can instead set multiple field queries to use the disjunction maximum function. To do this, set use_dis_max to "true" as follows:
{
  "query_string" : {
    "query" : "godfather mafia",
    "fields" : ["title", "description"],
    "use_dis_max" : "true"
  }
}
In this case, we're looking for "godfather" or "mafia" in either the title or the description field and we've indicated that we want to use disjunction maximum. For a discussion on using boolean versus dismax, have a look at our article on query-time strategies and techniques.
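As a rough intuition for the difference, a bool combination adds up the scores from each matching field, while disjunction maximum keeps the best field's score and optionally adds a fraction of the others. The Python sketch below is a deliberate simplification with made-up per-field scores. Real Lucene scoring involves more factors than this, and the function names are ours.

```python
def bool_score(field_scores):
    """Simplified bool-style combination: sum the per-field scores."""
    return sum(field_scores)


def dis_max_score(field_scores, tie_breaker=0.0):
    """Simplified dis_max: best score plus tie_breaker times the rest."""
    best = max(field_scores)
    return best + tie_breaker * (sum(field_scores) - best)


# Hypothetical scores for one document: a title match and a description match.
scores = [2.0, 0.5]

summed = bool_score(scores)
best_only = dis_max_score(scores)
with_tie_breaker = dis_max_score(scores, tie_breaker=0.5)

print(summed, best_only, with_tie_breaker)
```

With dis_max, a document that matches strongly in one field isn't outranked just because another document matched weakly in several fields.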
Another one we'll have a quick look at here is fuzziness. Fuzziness has a few different settings that you can alter. These include the fuzziness setting itself where you can change the default from "AUTO" to a specified character count you'd prefer to use, the fuzzy_max_expansions setting which you can alter from the default of 50 expansions to another number that better suits your queries and document set, and the fuzzy_prefix_length where you can set the number of characters at the beginning of terms which should not be changed for fuzzy matches. On that one, for example, you may want to set the prefix length as 1 character so that the first character of a term will not be changed for fuzzy matching. The default for prefix length is 0 so all characters in a term are candidates for changing unless you set this option. Here's an example containing these settings:
{
  "query_string" : {
    "query" : "man~",
    "fuzziness" : 2,
    "fuzzy_max_expansions" : 10,
    "fuzzy_prefix_length" : 1
  }
}
In the above example, we're doing a fuzzy match on the term "man", indicated by the ~ special character at the end of the term. We've changed the default fuzziness setting from "AUTO" to 2 specifying that we want to allow up to 2 characters to change. We've also lowered the default expansions from 50 to 10 and we've specified that the first character of the term cannot change.
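To get a feel for how the edit count and prefix length interact, here's a small Python sketch that filters a made-up vocabulary by edit distance. It uses plain Levenshtein distance as a simplification and doesn't model fuzzy_max_expansions at all, so treat it as an illustration of the idea and not as Elasticsearch's actual matching. All the function names are our own.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance: inserts, deletes and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete from a
                    current[j - 1] + 1,      # insert into a
                    previous[j - 1] + cost,  # substitute (or keep)
                )
            )
        previous = current
    return previous[-1]


def auto_fuzziness(term):
    """Allowed edits under "AUTO": 0 for 0-2 chars, 1 for 3-5, otherwise 2."""
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2


def fuzzy_matches(term, vocabulary, max_edits, prefix_length=0):
    """Words within max_edits of term that share its first prefix_length chars."""
    prefix = term[:prefix_length]
    return [
        word
        for word in vocabulary
        if word.startswith(prefix) and edit_distance(term, word) <= max_edits
    ]


# Hypothetical indexed terms.
VOCABULARY = ["man", "max", "min", "can", "men", "mad", "moon"]

one_edit = fuzzy_matches("man", VOCABULARY, max_edits=1)
two_edits_fixed_prefix = fuzzy_matches(
    "man", VOCABULARY, max_edits=2, prefix_length=1
)
print(one_edit)
print(two_edits_fixed_prefix)
```

Allowing 2 edits lets "moon" in, while fixing the first character keeps "can" out, which is the same trade-off the settings above are making.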
There are many other settings that are available, which you can read about in the official Elasticsearch documentation for query string queries. It's probably pretty clear now how powerful this query type is.
The primary drawback of query string queries, and why they're recommended only for development, QA testing, and knowledgeable power users, is that they can break easily with a simple typo... One slip of the syntax can yield zero results or send your Elasticsearch instance crunching on some heavy query that consumes all the memory. Because mistakes are easily made, Lucene (which runs under the hood of Elasticsearch) developed the SimpleQueryParser, whose purpose is to parse a string of human-readable text, no matter how poorly formatted, to produce a result. The simple query string query uses this special parser so that it ignores parts of the query that aren't formatted correctly. It's got much of the same functionality of the query string query type, but it's more like a laid-back cousin.
We've already covered a lot in this article so we'll have to save simple query string queries for another day.
Wrapping up
In this article we got deep into the syntax for using the "q" parameter in search, which is a shortcut for performing query string queries in Elasticsearch. We also looked at how to construct full-blown query string queries in JSON format and why and how you might want to change some of the default settings or use other setting options available to you. Query string queries can help you quickly test your Elasticsearch index and can be a boon for power users who want to have maximum functionality directly in the query syntax. Query well.