In his latest Compose Write Stuff article on Mastering PostgreSQL Tools, Lucero Del Alba writes about mastering full-text and phrase search in PostgreSQL 9.6.
Yes, PostgreSQL 9.6 has been finally rolled out on Compose, and with it, a whole set of features and improvements. In this installment of Mastering PostgreSQL Tools we'll see how PostgreSQL's full-text search features, and how they're expanded in version 9.6 for phrase search and word proximity.
The Good Ol' Text Search
You're probably familiar with pattern search, which has been part of the standard SQL since the beginning, and available to every single SQL-powered database:
SELECT column_name FROM table_name WHERE column_name LIKE 'pattern';
That will return the rows where
column_name matches the
pattern. No surprises here.
You have wildcards such as
% (as in
LIKE 'a%' to search for columns that start with "a"), and
_ (as in
LIKE '_r%' to find any values that have an "r" in the second position); and in PostgreSQL you can also use
ILIKE to ignore cases. This is simple enough and, hopefully, something you're fully familiar with.
But PostgreSQL is an RDBMS capable of much more than simply storing and retrieving data. Back in 2012, a set of text search functions and operators were introduced with version 9.2, which often go unnoticed, let's discuss them first and and see how they aid text search.
What Is Full-Text Search Anyway?
(...) full-text search refers to techniques for searching a single computer-stored document or a collection in a full text database; (...) distinguished from searches based on metadata or on parts of the original texts represented in databases (such as titles, abstracts, selected sections, or bibliographical references). Wikipedia
In other words, imagine you have a set of text documents stored in a database. These documents are not just meta-data items like an author name or a country of origin, but rather an abstract for an article, or full-text articles themselves, and you want to find out if certain words are present or not in them.
For example, you may want to search if the nouns "dog" and "fox" are present so if they are in their singular form, you'll find them with the
SELECT * FROM table_name WHERE column_name LIKE '%fox%' AND column_name LIKE '%dog%';
... but you'll also find stuff like "foxtrot" or "Dogville", which is not quite what you intended.
Another problem is that if you search for a word such as "query", and if it's present in its plural form "queries", then you won't find it if you try a simple pattern search with
LIKE, even though the word is, in fact, there. Some of you might be thinking to use regular expressions, and yes, you could do that, regular expressions are incredibly powerful, but also terribly slow.
A more effective way to approach this problem is by getting a semantic vector for all of the words contained in a document, that is, a language-specific representation of such words. So, when you search for a word like "jump", you will match all instances of the word and its tenses, even if you searched for "jumped" or "jumping". Additionally, you won't be searching the full document itself (which is slow), but the vector (which is fast).
That is, in a nutshell, the principle of full-text search. Let's see how it works.
PostgreSQL has two functions that do exactly what we intend to do:
to_tsvectorfor creating a list of tokens (the
tsvectordata type, where
tsstands for "text search");
to_tsqueryfor querying the vector for occurrences of certain words or phrases.
For example, to create a vector for the sentence "the quick brown fox jumped over the lazy dog", we can do the following:
SELECT to_tsvector('The quick brown fox jumped over the lazy dog.');
Which will return a vector where every token is a lexeme (unit of lexical meaning) with pointers (the positions in the document), and where words that carry little meaning, such as articles (the) and conjunctions (and, or) are conveniently omitted:
to_tsvector ------------------------------------------------------- 'brown':3 'dog':9 'fox':4 'jump':5 'lazi':8 'quick':2
Here, by default, every word is normalized as a lexeme in English (e.g. "jumped" becomes "jump"). But beware, as this might not be the case depending on the localization settings of your PostgreSQL installation. Also, should you be working with a different language other than English, PostgreSQL can handle it if you pass it in as an argument:
SELECT to_tsvector('Portuguese', 'Zebras caolhas de Java querem mandar fax para gigante em New York');
And that would return a vector normalized according to the rules of the Portuguese language:
to_tsvector ------------------------------------------------------------------------------------ 'caolh':2 'fax':7 'gigant':9 'jav':4 'mand':6 'new':11 'quer':5 'york':12 'zebr':1
Notice how "gigante" gets normalized to "gigant", as in Portuguese this word can take many forms—gigante (male, singular), giganta (female, singular), gigantes (male and male/female plural), gigantas (female plural). This right here, just so you know, is natural language processing (NLP) in action.
As we'll see later, you can pass actual fields from a database table to
to_tsvector(), just as you would do with any other SQL function.
Next thing in order to do full-text search, is querying the vector.
The next function that we're interested in, is
to_tsquery(), which accepts a list of words that will be checked against the normalized vector we created with
To do this, we'll use the
@@ operator to check if
tsvector. Let's try it with "fox":
SELECT to_tsvector('The quick brown fox jumped over the lazy dog') @@ to_tsquery('fox');
?column? ---------- t
That returned true (
t). Now with "foxes"...
SELECT to_tsvector('The quick brown fox jumped over the lazy dog') @@ to_tsquery('foxes');
?column? ---------- t
That also returns "true" because "foxes" is the plural form of "fox". But how about "foxtrot"?
SELECT to_tsvector('The quick brown fox jumped over the lazy dog') @@ to_tsquery('foxtrot');
?column? ---------- f
That's false because the search is smart enough not to match anything that simply starts with fox unless it's related to the same semantics (meaning) of the text originally vectorized; which is good because a fox (the animal) has little to do with foxtrot (the dance)!
And finally, now with "jumping":
SELECT to_tsvector('The quick brown fox jumped over the lazy dog') @@ to_tsquery('jumping');
?column? ---------- t
Good! After all, "jumping" is the present continuous form for "to jump".
Operators and Uses
tsquery also provides a set of operators that we would expect in any decent query facility.
You have an AND operator (&):
SELECT to_tsvector('The quick brown fox jumped over the lazy dog') @@ to_tsquery('fox & dog');
That returns true because both "fox" and "dog" are present in the document:
?column? ---------- t
You have OR operator (|):
SELECT to_tsvector('The quick brown fox jumped over the lazy dog') @@ to_tsquery('fox | clown');
Which returns also true because at least the word "fox" is present in the document:
?column? ---------- t
A NEGATION operator (!) is also available. This is how we check if the word "clown" is absent:
SELECT to_tsvector('The quick brown fox jumped over the lazy dog') @@ to_tsquery('!clown');
Which returns also "true":
?column? ---------- t
And we can, of course, combine them all. Let's query "fox AND (dog OR clown) AND NOT queen" (parenthesis aren't really necessary in this particular case, but they help to understand the query):
SELECT to_tsvector('The quick brown fox jumped over the lazy dog') @@ to_tsquery('fox & (dog | clown) & !queen');
All conditions met:
?column? ---------- t
All of what we've seen so far turns PostgreSQL, effectively, into a full-text search engine. But wait, there's more!
The New and Improved Phrase Search
So far you can find if a word or a set of words are present in a document, but what if you wanted to search for a phrase?
Say you're searching for something like "jumping quickly" or "to jump very quickly", if you search for
(jump & quick) with the techniques described so far you might find documents containing these words, but you'll get any random configuration within the document no matter if they are syntactically connected or not. To address this issue, PostgreSQL 9.6 introduced the
<-> operator to further enhance
tsquery by allowing you to search for word proximity or distance. Let's see how it works with some examples.
Let's Create the Documents
To make it simple let's say we have the following documents:
id | text | tokens ---|------------------------------------------|----------------------------------------------------------- 1 | Pack my box with five dozen liquor jugs. | 'box':3 'dozen':6 'five':5 'jug':8 'liquor':7 'pack':1 2 | Jackdaws love my big sphinx of quartz. | 'big':4 'jackdaw':1 'love':2 'quartz':7 'sphinx':5 3 | The five boxing wizards jump quickly. | 'box':3 'five':2 'jump':5 'quick':6 'wizard':4 4 | How vexingly quick daft zebras jump! | 'daft':4 'jump':6 'quick':3 'vex':2 'zebra':5 5 | Bright vixens jump; dozy fowl quack. | 'bright':1 'dozi':4 'fowl':5 'jump':3 'quack':6 'vixen':2
So let's create a table for storing all of this (notice the
tsvector data type for the
CREATE TABLE documents ( document_id SERIAL, document_text TEXT, document_tokens TSVECTOR, CONSTRAINT documents_pkey PRIMARY KEY (document_id) )
Now let's insert the documents into it:
INSERT INTO documents (document_text) VALUES ('Pack my box with five dozen liquor jugs.'), ('Jackdaws love my big sphinx of quartz.'), ('The five boxing wizards jump quickly.'), ('How vexingly quick daft zebras jump!'), ('Bright vixens jump; dozy fowl quack.'), ('Sphinx of black quartz, judge my vow.');
INSERT 0 6
Finally, a little
UPDATE command will conveniently populate the tokens column with their respective vector for each document:
UPDATE documents d1 SET document_tokens = to_tsvector(d1.document_text) FROM documents d2;
Let's Phrase Search the Documents
Now that we got everything in place, let's go back to our example of "jumping quickly" with a search:
SELECT document_id, document_text FROM documents WHERE document_tokens @@ to_tsquery('jump & quick');
As we mentioned,
to_tsquery with the
AND operator doesn't make any distinction in regards to the location of words in the documents:
document_id | document_text -------------+--------------------------------------- 3 | The five boxing wizards jump quickly. 4 | How vexingly quick daft zebras jump! (2 rows)
Let's try it now with the proximity operator
SELECT document_id, document_text FROM documents WHERE document_tokens @@ to_tsquery('jump <-> quick');
That does change things:
document_id | document_text -------------+--------------------------------------- 3 | The five boxing wizards jump quickly. (1 row)
So you can now find words next to each other, but can you find words "close" to each other even if one doesn't come immediately after the other? In fact, the dash
- in the proximity operator
<-> is a placeholder for the amount of proximity you're searching for. Let's give some examples:
Let's search for "sphinx" and "quartz" next to each other (
SELECT * FROM documents WHERE document_tokens @@ to_tsquery('sphinx <-> quartz');
document_id | document_text -------------+-------------- (0 rows)
Let's increase the proximity between "sphinx" and "quartz" to two words apart (
SELECT * FROM documents WHERE document_tokens @@ to_tsquery('sphinx <2> quartz');
document_id | document_text -------------+--------------------------------------- 2 | Jackdaws love my big sphinx of quartz. (1 row)
And three words apart (
SELECT * FROM documents WHERE document_tokens @@ to_tsquery('sphinx <3> quartz');
document_id | document_text -------------+-------------------------------------- 6 | Sphinx of black quartz, judge my vow. (1 row)
A word of caution when performing proximity search. Unlike text-search where
(jump & quick) and
(quick & jump) would yield the same results, phrase search is not symmetric! That is, searching for
(jump <-> quick) is not the same as searching for
(quick <-> jump) as the PostgreSQL engine will consider the order in which you're placing the words, so be careful.
And just so you know,
<-> is really syntactic sugar for the
tsquery_phrase() function; so
to_tsquery('sphinx <3> quartz') is equivalent to
tsquery_phrase('sphinx', 'quartz', 3)
Creating and Storing the
tsvector Data Type
Whether you search for full-text or phrases, you'll need a
tsvector for the document you intend to search. As we've just seen, storing the vectors provides the convenience of having them available for whenever we need them, plus we don't pay the cost of vectorizing every time.
To create a
tsvector you can either use the
to_tsvector() function with an input text, and storing the result into a table with the
INSERT INTO command, just like this:
INSERT INTO documents (document_text, document_tokens) VALUES ('Pack my box with five dozen liquor jugs.', to_tsvector('Pack my box with five dozen liquor jugs.'));
You can also take the input text from one table to store the resulting
tsvector into another table, or into the same one with the
UPDATE command; just as we did in the "Let's Create the Documents" section of this article.
A final option is to create the vector on-the-fly, which is fine for small applications or for testing things out, but the CPU usage will add-up if you perform thousands of these operations. Here's how you do it, also with the
SELECT document_id, document_text FROM documents WHERE to_tsvector(document_text) @@ to_tsquery('jump & quick');
... Even More Features!
This is an introductory article and we can't cover in detail every aspect of full-text search; still, here's a quick overview of the ways in which you can still expand the PostgreSQL's possibilities in this area.
Natural language processing deals with the complexities of human language and is an entire field of study on its own. One of such complexities is that words that would normally have different meanings in most contexts, may be synonyms in a very specific field, and vice-versa.
Actually, consider the word "field", which could easily be replaced by "territory", "ground", and some others when talking about men performing some sort of operations on land. However, "box" could be a more appropriate synonym when talking about an input source, or "area" or "realm" would work better when talking about "field" as a domain of knowledge. And, even all of that would be invalid when talking about about physics!
To better deal with these intricacies, PostgreSQL allows us to build text-search dictionaries so that you can map certain words to field-specific synonyms for your app with
CREATE TEXT SEARCH DICTIONARY. It is a laborious process that would require a deep understanding of the field-specific vocabulary, but if your application actually requires it, PostgreSQL can do it.
Of course, it wouldn't be PostgreSQL if it didn't allow for fine-tuned configurations. Regarding text search, you can configure certain behaviors of the text parser when creating a
tsvector, the dictionary files to use, the template for synonyms, the so called "stop words" , even the integration with Ispell to handle spelling and typographical errors, and more.
Have a look at the
TEXT SEARCH CONFIGURATION instructions and see some configuration examples from the PostgreSQL documentation.
Weights and Rankings
How about if you needed to rank certain terms, phrases, and documents higher (or lower) than others? Say, you might like to somehow penalize duplication of content with a lower rank (and therefore, a lower position in the results set); while at the same time promoting original content or documents you deem more important for certain search terms.
Well... it's PostgreSQL we're talkin' about, so of course you can do it! The technical details for doing this, however, are a bit out of the scope for this article, but you can check the PostgreSQL documentation for controlling text search (where the functions of interest are
ts_rank()), and read Tim van der Linden's fine piece about these features.
PostgreSQL full-text search capabilities are super powerful, but as with nearly anything, you'll need to take into account a couple of things.
Be careful when working with heavily internationalized applications, because even though PostgreSQL supports multiple languages, you'll need to make sure that the settings for the languages in question (dictionaries, templates) are available. You can run
\dF in the PostgreSQL prompt:
For a standard 9.6 installation:
Schema | Name | Description ------------+------------+--------------------------------------- pg_catalog | danish | configuration for danish language pg_catalog | dutch | configuration for dutch language pg_catalog | english | configuration for english language pg_catalog | finnish | configuration for finnish language pg_catalog | french | configuration for french language pg_catalog | german | configuration for german language pg_catalog | hungarian | configuration for hungarian language pg_catalog | italian | configuration for italian language pg_catalog | norwegian | configuration for norwegian language pg_catalog | portuguese | configuration for portuguese language pg_catalog | romanian | configuration for romanian language pg_catalog | russian | configuration for russian language pg_catalog | simple | simple configuration pg_catalog | spanish | configuration for spanish language pg_catalog | swedish | configuration for swedish language pg_catalog | turkish | configuration for turkish language (16 rows)
Notice that while many Indo-European languages are available, such as English, German, Spanish, and Russian, there are some remarkable misses out of this family group, such as Chinese and Japanese.
The reason why full-text search works really fast is because of the
tsvector data type, which works as an index for the document's context. That being said, the cost of the operation is generating this index, which is something you would normally need to do only once (unless the document gets updated).
A good practice, therefore, is to store the vectors alongside with the documents, just as we did in our phrase search example. This way, you can profit from the speedup and flexibility of the
tsquery pair, while paying the small cost of generating and storing the document tokens.
Also, for setting your own dictionaries and templates, you'll need access to the
postgres.conf file, which isn't always granted.
The full-text and phrase search features in PostgreSQL are very powerful and fast. And while setting a fine-tuned search engine will take some work, you go to keep in mind that this is a fairly advanced feature we're discussing, that not long ago it used to take a whole team of programmers and an extensive codebase. PostgreSQL already did the heavy lifting for you and, comparatively, you only need to tweak minor aspects to adapt it tightly to your needs.
And even without tweaking, you can still use
tsquery out-of-the-box to very decent results for relatively simple applications, with a lot more flexibility than you would have by searching patterns with
LIKE, and —in many cases— with simpler, cleaner code.
Should you want to build the next Google, you'll certainly need more than PostgreSQL, for a very extensive array of apps, now you have a very powerful and easy way to implement full-text and phrase search solutions.
This article has been published as part of Compose's Write Stuff program - if you want to write about databases or share your experiences with database technology, we invite you to participate.
attribution Lysander Yuen