Mastering PostgreSQL Tools: Full-Text Search and Phrase Search

Published

In his latest Compose Write Stuff article on Mastering PostgreSQL Tools, Lucero Del Alba writes about mastering full-text and phrase search in PostgreSQL 9.6.

Yes, PostgreSQL 9.6 has been finally rolled out on Compose, and with it, a whole set of features and improvements. In this installment of Mastering PostgreSQL Tools we'll see how PostgreSQL's full-text search features, and how they're expanded in version 9.6 for phrase search and word proximity.

The Good Ol' Text Search

You're probably familiar with pattern search, which has been part of the standard SQL since the beginning, and available to every single SQL-powered database:

SELECT column_name FROM table_name WHERE column_name LIKE 'pattern';  

That will return the rows where column_name matches the pattern. No surprises here.

You have wildcards such as % (as in LIKE 'a%' to search for columns that start with "a"), and _ (as in LIKE '_r%' to find any values that have an "r" in the second position); and in PostgreSQL you can also use ILIKE to ignore cases. This is simple enough and, hopefully, something you're fully familiar with.

But PostgreSQL is an RDBMS capable of much more than simply storing and retrieving data. Back in 2012, a set of text search functions and operators were introduced with version 9.2, which often go unnoticed, let's discuss them first and and see how they aid text search.

What Is Full-Text Search Anyway?

(...) full-text search refers to techniques for searching a single computer-stored document or a collection in a full text database; (...) distinguished from searches based on metadata or on parts of the original texts represented in databases (such as titles, abstracts, selected sections, or bibliographical references). Wikipedia

In other words, imagine you have a set of text documents stored in a database. These documents are not just meta-data items like an author name or a country of origin, but rather an abstract for an article, or full-text articles themselves, and you want to find out if certain words are present or not in them.

For example, you may want to search if the nouns "dog" and "fox" are present so if they are in their singular form, you'll find them with the LIKE keyword...

SELECT * FROM table_name  
WHERE  
    column_name LIKE '%fox%' AND
    column_name LIKE '%dog%';

... but you'll also find stuff like "foxtrot" or "Dogville", which is not quite what you intended.

Another problem is that if you search for a word such as "query", and if it's present in its plural form "queries", then you won't find it if you try a simple pattern search with LIKE, even though the word is, in fact, there. Some of you might be thinking to use regular expressions, and yes, you could do that, regular expressions are incredibly powerful, but also terribly slow.

A more effective way to approach this problem is by getting a semantic vector for all of the words contained in a document, that is, a language-specific representation of such words. So, when you search for a word like "jump", you will match all instances of the word and its tenses, even if you searched for "jumped" or "jumping". Additionally, you won't be searching the full document itself (which is slow), but the vector (which is fast).

That is, in a nutshell, the principle of full-text search. Let's see how it works.

Enter tsvector

PostgreSQL has two functions that do exactly what we intend to do:

For example, to create a vector for the sentence "the quick brown fox jumped over the lazy dog", we can do the following:

SELECT to_tsvector('The quick brown fox jumped over the lazy dog.');  

Which will return a vector where every token is a lexeme (unit of lexical meaning) with pointers (the positions in the document), and where words that carry little meaning, such as articles (the) and conjunctions (and, or) are conveniently omitted:

                      to_tsvector
-------------------------------------------------------
 'brown':3 'dog':9 'fox':4 'jump':5 'lazi':8 'quick':2

Here, by default, every word is normalized as a lexeme in English (e.g. "jumped" becomes "jump"). But beware, as this might not be the case depending on the localization settings of your PostgreSQL installation. Also, should you be working with a different language other than English, PostgreSQL can handle it if you pass it in as an argument:

SELECT to_tsvector('Portuguese', 'Zebras caolhas de Java querem mandar fax para gigante em New York');  

And that would return a vector normalized according to the rules of the Portuguese language:

                                    to_tsvector
------------------------------------------------------------------------------------
 'caolh':2 'fax':7 'gigant':9 'jav':4 'mand':6 'new':11 'quer':5 'york':12 'zebr':1

Notice how "gigante" gets normalized to "gigant", as in Portuguese this word can take many forms—gigante (male, singular), giganta (female, singular), gigantes (male and male/female plural), gigantas (female plural). This right here, just so you know, is natural language processing (NLP) in action.

As we'll see later, you can pass actual fields from a database table to to_tsvector(), just as you would do with any other SQL function.

Next thing in order to do full-text search, is querying the vector.

Enter tsquery

The next function that we're interested in, is to_tsquery(), which accepts a list of words that will be checked against the normalized vector we created with to_tsvector().

To do this, we'll use the @@ operator to check if tsquery matches tsvector. Let's try it with "fox":

SELECT to_tsvector('The quick brown fox jumped over the lazy dog')  
    @@ to_tsquery('fox');
 ?column?
----------
 t

That returned true (t). Now with "foxes"...

SELECT to_tsvector('The quick brown fox jumped over the lazy dog')  
    @@ to_tsquery('foxes');
 ?column?
----------
 t

That also returns "true" because "foxes" is the plural form of "fox". But how about "foxtrot"?

SELECT to_tsvector('The quick brown fox jumped over the lazy dog')  
    @@ to_tsquery('foxtrot');
 ?column?
----------
 f

That's false because the search is smart enough not to match anything that simply starts with fox unless it's related to the same semantics (meaning) of the text originally vectorized; which is good because a fox (the animal) has little to do with foxtrot (the dance)!

And finally, now with "jumping":

SELECT to_tsvector('The quick brown fox jumped over the lazy dog')  
    @@ to_tsquery('jumping');
 ?column?
----------
 t

Good! After all, "jumping" is the present continuous form for "to jump".

Operators and Uses

tsquery also provides a set of operators that we would expect in any decent query facility.

You have an AND operator (&):

SELECT to_tsvector('The quick brown fox jumped over the lazy dog')  
    @@ to_tsquery('fox & dog');

That returns true because both "fox" and "dog" are present in the document:

 ?column?
----------
 t

You have OR operator (|):

SELECT to_tsvector('The quick brown fox jumped over the lazy dog')  
    @@ to_tsquery('fox | clown');

Which returns also true because at least the word "fox" is present in the document:

 ?column?
----------
 t

A NEGATION operator (!) is also available. This is how we check if the word "clown" is absent:

SELECT to_tsvector('The quick brown fox jumped over the lazy dog')  
    @@ to_tsquery('!clown');

Which returns also "true":

 ?column?
----------
 t

And we can, of course, combine them all. Let's query "fox AND (dog OR clown) AND NOT queen" (parenthesis aren't really necessary in this particular case, but they help to understand the query):

SELECT to_tsvector('The quick brown fox jumped over the lazy dog')  
    @@ to_tsquery('fox & (dog | clown) & !queen');

All conditions met:

 ?column?
----------
 t

All of what we've seen so far turns PostgreSQL, effectively, into a full-text search engine. But wait, there's more!

The New and Improved Phrase Search

So far you can find if a word or a set of words are present in a document, but what if you wanted to search for a phrase?

Say you're searching for something like "jumping quickly" or "to jump very quickly", if you search for (jump & quick) with the techniques described so far you might find documents containing these words, but you'll get any random configuration within the document no matter if they are syntactically connected or not. To address this issue, PostgreSQL 9.6 introduced the <-> operator to further enhance tsquery by allowing you to search for word proximity or distance. Let's see how it works with some examples.

Let's Create the Documents

To make it simple let's say we have the following documents:

id | text                                     | tokens  
---|------------------------------------------|-----------------------------------------------------------
 1 | Pack my box with five dozen liquor jugs. | 'box':3 'dozen':6 'five':5 'jug':8 'liquor':7 'pack':1
 2 | Jackdaws love my big sphinx of quartz.   | 'big':4 'jackdaw':1 'love':2 'quartz':7 'sphinx':5
 3 | The five boxing wizards jump quickly.    | 'box':3 'five':2 'jump':5 'quick':6 'wizard':4
 4 | How vexingly quick daft zebras jump!     | 'daft':4 'jump':6 'quick':3 'vex':2 'zebra':5
 5 | Bright vixens jump; dozy fowl quack.     | 'bright':1 'dozi':4 'fowl':5 'jump':3 'quack':6 'vixen':2

So let's create a table for storing all of this (notice the tsvector data type for the document_tokens column):

CREATE TABLE documents  
(
    document_id SERIAL,
    document_text TEXT,
    document_tokens TSVECTOR,

    CONSTRAINT documents_pkey PRIMARY KEY (document_id)
)

Now let's insert the documents into it:

INSERT INTO documents (document_text) VALUES  
('Pack my box with five dozen liquor jugs.'),
('Jackdaws love my big sphinx of quartz.'),
('The five boxing wizards jump quickly.'),
('How vexingly quick daft zebras jump!'),
('Bright vixens jump; dozy fowl quack.'),
('Sphinx of black quartz, judge my vow.');
INSERT 0 6

Finally, a little UPDATE command will conveniently populate the tokens column with their respective vector for each document:

UPDATE documents d1  
SET document_tokens = to_tsvector(d1.document_text)  
FROM documents d2;  
UPDATE 6

Let's Phrase Search the Documents

Now that we got everything in place, let's go back to our example of "jumping quickly" with a search:

SELECT document_id, document_text FROM documents  
WHERE document_tokens @@ to_tsquery('jump & quick');  

As we mentioned, to_tsquery with the AND operator doesn't make any distinction in regards to the location of words in the documents:

 document_id |             document_text
-------------+---------------------------------------
           3 | The five boxing wizards jump quickly.
           4 | How vexingly quick daft zebras jump!
(2 rows)

Let's try it now with the proximity operator <->:

SELECT document_id, document_text FROM documents  
WHERE document_tokens @@ to_tsquery('jump <-> quick');  

That does change things:

 document_id |             document_text
-------------+---------------------------------------
           3 | The five boxing wizards jump quickly.
(1 row)

So you can now find words next to each other, but can you find words "close" to each other even if one doesn't come immediately after the other? In fact, the dash - in the proximity operator <-> is a placeholder for the amount of proximity you're searching for. Let's give some examples:

Let's search for "sphinx" and "quartz" next to each other (<->):

SELECT * FROM documents  
WHERE document_tokens @@ to_tsquery('sphinx <-> quartz');  

No results:

 document_id | document_text
-------------+--------------
(0 rows)

Let's increase the proximity between "sphinx" and "quartz" to two words apart (<2>):

SELECT * FROM documents  
WHERE document_tokens @@ to_tsquery('sphinx <2> quartz');  

Precisely:

 document_id |             document_text
-------------+---------------------------------------
           2 | Jackdaws love my big sphinx of quartz.
(1 row)

And three words apart (<3>):

SELECT * FROM documents  
WHERE document_tokens @@ to_tsquery('sphinx <3> quartz');  

As expected:

 document_id |             document_text
-------------+--------------------------------------
           6 | Sphinx of black quartz, judge my vow.
(1 row)

A word of caution when performing proximity search. Unlike text-search where (jump & quick) and (quick & jump) would yield the same results, phrase search is not symmetric! That is, searching for (jump <-> quick) is not the same as searching for (quick <-> jump) as the PostgreSQL engine will consider the order in which you're placing the words, so be careful.

And just so you know, <-> is really syntactic sugar for the tsquery_phrase() function; so to_tsquery('sphinx <3> quartz') is equivalent to tsquery_phrase('sphinx', 'quartz', 3)

Creating and Storing the tsvector Data Type

Whether you search for full-text or phrases, you'll need a tsvector for the document you intend to search. As we've just seen, storing the vectors provides the convenience of having them available for whenever we need them, plus we don't pay the cost of vectorizing every time.

To create a tsvector you can either use the to_tsvector() function with an input text, and storing the result into a table with the INSERT INTO command, just like this:

INSERT INTO documents (document_text, document_tokens)  
VALUES ('Pack my box with five dozen liquor jugs.',  
        to_tsvector('Pack my box with five dozen liquor jugs.'));

You can also take the input text from one table to store the resulting tsvector into another table, or into the same one with the UPDATE command; just as we did in the "Let's Create the Documents" section of this article.

A final option is to create the vector on-the-fly, which is fine for small applications or for testing things out, but the CPU usage will add-up if you perform thousands of these operations. Here's how you do it, also with the to_tsvector() function:

SELECT document_id, document_text FROM documents  
WHERE to_tsvector(document_text) @@ to_tsquery('jump & quick');  

... Even More Features!

This is an introductory article and we can't cover in detail every aspect of full-text search; still, here's a quick overview of the ways in which you can still expand the PostgreSQL's possibilities in this area.

Search Dictionaries

Natural language processing deals with the complexities of human language and is an entire field of study on its own. One of such complexities is that words that would normally have different meanings in most contexts, may be synonyms in a very specific field, and vice-versa.

Actually, consider the word "field", which could easily be replaced by "territory", "ground", and some others when talking about men performing some sort of operations on land. However, "box" could be a more appropriate synonym when talking about an input source, or "area" or "realm" would work better when talking about "field" as a domain of knowledge. And, even all of that would be invalid when talking about about physics!

To better deal with these intricacies, PostgreSQL allows us to build text-search dictionaries so that you can map certain words to field-specific synonyms for your app with CREATE TEXT SEARCH DICTIONARY. It is a laborious process that would require a deep understanding of the field-specific vocabulary, but if your application actually requires it, PostgreSQL can do it.

Search Configurations

Of course, it wouldn't be PostgreSQL if it didn't allow for fine-tuned configurations. Regarding text search, you can configure certain behaviors of the text parser when creating a tsvector, the dictionary files to use, the template for synonyms, the so called "stop words" , even the integration with Ispell to handle spelling and typographical errors, and more.

Have a look at the TEXT SEARCH CONFIGURATION instructions and see some configuration examples from the PostgreSQL documentation.

Weights and Rankings

How about if you needed to rank certain terms, phrases, and documents higher (or lower) than others? Say, you might like to somehow penalize duplication of content with a lower rank (and therefore, a lower position in the results set); while at the same time promoting original content or documents you deem more important for certain search terms.

Well... it's PostgreSQL we're talkin' about, so of course you can do it! The technical details for doing this, however, are a bit out of the scope for this article, but you can check the PostgreSQL documentation for controlling text search (where the functions of interest are setweight() and ts_rank()), and read Tim van der Linden's fine piece about these features.

Caveats

PostgreSQL full-text search capabilities are super powerful, but as with nearly anything, you'll need to take into account a couple of things.

Languages

Be careful when working with heavily internationalized applications, because even though PostgreSQL supports multiple languages, you'll need to make sure that the settings for the languages in question (dictionaries, templates) are available. You can run \dF in the PostgreSQL prompt:

postgres=# \dF

For a standard 9.6 installation:

   Schema   |    Name    |              Description
------------+------------+---------------------------------------
 pg_catalog | danish     | configuration for danish language
 pg_catalog | dutch      | configuration for dutch language
 pg_catalog | english    | configuration for english language
 pg_catalog | finnish    | configuration for finnish language
 pg_catalog | french     | configuration for french language
 pg_catalog | german     | configuration for german language
 pg_catalog | hungarian  | configuration for hungarian language
 pg_catalog | italian    | configuration for italian language
 pg_catalog | norwegian  | configuration for norwegian language
 pg_catalog | portuguese | configuration for portuguese language
 pg_catalog | romanian   | configuration for romanian language
 pg_catalog | russian    | configuration for russian language
 pg_catalog | simple     | simple configuration
 pg_catalog | spanish    | configuration for spanish language
 pg_catalog | swedish    | configuration for swedish language
 pg_catalog | turkish    | configuration for turkish language
(16 rows)

Notice that while many Indo-European languages are available, such as English, German, Spanish, and Russian, there are some remarkable misses out of this family group, such as Chinese and Japanese.

Performance Concerns

The reason why full-text search works really fast is because of the tsvector data type, which works as an index for the document's context. That being said, the cost of the operation is generating this index, which is something you would normally need to do only once (unless the document gets updated).

A good practice, therefore, is to store the vectors alongside with the documents, just as we did in our phrase search example. This way, you can profit from the speedup and flexibility of the tsvector/tsquery pair, while paying the small cost of generating and storing the document tokens.

Settings

Also, for setting your own dictionaries and templates, you'll need access to the postgres.conf file, which isn't always granted.

Wrap-Up

The full-text and phrase search features in PostgreSQL are very powerful and fast. And while setting a fine-tuned search engine will take some work, you go to keep in mind that this is a fairly advanced feature we're discussing, that not long ago it used to take a whole team of programmers and an extensive codebase. PostgreSQL already did the heavy lifting for you and, comparatively, you only need to tweak minor aspects to adapt it tightly to your needs.

And even without tweaking, you can still use tsvector and tsquery out-of-the-box to very decent results for relatively simple applications, with a lot more flexibility than you would have by searching patterns with LIKE, and —in many cases— with simpler, cleaner code.

Should you want to build the next Google, you'll certainly need more than PostgreSQL, for a very extensive array of apps, now you have a very powerful and easy way to implement full-text and phrase search solutions.


This article has been published as part of Compose's Write Stuff program - if you want to write about databases or share your experiences with database technology, we invite you to participate.

Lucero dances, [plays music](http://luzdealba.bandcamp.com/), writes about random topics, leads projects to varying and doubtful degrees of success, and keeps trying to be real and he keeps failing.

attribution Lysander Yuen

This article is licensed with CC-BY-NC-SA 4.0 by Compose.

Conquer the Data Layer

Spend your time developing apps, not managing databases.