Indexing for full text search in PostgreSQL

Do you have textual content in PostgreSQL that you'd like to be able to search more effectively, but you're not sure you're ready to implement a full-blown search application like Lucene, SOLR, or Elasticsearch? The full text search functionality in PostgreSQL is a great solution for you then! This article will give you an understanding of how PostgreSQL full text search indexing works under the hood and show you the steps involved in creating a full text index.

Intro to Postgres Full Text Search

PostgreSQL full text search is based on the Vector Space Model (VSM) of Information Retrieval. In VSM, documents and queries are each converted into vectors of terms and their characteristics (like frequency, proximity, uniqueness, etc...). Search matches and relevancy ranking are determined by comparing the document and query vectors. This article will focus on the conversion of documents into vectors and the indexing options in PostgreSQL that support document vectors.

A document is a flexible construct that can take advantage of the relational nature of PostgreSQL. What that means is that a document might be stored as one row in a table with multiple fields, a single field in a row, or various rows in various tables. When a document is converted to its vector form, the tsvector data type stores the entire document for search. Conversion of a document to vector form involves parsing the text into tokens, converting tokens to lexemes (using dictionaries), and optimized indexing for quickly performing query functions and retrieving relevant documents.

In this article we'll use the DVD rental sample database that contains some humorously fabricated film data as an example to explore search indexing concepts and options in PostgreSQL.

Now that you have an overview of the basics, let's see how each part works.

Tokenization

Tokenization is the process of splitting text into tokens. There are a variety of tokenizers used by the PostgreSQL default parser. One tokenizer splits text at whitespaces and punctuation so that each word or series of numbers is treated as token. Other tokenizers identify a URL or email address as a token, while still others will recognize xml tags. These are just a few examples of the built-in tokenizers.

Because all the tokenizers are in play during the parsing phase, a piece of text may be tokenized in multiple ways. For example, the text "all-inclusive" would get tokenized to three tokens by the default parser -- "all", "inclusive", and "all-inclusive" -- because there is a tokenizer that recognizes hyphenated words as being one token.

Let's look at a fun example. The "film" table in our DVD rental database contains a couple different fields that are part of the documents we want to be converted to vector form: title and description. We can run the testing utility ts_parse to see how text in these fields will be tokenized by the default parser. We can use one of two methods - either providing the text itself directly to ts_parse or by referencing the field. Note that the default parser is just called "default".

Method 1 - Providing the text directly to ts_parse (note: the typo is part of the data sample):

SELECT *  
FROM ts_parse('default',  
'A Awe-Inspiring Story of a Monkey And a Pastry Chef who must Succumb a Womanizer in A MySQL Convention')  
;

Method 2 - Referencing the field:

 -- this is the field and film used for the description text above
SELECT ts_parse('default', description)  
FROM film  
WHERE title = 'Intolerable Intentions'  
;

Using Method 1 will produce the results we see below (truncated with ". . . ." since you'll get the idea pretty quickly). Method 2 will provide the same data, but concatenates it with a comma (or semi-colon) into a single output field.

tokid | token  
----------------------
1     | A  
12    | " "  
16    | Awe-Inspiring  
11    | Awe  
12    | -  
11    | Inspiring  
12    | " "  
1     | Story  
12    | " "  
1     | of  
12    | " "  
1     | a  
12    | " "  
1     | Monkey  
. . . .

What we see here is the token ID for the type of token identified and each token in the text. Specifically, token ID 12 represents whitespaces and punctuation, token ID 1 represents words, token ID 16 represents hyphenated words, and token ID 11 represents words that are part of a hyphenated word.

Each token is then passed on to be "lexized" -- converted to lexemes using dictionaries.

Lexemes

Lexemes are created from tokens by applying a variety of dictionaries to handle word variants: stopwords, stemmers, synonyms, and other thesaural relationships.

Stopwords are a list of words that you do not care to index or search on. These are typically common words such as "a" or "the". Some custom lists include "forbidden" words as well.

A stemmer breaks a term down to its root form. For example, the term "stemmer" would be broken down to "stem". And because PostgreSQL uses the Porter Snowball stemming algorithm by default, the term can then be "snowballed" to multiple possible endings for matches on "stemmed" and "stemming", besides "stemmer".

In Compose PostgreSQL, you'll get stopwords and stemming automatically in your search configuration. However, because Compose does not permit access to the server file system (that way we can maintain high security for all our customers), we currently do not allow for custom dictionaries (including synonyms and thesauri), custom parsers or custom templates to be created. The default search configuration is quite robust, however, and should do well for typical text searches.

Let's take a couple examples from the DVD rental film "Intolerable Intentions" to see how lexemes are generated. We'll pass each of the title words to the ts_lexize function (which only takes individual tokens) as well as a known stopword so that you can see how lexizing works. Note that even though the configuration used here is "_stem", stopwords are also included by default.

Here's a stopword:

SELECT ts_lexize('english_stem', 'a');  

As you can see, we get a null lexeme returned because the stopwords list indicates that the term "a" should not be included:

{}

Let's now check the words in the film title, starting with "Intolerable":

SELECT ts_lexize('english_stem', 'Intolerable');  

The stemmed lexeme is returned:

{intoler}

And then for "Intentions":

SELECT ts_lexize('english_stem', 'Intentions');  

We get this lexeme:

{intent}

Note that the terms also get lower-cased besides being broken down to their root form for indexing.

Checking and Setting the Search Configuration

The search configuration specifies the parser and dictionaries to be used to convert documents to the tsvector format for indexing. When a query is made, it will also use the search configuration to convert it to vector form.

The default search configuration set by Compose for your deployment is pg_catalog.english. You can verify this by running SHOW for the search configuration setting on the database your documents live in:

SHOW default_text_search_config;  

If you have content in a language other than English, there are a variety of other languages to choose from. We can see the list of languages by running \dF with psql:

dbname=> \dF  

And we get back:

               List of text search configurations
   Schema   |    Name    |              Description
------------+------------+---------------------------------------
 pg_catalog | danish     | configuration for danish language
 pg_catalog | dutch      | configuration for dutch language
 pg_catalog | english    | configuration for english language
 pg_catalog | finnish    | configuration for finnish language
 pg_catalog | french     | configuration for french language
 pg_catalog | german     | configuration for german language
 pg_catalog | hungarian  | configuration for hungarian language
 pg_catalog | italian    | configuration for italian language
 pg_catalog | norwegian  | configuration for norwegian language
 pg_catalog | portuguese | configuration for portuguese language
 pg_catalog | romanian   | configuration for romanian language
 pg_catalog | russian    | configuration for russian language
 pg_catalog | simple     | simple configuration
 pg_catalog | spanish    | configuration for spanish language
 pg_catalog | swedish    | configuration for swedish language
 pg_catalog | turkish    | configuration for turkish language
(16 rows)

Because the search configuration can be set at different levels, if we have content in Spanish, we could do one of the following, depending on our use case:

If you have content in more than one language, Rachid Belaid's blog post on Postgres full text search has a great section on using different configurations for multi-lingual content and also demonstrates how to handle accented characters.

Aside from languages, as you can see from the configuration list above, there is also a "simple" option. "simple" is relatively language-agnostic (depends on the language, of course) since it does not use stopwords or stemming. It simply breaks up text into tokens at the whitespaces.

Let's go back to our "Intolerable Intentions" example and see how the entire search configuration works to transform the film description. We can do that using the ts_debug utility and just like ts_parse, we can use either of the two methods described:. Here we'll use method 1:

SELECT * FROM ts_debug('english',  
'A Awe-Inspiring Story of a Monkey And a Pastry Chef who must Succumb a Womanizer in A MySQL Convention')  
;

The complete results:

alias            | description                      | token          | dictionaries     | dictionary     | lexemes  
-----------------------------------------------------------------------------------------------------------------------
asciiword        | Word, all ASCII                  | A              | {english_stem}   | english_stem   | {}  
blank            | Space symbols                    |                | {}               |                |  
asciihword       | Hyphenated word, all ASCII       | Awe-Inspiring  | {english_stem}   | english_stem   | {awe-inspir}  
hword_asciipart  | Hyphenated word part, all ASCII  | Awe            | {english_stem}   | english_stem   | {awe}  
blank            | Space symbols                    |                | {}               |                |  
hword_asciipart  | Hyphenated word part, all ASCII  | Inspiring      | {english_stem}   | english_stem   | {inspir}  
blank            | Space symbols                    |                | {}               |                |  
asciiword        | Word, all ASCII                  | Story          | {english_stem}   | english_stem   | {stori}  
blank            | Space symbols                    |                | {}               |                |  
asciiword        | Word, all ASCII                  | of             | {english_stem}   | english_stem   | {}  
blank            | Space symbols                    |                | {}               |                |  
asciiword        | Word, all ASCII                  | a              | {english_stem}   | english_stem   | {}  
blank            | Space symbols                    |                | {}               |                |  
asciiword        | Word, all ASCII                  | Monkey         | {english_stem}   | english_stem   | {monkey}  
blank            | Space symbols                    |                | {}               |                |  
asciiword        | Word, all ASCII                  | And            | {english_stem}   | english_stem   | {}  
blank            | Space symbols                    |                | {}               |                |  
asciiword        | Word, all ASCII                  | a              | {english_stem}   | english_stem   | {}  
blank            | Space symbols                    |                | {}               |                |  
asciiword        | Word, all ASCII                  | Pastry         | {english_stem}   | english_stem   | {pastri}  
blank            | Space symbols                    |                | {}               |                |  
asciiword        | Word, all ASCII                  | Chef           | {english_stem}   | english_stem   | {chef}  
blank            | Space symbols                    |                | {}               |                |  
asciiword        | Word, all ASCII                  | who            | {english_stem}   | english_stem   | {}  
blank            | Space symbols                    |                | {}               |                |  
asciiword        | Word, all ASCII                  | must           | {english_stem}   | english_stem   | {must}  
blank            | Space symbols                    |                | {}               |                |  
asciiword        | Word, all ASCII                  | Succumb        | {english_stem}   | english_stem   | {succumb}  
blank            | Space symbols                    |                | {}               |                |  
asciiword        | Word, all ASCII                  | a              | {english_stem}   | english_stem   | {}  
blank            | Space symbols                    |                | {}               |                |  
asciiword        | Word, all ASCII                  | Womanizer      | {english_stem}   | english_stem   | {woman}  
blank            | Space symbols                    |                | {}               |                |  
asciiword        | Word, all ASCII                  | in             | {english_stem}   | english_stem   | {}  
blank            | Space symbols                    |                | {}               |                |  
asciiword        | Word, all ASCII                  | A              | {english_stem}   | english_stem   | {}  
blank            | Space symbols                    |                | {}               |                |  
asciiword        | Word, all ASCII                  | MySQL          | {english_stem}   | english_stem   | {mysql}  
blank            | Space symbols                    |                | {}               |                |  
asciiword        | Word, all ASCII                  | Convention     | {english_stem}   | english_stem   | {convent}  

What we see here is a row for each token specifying the alias of the tokenizer that was applied, the description of the tokenizer, the token identified, the dictionaries applied, the specific dictionary that produced the lexeme, and the final lexeme that will be used for the vector.

So, now that we know what happens under the hood based on the search configuration, let's convert a document to a vector.

Creating a Document Vector

Here's where the function to_tsvector comes into play. This function will take the document text provided and convert it into a vector that will be stored in the tsvector data type. Let's bring back our "Intolerable Intentions" example and run to_tsvector on the document. In this case, our document will just be the description field and the output we'll get, regardless of the method we use, is only a single field so we'll use Method 2:

SELECT to_tsvector('english', description)  
FROM film  
WHERE title = 'Intolerable Intentions'  
;

Here's our document vector for full text search on the description field:

'awe':3 'awe-inspir':2 'chef':12 'convent':21 'inspir':4 'monkey':8 'must':14 'mysql':20 'pastri':11 'stori':5 'succumb':15 'woman':17  

The vector consists of each valid lexeme (the null ones are removed) and the word position in the original text passed to the function.

But what if our document consists of multiple fields (which may or may not live in the same table)? In that case, we can just concatenate the fields together using the "||" operator. In our case, both our fields live in the "film" table, but if they didn't, we could just add a JOIN to our statement and preface the field names with their table names. Here's what our statement looks like to include both the title and the description in our document vector:

SELECT to_tsvector('english', title || description)  
FROM film  
WHERE title = 'Intolerable Intentions'  
;

The document vector we get back now also includes the lexemes from the title and you can see the positions of the description lexemes have shifted since the title was input to the vector first:

'awe':4 'awe-inspir':3 'chef':13 'convent':22 'inspir':5 'intentionsa':2 'intoler':1 'monkey':9 'must':15 'mysql':21 'pastri':12 'stori':6 'succumb':16 'woman':18  

But, what if there is another film that mentions "intolerable intentions" in its description? We might want words in the title to count more toward relevancy ranking than words in the description. To do that, we'll also add field weights to indicate that the title field is more important than the description field. We do that using the setweight function with letters from the latin alphabet in descending order of importance. Note, too, that if you are concerned about possible NULL values in your fields, you can also add a COALESCE as we're showing here:

SELECT setweight(to_tsvector('english', COALESCE(title,'')), 'A') ||  
     setweight(to_tsvector('english', COALESCE(description,'')), 'B')
FROM film  
WHERE title = 'Intolerable Intentions'  
;

Above we're performing the concatenation of the fields after running to_tsvector and setweight since we can't run setweight from within to_tsvector. Our vector now looks like this with the field weights applied:

'awe':5B 'awe-inspir':4B 'chef':14B 'convent':23B 'inspir':6B 'intent':2A 'intoler':1A 'monkey':10B 'must':16B 'mysql':22B 'pastri':13B 'stori':7B 'succumb':17B 'woman':19B  

Something to note here is that because we're concatenating our document fields after running to_tsvector and setweight on them, we could potentially use different search configurations on each piece. For example, we could use the "simple" configuration on the title and the "english" configuration on the description. Here's what that would look like:

SELECT setweight(to_tsvector('simple', COALESCE(title,'')), 'A') ||  
     setweight(to_tsvector('english', COALESCE(description,'')), 'B')
FROM film  
WHERE title = 'Intolerable Intentions'  
;

Our vector is now changed:

'awe':5B 'awe-inspir':4B 'chef':14B 'convent':23B 'inspir':6B 'intentions':2A 'intolerable':1A 'monkey':10B 'must':16B 'mysql':22B 'pastri':13B 'stori':7B 'succumb':17B 'woman':19B  

Because the "simple" configuration does not apply stopwords or stemming and instead maintains tokens exactly as they are (except for lower-casing), we now have "intolerable" and "intentions" in the vector instead of the root forms "intoler" and "intent". This can be a useful approach if there is a portion of your document for which exact matches are likely, such as with people's names.

Storing and Populating Document Vectors

Now that we've explored some options in terms of how our document vectors can be created and discovered the one we like best (we like the one that includes both title and description, with the "english" configuration for both, with field weights applied, and using coalesce), we're ready to convert our film documents to vectors. We'll first create a field in our film table to hold the document vectors using the tsvector datatype:

ALTER TABLE film  
ADD COLUMN weighted_tsv tsvector  
;

We can now update the new column "weighted_tsv" with document vectors:

UPDATE film SET  
    weighted_tsv = x.weighted_tsv
FROM (  
    SELECT film_id,
           setweight(to_tsvector('english', COALESCE(title,'')), 'A') ||
           setweight(to_tsvector('english', COALESCE(description,'')), 'B')
           AS weighted_tsv
     FROM film
) AS x
WHERE x.film_id = film.film_id;  

Now, we'll also want to create a trigger that will automatically create the document vector for any new film that gets added or replace the vector if an existing film gets updated. PostgreSQL comes with a built-in function that could be used in such a trigger (tsvector_update_trigger), but unfortunately, it does not support individual processing of fields, like we're doing to set field weights. For that reason, we'll have to create our own function. The PostgreSQL documentation recognizes this limitation and provides a PL/pgSQL example, which we'll adapt for our purposes here:

CREATE FUNCTION film_weighted_tsv_trigger() RETURNS trigger AS $$  
begin  
  new.weighted_tsv :=
     setweight(to_tsvector('english', COALESCE(new.title,'')), 'A') ||
     setweight(to_tsvector('english', COALESCE(new.description,'')), 'B');
  return new;
end  
$$ LANGUAGE plpgsql;

And now we can create our trigger using our new film_weighted_tsv_trigger() function:

CREATE TRIGGER upd_tsvector BEFORE INSERT OR UPDATE  
ON film  
FOR EACH ROW EXECUTE PROCEDURE film_weighted_tsv_trigger();  

We're now ready for indexing!

Indexing

PostgreSQL provides two index types that can be used to index tsvector data types: GIN and GiST indexes. They each have their pros and cons. Determining which one is best for your document collection will be situational.

GIN (which stands for Generalized Inverted Index) is recommended by PostgreSQL as the default index type for full text searching and specifically for documents which contain a lot of text. That's because the inverted index facilitates rapid matching and retrieval. There are a couple drawbacks, however. The creation of the inverted index involves more substantial document processing up front so index creation will take a bit more time. Also, GIN indexes do not store the field weights so each matching row returned has to be re-checked for a query which applies the field weights.

GiST (which stands for Generalized Search Tree) uses a more traditional database-style index that is tree-like in nature (hence the name) and automatically re-balances itself for keeping retrieval as speedy as it can be. Even though GiST retrieval speed is typically going to be a bit slower than GIN do to the way the index is structured, indexing with GiST happens more quickly than GIN so if having new documents readily-available in search is a key requirement, then GiST will be the best option. The main issue with GiST indexes is that they are "lossy", meaning the document is represented in the index with a fixed-length signature and therefore has the potential for a hash collision. To put it another way, GiST indexing can result in false positive matches if two different terms are hashed to the same bit position in the index. While this is typically unlikely, it is still possible (especially in documents with a large amount of text). For that reason, the returned rows have to be re-checked to verify the match is valid to the query.

For our example, because we want to be able to use field weights efficiently, because our documents are very short, and because we want each newly-entered DVD in the database to be readily-available to search, we're going to opt to use GiST. Here's how we'll create the index:

CREATE INDEX weighted_tsv_idx ON film USING GIST (weighted_tsv);  

And now we are indexed for full text search!

Wrapping Up

We hope this article has given you a solid understanding of full text search indexing concepts and the options available to you in Compose PostgreSQL. If you want to dive in deeper, you can read the official documentation here. In a future article, we'll have a look at search ranking and querying full text fields in PostgreSQL.