Implementation of Full Text Search in PostgreSQL 13

4 years ago SETHA THAY 5383

INTRODUCTION

Full-Text Search (FTS) is a method used by search engines to find results in a database. It can be utilized to bootstrap search results on websites like e-commerces, newspapers, blogs, and more. It provides the capability to identify natural-language documents that satisfy a query, and optionally to sort them by relevance to the query.

Specifically, FTS obtains documents, which are database entities containing textual data, that don’t perfectly match the search criteria. This means that when a user searches for “cats and dogs “, an application backed by FTS is able to return results that contain words separately (just “cats” or “dogs”), contain the words in a different order (“dogs and cats”), or contain variants of the words (“cat” or “dog”). This gives applications an advantage in guessing what the user means and returning more relevant results faster.

Using FTS, you can build a more powerful text search engine without introducing extra dependencies on more advanced tools. In this document, we will use PostgreSQL 13 to store 10,000 rows of data to query using FTS and select only the best matches. As the final step, we will implement some performance improvements for full-text search queries.

Full-Text Search vs LIKE Operator

Textual search operators have existed in databases for years. PostgreSQL has ~, ~*, LIKE, and ILIKE operators for textual data types, but they lack many essential properties required by modern information systems:

There is no linguistic support. They cannot easily handle derived words, e.g., satisfies and satisfy.
They provide no ordering (ranking) of search results
They tend to be slow because there is no index support

Full-text indexing allows documents to be preprocessed and an index saved for later rapid searching. FTS supports the following features:

Fuzzy match
Ranking
Phrase search
Support for multiple languages

Environment

The implementation is conducted with the below environment:

Cloud Server(Ubuntu 20.04)
PostgreSQL 13
10,000 records

IMPLEMENTATION

Prerequisite

Before we begin to implement full-text search, we have to install the following tool for populate fake data: Fake2DB (https://github.com/emirozer/fake2db)

Generating Sample Data

In this step, we will use the tool Fake2DB to generate 10,000 rows of data using the below command:

fake2db --rows 10000 --db postgresql --user postgres --password pwd --host localhost --name postgresql111111 --custom name date country currency_code credit_card_full credit_card_provider

Using tsvector and tsquery

Keyword to_tsvector

PostgreSQL uses to_tsvector for creating a list of tokens (the tsvector data type, where ts stands for "text search")

For example, to create a vector for the sentence "the quick brown fox jumped over the lazy dog", we can do the following:

SELECT to_tsvector('The quick brown fox jumped over the lazy dog.');

This query will return a vector where every token is a lexeme (unit of lexical meaning) with pointers (the positions in the document), and where words that carry little meaning, such as articles (the) and conjunctions (and, or) are conveniently omitted

'brown':3 'dog':9 'fox':4 'jump':5 'lazi':8 'quick':2

Keyword to_tsquery

PostgreSQL uses to_tsquery for querying the vector for occurrences of certain words or phrases. To do this, we'll use the @@ operator to check if tsquery matches tsvector.

SELECT to_tsvector ('The quick brown fox jumped over the lazy dog') @@ to_tsquery('fox');
SELECT to_tsvector ('The quick brown fox jumped over the lazy dog') @@ to_tsquery('foxes');
SELECT to_tsvector ('The quick brown fox jumped over the lazy dog') @@ to_tsquery('jumping');

Example of using to to_tsvector and to_tsquery aginst sample data

SELECT name, date, country, currency_code FROM custom where to_tsvector(name) @@ to_tsquery('John')

Search Multiple Columns

You can full-text search on multiple columns using the below query:

SELECT name, date, country, currency_code, credit_card_full FROM custom where to_tsvector (name || ' ' || credit_card_full) @@ to_tsquery('John');

Adding Column to Store tsvector Keyword

if your application is a high read/search application, adding to_tsvector every time in your where clause is not a good design to follow. Instead, you can add a new column to store the token which is extracted by using the to_tsvector function. Below is the step to follow:

Adding a new column to the table

ALTER TABLE custom ADD COLUMN document tsvector;

Update newly created column with token extracted from to_tsvector function

UPDATE custom SET document = to_tsvector (name || ' ' || credit_card_full || ' ' || credit_card_provider);

Query

SELECT name, date, country, currency_code, credit_card_full FROM custom where document @@ to_tsquery('John');

Adding Column with Index

Ultimately, indexes help the database find rows faster by searching using special data structures and algorithms. PostgreSQL has several types of indexes that are suited to particular types of queries. The most relevant ones for this use case are the GiST indexes and GIN indexes. The main difference between them is how fast they can retrieve documents from the table. GIN is slower to build when adding new data, but faster to query; GiST builds faster but requires additional data reads. Because GiST is slower to retrieve data than GIN, we will use the GIN index.

Adding a new column to the table

ALTER TABLE custom ADD COLUMN document_idx tsvector;

Update newly created column with token extracted from to_tsvector function

UPDATE custom SET document_idx = to_tsvector (name || ' ' || credit_card_full || ' ' || credit_card_provider);

Creating an index on the created column

CREATE INDEX document_with_idx ON custom USING GIN (document_idx);

Query

SELECT name, date, country, currency_code, credit_card_full 
FROM custom where document_idx @@ to_tsquery('John');

RANK

Ranking attempts to measure how relevant documents are to a particular query, so that when there are many matches the most relevant ones can be shown first. The ts_rank function is used to get the relevancy score of the match. In simple words, the higher that rank the more relevant the match is. The value of rank varies from 0 to 1. The higher rank indicates a stronger match.

Example 1

select ts_rank (
  to_tsvector ('english', 'This department delivers the actual product to customers'), 
  to_tsquery ('english', 'product')
);

Example 2

select ts_rank (
  to_tsvector ('english', 'This department delivers the actual product to customers'), 
  to_tsquery ('english', 'product & customers')
);

Example 3

select ts_rank (
  to_tsvector ('english', 'This department delivers the actual product to customers'), 
  to_tsquery ('english', 'project')
);

THE HIGHER OF ts_rank VALUE THE MORE RELEVANT OF SEARCH QUERY TO THE DOCUMENT

Query

SELECT name, date, country, currency_code, credit_card_full 
FROM custom where document_idx @@ plainto_tsquery('John') 
ORDER BY ts_rank (document_idx, plainto_tsquery('John'));

WEIGHT

The function setweight can be used to label the entries of a tsvector with a given weight, where a weight is one of the letters A, B, C, or D. This is typically used to mark entries coming from different parts of a document, such a title versus body. Later, this information can be used for ranking search results.

Typically, weights are used to mark words from special areas of the document, like the title or an initial abstract, so they can be treated with more or less importance than words in the document body.

Adding a new column to the table

ALTER TABLE custom ADD COLUMN document_with_weight tsvector;

Update newly created column with token extracted from to_tsvector function with weight

UPDATE custom SET document_with_weight = setweight(to_tsvector(name), 'A') || 
 setweight(to_tsvector(credit_card_full), 'B') || setweight(to_tsvector(credit_card_provider),'C');

Creating an index on the created column

CREATE INDEX document_weight_idx ON custom USING GIN (document_with_weight);

Query

SELECT name, date, country, currency_code, credit_card_full 
FROM custom where document_with_weight @@ plainto_tsquery('John') 
ORDER BY ts_rank (document_with_weight, plainto_tsquery('John')) DESC;

Query with a ranking score

SELECT name, date, country, currency_code, credit_card_full, ts_rank (document_with_weight, plainto_tsquery('John')) 
FROM custom where document_with_weight @@ plainto_tsquery('John') 
ORDER BY ts_rank (document_with_weight, plainto_tsquery('John')) DESC;

TRIGGER

When using a separate column to store the tsvector representation of your documents, it's necessary to create a trigger to update the tsvector column when the document content columns change. The trigger functions automatically compute a tsvector column from one or more textual columns. Having created this trigger, any change in column value will automatically be reflected tsvector value, without the application having to worry about it.

Creating a new function

CREATE FUNCTION custom_tsvector_trigger () RETURNS trigger AS $$
BEGIN
 
   new. document_with_weight = setweight(to_tsvector(new.name), 'A') || 
 setweight (to_tsvector (new. Credit_card_full), 'B') || setweight (to_tsvector (new. Credit_card_provider),'C');
 
 return new;
END
$$ LANGUAGE plpgsql;

Creating a new trigger to execute the function

CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE ON custom 
FOR EACH ROW EXECUTE PROCEDURE custom_tsvector_trigger ();

Insert new row to check the result

INSERT INTO custom (name, date, country, currency_code, credit_card_full, credit_card_provider) VALUES ('Sok Pisey','1989-03-01','Cambodia','KH','JCB 16 digit Amy Eaton 3158008305030451 07/24 CVC: 293','JCB 15 digit');

THANK YOU!!!

Find Us @
Facebook
Telegram
Twitter
LinkedIn

About author

SETHA THAY

Software Engineer & Project Manager. I am willing to share IT knowledge, technical experiences and investment to financial freedom. Feel free to ask and contact me.