Text search with PostgreSQL
Imagine a simple table with famous quotes:
CREATE TABLE quotes (
id SERIAL NOT NULL PRIMARY KEY,
value VARCHAR
);
The easiest way to search through them with PostgreSQL is by using LIKE
or ILIKE
(case-sensitive / case-insensitive) operators with wild card symbol:
SELECT * FROM quotes WHERE value ILIKE 'life%';
or
SELECT * FROM quotes WHERE value ILIKE '%life%';
For simple phrases and small data set it might be sufficient. Unfortunately for more complex searches and larger data, it might cause a few problems:
- not using indexes for wild card phrases (search needs to perform a full table scan)
- not supporting stemming (i.e. searching for rats will not include rat)
- not supporting multiple conditions (i.e. or / and); just another
LIKE
/ILIKE
condition is required
The first issue can be solved by using pg_trgm
extension, second and third ones by using ts_vectors
and ts_queries
.
pg_trgm
pg_trgm
extension provides functions for determining a similarity between texts based on trigrams. According to PostgreSQL documentation: “A trigram is a group of three consecutive characters taken from a string. We can measure the similarity of two strings by counting the number of trigrams they share. This simple idea turns out to be very effective for measuring the similarity of words in many natural languages.”.
In order to use it, extension needs to be enabled:
CREATE EXTENSION IF NOT EXISTS pg_trgm;
Then it can be used to create an index:
CREATE INDEX quotes_value_idx ON quotes USING GIN (value gin_trgm_ops);
Analyzing the query shows that for this particular search, it’s faster to perform a sequential scan than use index:
EXPLAIN ANALYZE
SELECT * FROM quotes WHERE value LIKE '%life%';
Seq Scan on quotes (cost=0.00..1.83 rows=1 width=36) (actual time=0.009..0.025 rows=7 loops=1)
Filter: (value ~~ '%life%'::text)
Rows Removed by Filter: 59
Planning Time: 0.222 ms
Execution Time: 0.036 ms
To enforce index usage, disable sequential scan (SET enable_seqscan=false;
) and rerun the query:
Bitmap Heap Scan on quotes (cost=12.00..16.02 rows=1 width=36) (actual time=0.020..0.024 rows=7 loops=1)
Recheck Cond: (value ~~ '%life%'::text)
Rows Removed by Index Recheck: 7
Heap Blocks: exact=1
-> Bitmap Index Scan on quotes_idx (cost=0.00..12.00 rows=1 width=0) (actual time=0.014..0.014 rows=14 loops=1)
Index Cond: (value ~~ '%life%'::text)
Planning Time: 0.110 ms
Execution Time: 0.045 ms
ts_vector and ts_query
ts_vector
creates a list of tokens:
SELECT to_tsvector('The quick brown fox jumped over the lazy dog.');
to_tsvector
-------------------------------------------------------
'brown':3 'dog':9 'fox':4 'jump':5 'lazi':8 'quick':2
ts_query
queries the vector of certain words or phrases:
SELECT to_tsvector('The quick brown fox jumped over the lazy dog') @@ to_tsquery('fox');
?column?
----------
t
ts_vector
can also be used to create an index:
CREATE INDEX quotes_tsvector_idx ON quotes USING GIN (to_tsvector('english', value));
Such index can be used later in search to improve the query performance:
EXPLAIN ANALYZE SELECT value FROM quotes
WHERE to_tsvector('english', value) @@ to_tsquery('life');
Bitmap Heap Scan on quotes (cost=8.25..12.77 rows=1 width=32) (actual time=0.017..0.018 rows=14 loops=1)
Recheck Cond: (to_tsvector('english'::regconfig, value) @@ to_tsquery('life'::text))
Heap Blocks: exact=1
-> Bitmap Index Scan on quotes_tsvector_idx (cost=0.00..8.25 rows=1 width=0) (actual time=0.013..0.013 rows=14 loops=1)
Index Cond: (to_tsvector('english'::regconfig, value) @@ to_tsquery('life'::text))
Planning Time: 0.161 ms
Execution Time: 0.036 ms
tsquery
also supports multiple conditions, i.e. it can be used to search for ‘people’ or ‘person’:
SELECT * FROM quotes
WHERE to_tsvector('english', value) @@ to_tsquery('people | person');
Bitmap Heap Scan on quotes (cost=12.26..16.77 rows=1 width=32) (actual time=0.023..0.024 rows=5 loops=1)
Recheck Cond: (to_tsvector('english'::regconfig, value) @@ to_tsquery('people | person'::text))
Heap Blocks: exact=1
-> Bitmap Index Scan on quotes_tsvector_idx (cost=0.00..12.25 rows=1 width=0) (actual time=0.020..0.020 rows=5 loops=1)
Index Cond: (to_tsvector('english'::regconfig, value) @@ to_tsquery('people | person'::text))
Planning Time: 0.122 ms
Execution Time: 0.047 ms
and performs better than multiple LIKE
conditions:
EXPLAIN ANALYZE
SELECT * FROM quotes WHERE value LIKE '%people%' OR value LIKE '%person%';
Bitmap Heap Scan on quotes (cost=40.00..44.02 rows=1 width=36) (actual time=0.027..0.029 rows=5 loops=1)
Recheck Cond: ((value ~~ '%people%'::text) OR (value ~~ '%person%'::text))
Heap Blocks: exact=1
-> BitmapOr (cost=40.00..40.00 rows=1 width=0) (actual time=0.022..0.022 rows=0 loops=1)
-> Bitmap Index Scan on quotes_idx (cost=0.00..20.00 rows=1 width=0) (actual time=0.014..0.014 rows=4 loops=1)
Index Cond: (value ~~ '%people%'::text)
-> Bitmap Index Scan on quotes_idx (cost=0.00..20.00 rows=1 width=0) (actual time=0.007..0.007 rows=1 loops=1)
Index Cond: (value ~~ '%person%'::text)
Planning Time: 0.146 ms
Execution Time: 0.053 ms
Full examples can be found on dbfidle.