Full Text Search Done (Almost) Right in PostgreSQL 11

Long story short, using PostgreSQL 11 with RUM index you can do both TOP-N query and COUNT(*) for non-selective FTS queries without fetching all the results from heap (that means much faster). Are you bored yet? If not, please read the detailed description below.

At November 1st 2017, Tome Lane committed a patch enabling bitmap scans to behave like index-only scan when possible. In particular, since PostgreSQL 11 COUNT(*) queries can be evaluated using bitmap scans without accessing heap when corresponding bit in visibility map is set. This patch was written by Alexander Kuzmenkov and reviewed by Alexey Chernyshov (sboth are my Postgres Pro colleagues), and it was heavily revised by Tom Lane.

This commit might seem to be just one of planner and executor optimizations, nice but doesn’t deserve much attention. However, under detailed consideration this patch appears to be significant improvement on the way of making full text search in PostgreSQL to be done the right way.

I’ve started working on FTS improvements in 2012. That time I realized that GIN index is good for selective FTS queries, when number of matching results is low. See the example below: GIN did great work for us by returning just few dozens of matching rows very fast. The rest operations including relevance calculation and sorting are also fast, because they are performed over very small row set.

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM pgmail
WHERE fts @@ plainto_tsquery('english', 'exclusion constraint')
ORDER BY ts_rank_cd(fts, plainto_tsquery('english', 'exclusion constraint')) DESC
LIMIT 10;
                                                               QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=144.26..144.28 rows=10 width=784) (actual time=320.142..320.149 rows=10 loops=1)
   Buffers: shared hit=7138 read=7794
   ->  Sort  (cost=144.26..144.32 rows=25 width=784) (actual time=320.141..320.147 rows=10 loops=1)
         Sort Key: (ts_rank_cd(fts, '''exclus'' & ''constraint'''::tsquery)) DESC
         Sort Method: top-N heapsort  Memory: 38kB
         Buffers: shared hit=7138 read=7794
         ->  Bitmap Heap Scan on pgmail  (cost=44.20..143.72 rows=25 width=784) (actual time=5.232..315.302 rows=3357 loops=1)
               Recheck Cond: (fts @@ '''exclus'' & ''constraint'''::tsquery)
               Heap Blocks: exact=2903
               Buffers: shared hit=7138 read=7794
               ->  Bitmap Index Scan on pgmail_fts_idx  (cost=0.00..44.19 rows=25 width=0) (actual time=3.689..3.689 rows=3357 loops=1)
                     Index Cond: (fts @@ '''exclus'' & ''constraint'''::tsquery)
                     Buffers: shared hit=11 read=23
 Planning time: 0.176 ms
 Execution time: 320.213 ms
(15 rows)

But situation is different if FTS query is not selective and number of matching rows is high. Then we have fetch all those rows from heap, calculate relevance for each of them and sort them. And despite we only need TOP-10 rows, this query takes a lot of time.

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM pgmail
WHERE fts @@ plainto_tsquery('english', 'Tom Lane')
ORDER BY ts_rank_cd(fts, plainto_tsquery('english', 'Tom Lane')) DESC
LIMIT 10;
                                                                 QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=144.26..144.28 rows=10 width=784) (actual time=18110.231..18110.236 rows=10 loops=1)
   Buffers: shared hit=1358323 read=399077
   ->  Sort  (cost=144.26..144.32 rows=25 width=784) (actual time=18110.229..18110.231 rows=10 loops=1)
         Sort Key: (ts_rank_cd(fts, '''tom'' & ''lane'''::tsquery)) DESC
         Sort Method: top-N heapsort  Memory: 44kB
         Buffers: shared hit=1358323 read=399077
         ->  Bitmap Heap Scan on pgmail  (cost=44.20..143.72 rows=25 width=784) (actual time=70.267..17895.628 rows=224568 loops=1)
               Recheck Cond: (fts @@ '''tom'' & ''lane'''::tsquery)
               Rows Removed by Index Recheck: 266782
               Heap Blocks: exact=39841 lossy=79307
               Buffers: shared hit=1358323 read=399077
               ->  Bitmap Index Scan on pgmail_fts_idx  (cost=0.00..44.19 rows=25 width=0) (actual time=63.914..63.914 rows=224568 loops=1)
                     Index Cond: (fts @@ '''tom'' & ''lane'''::tsquery)
                     Buffers: shared hit=41 read=102
 Planning time: 0.131 ms
 Execution time: 18110.376 ms
(16 rows)(15 rows)

How can we improve this situation? If we would get results from index pre-ordered by relevance, then we would be able to evaluate TOP-N query without fetching the whole set of matching rows from heap. Unfortunately, that appears to be impossible for GIN index which stores only facts of occurence of specifix terms in document. But if we have additional infromation about terms positions in the index, then it might work. That information would be enough to calculate relevance only basing on index information.

Thus, I’ve proposed proposed a set of patches to GIN index. Some improvements were committed including index compression and index search optimization. However, additional information storage for GIN index wasn’t committed, because it alters GIN index structure too much.

Fortunately, we have extensible index access methods in PostgreSQL 9.6. And that enables us to implement things, which wasn’t committed to GIN and more, as a separate index access method RUM. Using RUM, one can execute TOP-N FTS query much faster without fetching all the matching rows from heap.

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM pgmail
WHERE fts @@ plainto_tsquery('english', 'Tom Lane')
ORDER BY fts <=> plainto_tsquery('english', 'Tom Lane')
LIMIT 10;
                                                                QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=48.00..83.25 rows=10 width=1523) (actual time=242.974..248.366 rows=10 loops=1)
   Buffers: shared hit=809 read=25, temp read=187 written=552
   ->  Index Scan using pgmail_idx on pgmail  (cost=48.00..193885.14 rows=54984 width=1523) (actual time=242.972..248.358 rows=10 loops=1)
         Index Cond: (fts @@ '''tom'' & ''lane'''::tsquery)
         Order By: (fts <=> '''tom'' & ''lane'''::tsquery)
         Buffers: shared hit=809 read=25, temp read=187 written=552
 Planning time: 14.709 ms
 Execution time: 312.794 ms
(8 rows)

However, the problem persisted if you need to get total count of matching rows. Then PostgreSQL executor still have to fetch all the matching rows from the heap in order to check their visibility. So, if you need total number of resulting rows for pagination, then it’s still might be very slow.

EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(*) FROM pgmail
WHERE fts @@ plainto_tsquery('english', 'Tom Lane');
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=118931.46..118931.47 rows=1 width=8) (actual time=36263.708..36263.709 rows=1 loops=1)
   Buffers: shared hit=800692 read=348338
   ->  Bitmap Heap Scan on pgmail  (cost=530.19..118799.14 rows=52928 width=0) (actual time=74.724..36195.946 rows=224568 loops=1)
         Recheck Cond: (fts @@ '''tom'' & ''lane'''::tsquery)
         Rows Removed by Index Recheck: 266782
         Heap Blocks: exact=39841 lossy=79307
         Buffers: shared hit=800692 read=348338
         ->  Bitmap Index Scan on pgmail_fts_idx  (cost=0.00..516.96 rows=52928 width=0) (actual time=67.467..67.467 rows=224568 loops=1)
               Index Cond: (fts @@ '''tom'' & ''lane'''::tsquery)
               Buffers: shared hit=41 read=102
 Planning time: 0.210 ms
 Execution time: 36263.790 ms
(12 rows)

For sure, some modern UIs use techniques like continuous scrolling which doesn’t require to show full number of results to user. Also, one can use planner estimation for number of resulting rows which is typically matching the order of magnitude to actual number of resulting rows. But nevertheless, slow counting of total results number was a problem for many of RUM users.

EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(*) FROM pgmail
WHERE fts @@ plainto_tsquery('english', 'Tom Lane');
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=121794.28..121794.29 rows=1 width=8) (actual time=132.336..132.336 rows=1 loops=1)
   Buffers: shared hit=404
   ->  Bitmap Heap Scan on pgmail  (cost=558.13..121656.82 rows=54984 width=0) (actual time=83.676..116.889 rows=224568 loops=1)
         Recheck Cond: (fts @@ '''tom'' & ''lane'''::tsquery)
         Heap Blocks: exact=119148
         Buffers: shared hit=404
         ->  Bitmap Index Scan on pgmail_idx  (cost=0.00..544.38 rows=54984 width=0) (actual time=61.459..61.459 rows=224568 loops=1)
               Index Cond: (fts @@ '''tom'' & ''lane'''::tsquery)
               Buffers: shared hit=398
 Planning time: 0.183 ms
 Execution time: 133.885 ms
(11 rows)

Comments