Full Text Search Done (Almost) Right in PostgreSQL 11
Long story short, using PostgreSQL 11 with RUM index
you can do both TOP-N query and COUNT(*) for non-selective FTS queries without
fetching all the results from heap (that means much faster). Are you bored yet?
If not, please read the detailed description below.
At November 1st 2017, Tome Lane committed a patch
enabling bitmap scans to behave like index-only scan when possible.
In particular, since PostgreSQL 11 COUNT(*) queries can be evaluated using
bitmap scans without accessing heap when corresponding bit in visibility map
is set. This patch was written by Alexander Kuzmenkov and reviewed by
Alexey Chernyshov (sboth are my Postgres Pro colleagues), and it was heavily
revised by Tom Lane.
This commit might seem to be just one of planner and executor optimizations,
nice but doesn’t deserve much attention. However, under detailed consideration
this patch appears to be significant improvement on the way of making full text
search in PostgreSQL to be done the right way.
I’ve started working on FTS improvements in 2012. That time I realized that GIN
index is good for selective FTS queries, when number of matching results is low.
See the example below: GIN did great work for us by returning just few dozens of
matching rows very fast. The rest operations including relevance calculation
and sorting are also fast, because they are performed over very small row set.
But situation is different if FTS query is not selective and number of matching
rows is high. Then we have fetch all those rows from heap, calculate relevance
for each of them and sort them. And despite we only need TOP-10 rows, this
query takes a lot of time.
How can we improve this situation? If we would get results from index
pre-ordered by relevance, then we would be able to evaluate TOP-N query
without fetching the whole set of matching rows from heap. Unfortunately,
that appears to be impossible for GIN index which stores only facts of occurence
of specifix terms in document. But if we have additional infromation
about terms positions in the index, then it might work. That information
would be enough to calculate relevance only basing on index information.
Thus, I’ve proposed proposed
a set of patches to GIN index. Some improvements were committed including
index compression and index search optimization. However, additional information storage for GIN
index wasn’t committed, because it alters GIN index structure too much.
Fortunately, we have
extensible index access methods
in PostgreSQL 9.6. And that enables us to implement things, which wasn’t
committed to GIN and more, as a separate index access method
RUM. Using RUM, one can execute TOP-N
FTS query much faster without fetching all the matching rows from heap.
However, the problem persisted if you need to get total count of matching rows.
Then PostgreSQL executor still have to fetch all the matching rows from the
heap in order to check their visibility. So, if you need total number of
resulting rows for pagination, then it’s still might be very slow.
For sure, some modern UIs use techniques like continuous scrolling which doesn’t
require to show full number of results to user. Also, one can use planner
estimation for number of resulting rows which is typically matching the order
of magnitude to actual number of resulting rows. But nevertheless, slow counting
of total results number was a problem for many of RUM users.