Full Text Search Made (Almost) Right in PostgreSQL 11
Long story short, using PostgreSQL 11 and RUM index
you can do both TOP-N query and COUNT(*) query for non-selective FTS queries without
fetching all the matching results from heap (and that is certainly much faster).
If you’re interested in details, then please read the detailed description below.
At November 1st 2017, Tom Lane committed a patch
enabling bitmap scans to behave like index-only scan when possible.
In particular, since PostgreSQL 11 COUNT(*) queries can be evaluated using
bitmap scans without accessing heap, when corresponding bit in visibility map
Author: Tom Lane <firstname.lastname@example.org>
Date: Wed Nov 1 17:38:12 2017 -0400
Allow bitmap scans to operate as index-only scans when possible.
If we don't have to return any columns from heap tuples, and there's
no need to recheck qual conditions, and the heap page is all-visible,
then we can skip fetching the heap page altogether.
Skip prefetching pages too, when possible, on the assumption that the
recheck flag will remain the same from one page to the next. While that
assumption is hardly bulletproof, it seems like a good bet most of the
time, and better than prefetching pages we don't need.
This commit installs the executor infrastructure, but doesn't change
any planner cost estimates, thus possibly causing bitmap scans to
not be chosen in cases where this change renders them the best choice.
I (tgl) am not entirely convinced that we need to account for this
behavior in the planner, because I think typically the bitmap scan would
get chosen anyway if it's the best bet. In any case the submitted patch
took way too many shortcuts, resulting in too many clearly-bad choices,
to be committable.
Alexander Kuzmenkov, reviewed by Alexey Chernyshov, and whacked around
rather heavily by me.
This commit might seem to be just one of planner and executor optimizations,
nice but doesn’t deserve much attention. However, under detailed consideration
this patch appears to be significant improvement on the way of making fulltext
search in PostgreSQL to be done in the right way.
I’ve started working on FTS improvements in 2012. That time I realized that GIN
index is good for selective FTS queries, when the number of matching results is
low. See an example below: GIN did the great work by returning just few
dozens of matching rows very fast. The rest of operations including rank
calculation and sorting are also fast, because they are made over very small
But the situation is different when FTS query is not selective, and number of
matching rows is high. Then we have fetch all the matching rows from the heap,
rank them, and then sort. And despite we only need TOP-10 of matching rows,
this query takes a lot of time. See the example below: “Tom Lane” is for sure
non-selective query over PostgreSQL mailing lists.
How this situation can be improved? If we would get results from index already
pre-ordered by rank, then we would be able to evaluate TOP-N query
without fetching the whole set of matching rows from heap. Unfortunately,
that appears to be impossible for GIN index, which doesn’t contain enough of
information to calculate the rank. However, once we have additional information
about terms positions in the index, it might work. That would be enough for
raking fulltext results inside the index.
Fortunately, we have
extensible index access methods
since PostgreSQL 9.6. And that enables us to implement things, which wasn’t
accepted to GIN and even more, as a separate index access method
RUM. Using RUM, one can execute TOP-N
FTS query much faster without fetching all the matching rows from the heap.
However, the problem persists if you need to get the total count of matching
rows. Then PostgreSQL executor still have to fetch all the matching rows from
the heap in order to check their visibility. So, if you need to get a total
number of matching rows for pagination, then it might be still very slow.
For sure, there are some modern UI’s techniques, which doesn’t require to
show full number of results to user (continuous scrolling, for instance).
Alternatively, one can use a planner estimation instead of true number of
matching rows. But nevertheless, slow counting of total results number was
still a problem for many of RUM users.
Situation was improved after index-only count was committed.
Naturally, PostgreSQL maintains visibility map for heap pages. When
visibility bit is set for particular heap page, then all the tuples on that page
are visible; so, visibility check can be skipped. Using visibility map, index-
only scan node is working. But, if index-only scan can skip fetching heap, then
why can’t count do the same? Now it can!
Therefore, when you run non-selective fulltext query in PostgreSQL 11 with
RUM, then neither TOP-N query neither COUNT(*) query fetches the whole set of
matching rows from the heap. That’s cool, and I would say that fulltext search
is made almost right in this case. It’s “almost”, because it would be nice
to get both TOP-N and COUNT(*) in one SQL query and single index scan. And that
is a room for improvement in future versions.