PostgreSQL 14: Substantion Change to Fulltext Query Parsing
Long story short, since PostgreSQL 14 to_tsquery('pg_class')
becomes
'pg' <-> 'class'
instead of 'pg' & 'class'
(commit 0c4f355c6a). That is for instance,
in PostgreSQL 13 and earlier to_tsquery('pg_class')
matches
to_tsvector('a class of pg')
. But since PostgreSQL 14 it doesn’t match, but
still matches to_tsvector('pg_class')
and to_tsvector('pg*class')
.
This is incompatible change, which affects fts users, but we have to do this
in order to fix phrase search design problems.
The story started with a
bug
when to_tsvector('pg_class pg')
didn’t match to
websearch_to_tsquery('"pg_class pg"')
.
1 2 3 4 5 |
|
Looks strange! Naturally, when you search for some
text in quotes, you expect it to match at least the exact same text in the document.
But it doesn’t. My first idea was that it’s just bug of websearch_to_tsquery()
function, but to_tsquery()
appears to have the same problem:
to_tsquery('pg_class <-> pg')
doesn’t match to to_tsvector('pg_class pg')
as well.
1 2 3 4 5 |
|
I was surprised that although phrase search arrived many years ago, basic things don’t work.
Looking under the hood, both websearch_to_tsquery('"pg_class pg"')
and
to_tsquery('pg_class <-> pg')
compiles into ( 'pg' & 'class' ) <-> 'pg'
.
1 2 3 4 5 |
|
This tsquery expects both pg
and class
to be one position left from another
pg
. That means both pg
and class
need to reside in the same position.
In principle, that’s possible, for instance, when a single word is split into two
synonyms by fulltext dictionary. But that’s not our case. When we parse
pg_class pg
text, each word gets position sequentially. No two of them
reside in the same position.
1 2 3 4 5 |
|
Why does tsquery parsing work this way? Historically, in PostgreSQL fulltext search
to_tsquery('pg_class')
compiles into 'pg' & 'class'
. Therefore, pg
and
class
don’t have to appear together. Before phrase search, that was the
only way to process this query as soon as we split pg_class
into pg
and
class
. Thus, querying compound words was a bit relaxed. But now, when
combined with phrase search, it becomes unreasonably strict.
My original intention was to choose the way to compile pg_class
depending
on the context. With phrase search operator nearby pg_class
should become
'pg' <-> 'class'
, but be 'pg' & 'class'
in the rest of cases. But this
way required invasive refactoring of tsquery processing, taking more time than
I could to spend on this bug.
Fortunately, Tom Lane came with a proposal
to always compile pg_class
into 'pg' <-> 'class'
. Thus, now both
websearch_to_tsquery('"pg_class pg"')
and to_tsquery('pg_class <-> pg')
compile into 'pg' <-> 'class' <-> 'pg'
. And both of them match to
to_tsvector('pg_class pg')
. That is a win!
1 2 3 4 5 6 7 8 9 10 11 |
|
This approach would make all queries involving compound words more strict. But at first, this appears the only easy way to fix this design bug. Secondly, this is probably a better way to handle compound words themselves.
And AFAICS, this approach seems to be the right way. Thanks to it, yet another phrase search bug appears to be quite easy to fix.
Happy phrase searching in PostgreSQL 14! Hopefully, we would further manage without incompatible changes :)