Long story short, since PostgreSQL 14 to_tsquery('pg_class') becomes 'pg' <-> 'class' instead of 'pg' & 'class' (commit 0c4f355c6a). That is for instance, in PostgreSQL 13 and earlier to_tsquery('pg_class') matches to_tsvector('a class of pg'). But since PostgreSQL 14 it doesn’t match, but still matches to_tsvector('pg_class') and to_tsvector('pg*class'). This is incompatible change, which affects fts users, but we have to do this in order to fix phrase search design problems.

The story started with a bug when to_tsvector('pg_class pg') didn’t match to websearch_to_tsquery('"pg_class pg"').

1
2
3
4
5
# select to_tsvector('pg_class pg') @@
         websearch_to_tsquery('"pg_class pg"');
 ?column?
----------
 f

Looks strange! Naturally, when you search for some text in quotes, you expect it to match at least the exact same text in the document. But it doesn’t. My first idea was that it’s just bug of websearch_to_tsquery() function, but to_tsquery() appears to have the same problem: to_tsquery('pg_class <-> pg') doesn’t match to to_tsvector('pg_class pg') as well.

1
2
3
4
5
# select to_tsvector('pg_class pg') @@
         to_tsquery('pg_class <-> pg');
 ?column?
----------
 f

I was surprised that although phrase search arrived many years ago, basic things don’t work.

Looking under the hood, both websearch_to_tsquery('"pg_class pg"') and to_tsquery('pg_class <-> pg') compiles into ( 'pg' & 'class' ) <-> 'pg'.

1
2
3
4
5
# select websearch_to_tsquery('"pg_class pg"'),
         to_tsquery('pg_class <-> pg');
    websearch_to_tsquery     |         to_tsquery
-----------------------------+-----------------------------
 ( 'pg' & 'class' ) <-> 'pg' | ( 'pg' & 'class' ) <-> 'pg'

This tsquery expects both pg and class to be one position left from another pg. That means both pg and class need to reside in the same position. In principle, that’s possible, for instance, when a single word is split into two synonyms by fulltext dictionary. But that’s not our case. When we parse pg_class pg text, each word gets position sequentially. No two of them reside in the same position.

1
2
3
4
5
# select to_tsvector('pg_class pg');
    to_tsvector
--------------------
 'class':2 'pg':1,3
(1 row)

Why does tsquery parsing work this way? Historically, in PostgreSQL fulltext search to_tsquery('pg_class') compiles into 'pg' & 'class'. Therefore, pg and class don’t have to appear together. Before phrase search, that was the only way to process this query as soon as we split pg_class into pg and class. Thus, querying compound words was a bit relaxed. But now, when combined with phrase search, it becomes unreasonably strict.

My original intention was to choose the way to compile pg_class depending on the context. With phrase search operator nearby pg_class should become 'pg' <-> 'class', but be 'pg' & 'class' in the rest of cases. But this way required invasive refactoring of tsquery processing, taking more time than I could to spend on this bug.

Fortunately, Tom Lane came with a proposal to always compile pg_class into 'pg' <-> 'class'. Thus, now both websearch_to_tsquery('"pg_class pg"') and to_tsquery('pg_class <-> pg') compile into 'pg' <-> 'class' <-> 'pg'. And both of them match to to_tsvector('pg_class pg'). That is a win!

1
2
3
4
5
6
7
8
9
10
11
# select websearch_to_tsquery('"pg_class pg"'),
         to_tsquery('pg_class <-> pg');
   websearch_to_tsquery            to_tsquery
───────────────────────────┼───────────────────────────
 'pg' <-> 'class' <-> 'pg'  'pg' <-> 'class' <-> 'pg'

# select to_tsvector('pg_class pg') @@ websearch_to_tsquery('"pg_class pg"'),
         to_tsvector('pg_class pg') @@ to_tsquery('pg_class <-> pg');
 ?column?  ?column?
──────────┼──────────
 t         t

This approach would make all queries involving compound words more strict. But at first, this appears the only easy way to fix this design bug. Secondly, this is probably a better way to handle compound words themselves.

And AFAICS, this approach seems to be the right way. Thanks to it, yet another phrase search bug appears to be quite easy to fix.

Happy phrase searching in PostgreSQL 14! Hopefully, we would further manage without incompatible changes :)

Comments