Faceted search is very popular buzzword nowadays. In short, faceted search
specialty is that its results are organized per category. Popular search
engines are receiving special support of faceted search.
Let’s see what PostgreSQL can do in this field. At first, let’s formalize our
task. For each category which have matching documents we want to obtain:
Total number of matching documents;
TOP N matching documents.
For sure, it’s possible to query such data using multiple per category SQL
queries. But we’ll make it in a single SQL query. That also would be faster
in majority of cases. The query below implements faceted search over
PostgreSQL mailing lists archives using window functions and CTE. Usage
of window function is essential while CTE was used for better query readability.
/* * Select all matching messages, calculate rank within list and total count * within list using window functions. */WITHmsgAS(SELECTmessage_id,subject,list,RANK()OVER(PARTITIONBYlistORDERBYts_rank_cd(body_tsvector,plainto_tsquery('index bloat')),id)rank,COUNT(*)OVER(PARTITIONBYlist)cntFROMmessagesWHEREbody_tsvector@@plainto_tsquery('index bloat')),/* Aggregate messages and count per list into json. */lstAS(SELECTlist,jsonb_build_object('count',cnt,'results',jsonb_agg(jsonb_build_object('message_id',message_id,'subject',subject)))ASdataFROMmsgWHERErank<=5GROUPbylist,cnt)/* Aggregate per list data into single json */SELECTjsonb_object_agg(list,data)FROMlst;
The resulting JSON document contains total count of matching mailing list
messages and TOP 5 relevant messages for each list.
Faceted search JSON result
12345678910111213141516171819202122232425
{"pgsql-admin":{"count":263,"results":[{"message_id":"CACjxUsMUWkY1Z2K2A6yVdF88GT3xcFw5ofWTR6r1zqLUYu0WzA@mail.gmail.com","subject":"Re: Slow planning time"},{"message_id":"dcc563d11001041749w561874f7y6574fb42ab49f850@mail.gmail.com","subject":"Re: Finetuning Autovacuum"},{"message_id":"AANLkTikWabMzCRCSWuNLuPizSSQX3YILgJrNZuzgp3yM@mail.gmail.com","subject":"Re: blocking automatic vacuum"},{"message_id":"dcc563d10904011631l4058aabew12f3fe4895a072f3@mail.gmail.com","subject":"Re: Vacuum Full"},{"message_id":"FE44E0D7EAD2ED4BB2165071DB8E328C03062D9D@egcrc-ex01.egcrc.org","subject":"Re: postgres bogged down beyond tolerance"}]},/*................................................................................*/"pgsql-advocacy":{"count":8,"results":[{"message_id":"Pine.LNX.4.33.0310291602220.22178-100000@css120.ihs.com","subject":"Re: Press Release"},{"message_id":"20050502203626.GA29791@dcc.uchile.cl","subject":"Re: [HACKERS] Increased company involvement"},{"message_id":"5d94f7afb26f56652e06ba0657573ef2@biglumber.com","subject":"Search and archives still out of sync"},{"message_id":"Pine.GSO.4.64.0708010705100.13114@westnet.com","subject":"Re: postgresql publication"},{"message_id":"20070801151739.GF6165@alvh.no-ip.org","subject":"Re: postgresql publication"}]}}
In the plan of this query we can see that message_body_idx GIN index is
scanned only once, and this is great.
Thus, it appears that nothing prevents you from implementing trendy kinds of
searches using old good SQL and powerful features of PostgreSQL including:
fulltext search, JSON support, window functions etc.