Faceted search is very popular buzzword nowadays. In short, faceted search specialty is that its results are organized per category. Popular search engines are receiving special support of faceted search.

Let’s see what PostgreSQL can do in this field. At first, let’s formalize our task. For each category which have matching documents we want to obtain:

  • Total number of matching documents;
  • TOP N matching documents.

For sure, it’s possible to query such data using multiple per category SQL queries. But we’ll make it in a single SQL query. That also would be faster in majority of cases. The query below implements faceted search over PostgreSQL mailing lists archives using window functions and CTE. Usage of window function is essential while CTE was used for better query readability.

Faceted search SQL query
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
/*
 * Select all matching messages, calculate rank within list and total count
 * within list using window functions.
 */
WITH msg AS (
    SELECT
        message_id,
        subject,
        list,
        RANK() OVER (
            PARTITION BY list
            ORDER BY ts_rank_cd(body_tsvector,  plainto_tsquery('index bloat')), id
        ) rank,
        COUNT(*) OVER (PARTITION BY list) cnt
    FROM messages
    WHERE body_tsvector @@ plainto_tsquery('index bloat')
),
/* Aggregate messages and count per list into json. */
lst AS (
    SELECT
        list,
        jsonb_build_object(
            'count', cnt,
            'results', jsonb_agg(
                jsonb_build_object(
                    'message_id', message_id,
                    'subject', subject
        ))) AS data
    FROM msg
    WHERE rank <= 5
    GROUP by list, cnt
)
/* Aggregate per list data into single json */
SELECT  jsonb_object_agg(list, data)
FROM    lst;

The resulting JSON document contains total count of matching mailing list messages and TOP 5 relevant messages for each list.

Faceted search JSON result
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
{
  "pgsql-admin": {
    "count": 263,
    "results": [
      {"message_id": "CACjxUsMUWkY1Z2K2A6yVdF88GT3xcFw5ofWTR6r1zqLUYu0WzA@mail.gmail.com", "subject": "Re: Slow planning time"},
      {"message_id": "dcc563d11001041749w561874f7y6574fb42ab49f850@mail.gmail.com", "subject": "Re: Finetuning Autovacuum"},
      {"message_id": "AANLkTikWabMzCRCSWuNLuPizSSQX3YILgJrNZuzgp3yM@mail.gmail.com", "subject": "Re: blocking automatic vacuum"},
      {"message_id": "dcc563d10904011631l4058aabew12f3fe4895a072f3@mail.gmail.com", "subject": "Re: Vacuum Full"},
      {"message_id": "FE44E0D7EAD2ED4BB2165071DB8E328C03062D9D@egcrc-ex01.egcrc.org", "subject": "Re: postgres bogged down beyond tolerance"
      }
    ]
  },
/*................................................................................*/
  "pgsql-advocacy": {
    "count": 8,
    "results": [
      {"message_id": "Pine.LNX.4.33.0310291602220.22178-100000@css120.ihs.com", "subject": "Re: Press Release"},
      {"message_id": "20050502203626.GA29791@dcc.uchile.cl", "subject": "Re: [HACKERS] Increased company involvement"},
      {"message_id": "5d94f7afb26f56652e06ba0657573ef2@biglumber.com", "subject": "Search and archives still out of sync"},
      {"message_id": "Pine.GSO.4.64.0708010705100.13114@westnet.com", "subject": "Re: postgresql publication"},
      {"message_id": "20070801151739.GF6165@alvh.no-ip.org", "subject": "Re: postgresql publication"
      }
    ]
  }
}

In the plan of this query we can see that message_body_idx GIN index is scanned only once, and this is great.

Plan of faceted search SQL query
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
                                                                   QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=2369.50..2369.51 rows=1 width=114) (actual time=34.232..34.232 rows=1 loops=1)
   CTE msg
 ->  WindowAgg  (cost=2087.93..2354.30 rows=491 width=336) (actual time=30.925..33.087 rows=2486 loops=1)
       ->  WindowAgg  (cost=2087.93..2222.96 rows=491 width=336) (actual time=30.716..32.020 rows=2486 loops=1)
             ->  Sort  (cost=2087.93..2089.16 rows=491 width=336) (actual time=30.711..30.838 rows=2486 loops=1)
                   Sort Key: messages.list, (ts_rank_cd(messages.body_tsvector, plainto_tsquery('index bloat'::text))), messages.id
                   Sort Method: quicksort  Memory: 582kB
                   ->  Bitmap Heap Scan on messages  (cost=48.05..2065.98 rows=491 width=336) (actual time=3.037..24.345 rows=2486 loops=1)
                         Recheck Cond: (body_tsvector @@ plainto_tsquery('index bloat'::text))
                         Heap Blocks: exact=2044
                         ->  Bitmap Index Scan on message_body_idx  (cost=0.00..47.93 rows=491 width=0) (actual time=2.723..2.723 rows=2486 loo
                               Index Cond: (body_tsvector @@ plainto_tsquery('index bloat'::text))
   CTE lst
 ->  HashAggregate  (cost=12.69..13.69 rows=67 width=540) (actual time=34.090..34.133 rows=14 loops=1)
       Group Key: msg.list, msg.cnt
       ->  CTE Scan on msg  (cost=0.00..11.05 rows=164 width=540) (actual time=30.928..33.879 rows=68 loops=1)
             Filter: (rank <= 5)
             Rows Removed by Filter: 2418
   ->  CTE Scan on lst  (cost=0.00..1.34 rows=67 width=114) (actual time=34.092..34.140 rows=14 loops=1)
 Planning time: 0.380 ms
 Execution time: 34.357 ms

Thus, it appears that nothing prevents you from implementing trendy kinds of searches using old good SQL and powerful features of PostgreSQL including: fulltext search, JSON support, window functions etc.

Comments