It’s not very widely known, but PostgreSQL gathers statistics for indexed expressions. See the following example.

CREATE TABLE test AS (SELECT random() x, random() y FROM generate_series(1,1000000));
ANALYZE test;

EXPLAIN ANALYZE SELECT * FROM test WHERE x + y < 0.01;
                                                QUERY PLAN
-----------------------------------------------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..20406.00 rows=333333 width=16) (actual time=1.671..113.693 rows=56 loops=1)
   Filter: ((x + y) < '0.01'::double precision)
   Rows Removed by Filter: 999944

We created a table with two columns, x and y, whose values are independently and uniformly distributed between 0 and 1. Despite the table being analyzed, the PostgreSQL optimizer estimates the selectivity of the x + y < 0.01 qual as 1/3. You can see that this estimate is not even close to reality: the query actually returned 56 rows instead of the estimated 333333. The estimate comes from the rough assumption that the < operator selects 1/3 of the rows unless something more precise is known. Of course, the planner could do better in this case. For instance, it could try to compute a histogram for x + y from the separate histograms for x and y. However, the PostgreSQL optimizer doesn’t perform such costly and complex computations for now.
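
At this point the planner only has the per-column statistics for x and y; nothing describes x + y as a whole. You can confirm that with a query like the following sketch against the standard pg_stats view:

-- Only per-column statistics exist so far: one row each for x and y, nothing for x + y.
SELECT tablename, attname, n_distinct
FROM pg_stats
WHERE tablename = 'test';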

The situation changes once we define an index on x + y.

CREATE INDEX test_idx ON test ((x + y));
ANALYZE test;

EXPLAIN ANALYZE SELECT * FROM test WHERE x + y < 0.01;
                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on test  (cost=13.39..1838.32 rows=641 width=16) (actual time=0.040..0.107 rows=56 loops=1)
   Recheck Cond: ((x + y) < '0.01'::double precision)
   Heap Blocks: exact=56
   ->  Bitmap Index Scan on test_idx  (cost=0.00..13.23 rows=641 width=0) (actual time=0.028..0.028 rows=56 loops=1)
         Index Cond: ((x + y) < '0.01'::double precision)

Besides the index being used for this query, the estimate of the number of rows selected by x + y < 0.01 is far more accurate. The estimate improved because PostgreSQL now gathers separate statistics for the x + y expression. You can check that by querying the system catalog.

SELECT * FROM pg_stats WHERE tablename = 'test_idx';
-[ RECORD 1 ]----------+--------------------------------------------------------------------------------------------------------------------------------------------
schemaname             | public
tablename              | test_idx
attname                | expr
inherited              | f
null_frac              | 0
avg_width              | 8
n_distinct             | -0.999863
most_common_vals       | {0.262215601745993,0.319712610449642,0.3959802063182,0.404356196057051,0.40578526025638,0.437070866115391,0.462984828744084,0.4651908758096
most_common_freqs      | {2e-06,2e-06,2e-06,2e-06,2e-06,2e-06,2e-06,2e-06,2e-06,2e-06,2e-06,2e-06,2e-06,2e-06,2e-06,2e-06,2e-06,2e-06,2e-06,2e-06,2e-06,2e-06,2e-06,
histogram_bounds       | {0.00104234321042895,0.0141074191778898,0.0200657406821847,0.0247588600032032,0.0284962640143931,0.0315022920258343,0.0346860070712864,0.03
correlation            | -0.00176553
most_common_elems      | NULL
most_common_elem_freqs | NULL
elem_count_histogram   | NULL

So, there is a histogram, most common values, and so on for the x + y expression, and that leads to a more accurate selectivity estimate for x + y < 0.01. However, the estimate is still off by an order of magnitude (641 rows estimated instead of 56). Could we improve that? Yes, PostgreSQL has a statistics-gathering target parameter which is tunable per column using the ALTER TABLE … SET STATISTICS … command. Using this command, you can tune the size of the statistics arrays.
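
For reference, this is what the documented per-column form looks like. It is just a sketch: column x and the target value 1000 are arbitrary choices here (the default default_statistics_target is 100).

-- Documented form: tune the statistics target for an ordinary column.
ALTER TABLE test ALTER COLUMN x SET STATISTICS 1000;
-- The larger statistics arrays are only collected on the next ANALYZE.
ANALYZE test;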

But, uhhhh, in our case we have no column; we have an indexed expression. That appears to be a problem, since there is no documented way to tune the statistics target for it…

Nevertheless, it turns out to be possible. There is a gotcha that lets advanced DBAs do it.

ALTER INDEX test_idx ALTER COLUMN expr SET STATISTICS 10000;
ANALYZE test;

EXPLAIN ANALYZE SELECT * FROM test WHERE x + y < 0.01;
                                                    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on test  (cost=4.96..258.61 rows=69 width=16) (actual time=0.022..0.074 rows=56 loops=1)
   Recheck Cond: ((x + y) < '0.01'::double precision)
   Heap Blocks: exact=56
   ->  Bitmap Index Scan on test_idx  (cost=0.00..4.94 rows=69 width=0) (actual time=0.014..0.014 rows=56 loops=1)
         Index Cond: ((x + y) < '0.01'::double precision)

That works. When statistics arrays of size 10000 are collected, the estimate becomes 69 rows. That’s only a 23% estimation error, which is more than good enough for query planning.
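
To verify that the larger statistics arrays were actually collected, you can check the histogram size in pg_stats. The query below is only a sketch; with a statistics target of 10000 the histogram can hold up to 10001 bounds.

-- Size of the histogram now stored for the x + y expression.
SELECT attname, array_length(histogram_bounds, 1) AS histogram_size
FROM pg_stats
WHERE tablename = 'test_idx';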

But… what the hell is ALTER INDEX ... SET STATISTICS ...?! There is nothing like it in the PostgreSQL documentation!

Let’s understand this situation step by step.

  1. ALTER INDEX and ALTER TABLE share the same bison rule.
  2. Cases where ALTER INDEX is not applicable are filtered at runtime.
  3. ALTER INDEX ... SET STATISTICS ... is not forbidden and works the same way as ALTER TABLE ... SET STATISTICS ... does.
  4. Indexed expressions are internally named as attributes: expr, expr1, expr2, and so on (see the sketch below).
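
You can see these internal attribute names, and the statistics target we just set, in pg_attribute. The following query is only a sketch:

-- Internal attribute names of the index: the single expression column of test_idx is named "expr",
-- and attstattarget now reflects the value set by ALTER INDEX ... SET STATISTICS.
SELECT attnum, attname, attstattarget
FROM pg_attribute
WHERE attrelid = 'test_idx'::regclass
ORDER BY attnum;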

There was a short discussion about this on the pgsql-hackers mailing list. The conclusion was that this behavior should be documented, but that hasn’t been done yet. I also think we should invent better syntax for this instead of relying on internal column names.
