It’s not very widely known, but PostgreSQL is gathering statistics for indexed expressions. See following example.
1 2 3 4 5 6 7 8 9
We created table with two columns
y whose values are independently and uniformly distributed from 0 to 1. Despite we analyze that table, PostgreSQL optimizer estimates selectivity of
x + y < 0.01 qual as 1/3. You can see that this estimation is not even close to reality: we actually selected 56 rows instead of 333333 rows estimated. This estimation comes from a rough assumption that
< operator selects 1/3 of rows unless something more precise is known. Of course, it could be possible for planner to do something better in this case. For instance, it could try to calculate histogram for
x + y from the separate histograms for
y. However, PostgreSQL optimizer doesn’t perform such costly and complex computations for now.
Situation changes once we define an index on
x + y.
1 2 3 4 5 6 7 8 9 10 11
Besides index get used for this query, there is way more accurate estimate for the number of rows selected by
x + y < 0.01. Estimation is improved because PostgreSQL is now gathering separate statistics for
x + y expression. You can check that by querying a system catalog.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
So, there are histogram, most common values and etc for
x + y expression, and that leads to more accurate selectivity estimation for
x + y < 0.01. However, there is still and 1 order of degree error (641 rows estimated instead of 56). Could we improve that? Yes, PostgreSQL have statistics-gathering target parameter which is tunable per column using ALTER TABLE … SET STATISTICS … command. Using this command, you may tune size of statistics arrays.
But, uhhhh, in our case we have no column, we have an indexed expression. That appears to be a problem since there is no documented way to tune statistic target for that…
Nevertheless, it appears to be possible. There is a gotcha which allows advanced DBAs to do that.
1 2 3 4 5 6 7 8 9 10 11
That works. When we collect statistic arrays of 10000 size, estimate becomes 69 rows. It’s only 23% estimation error which is more than good enough for query planning.
But… What the hell is
ALTER INDEX ... SET STATISTICS ...?! There is nothing like this in PostgreSQL documentation!
Let’s understand this situation step by step.
ALTER TABLEshare the same bison rule.
- Cases when
ALTER INDEXis not applicable are filtered runtime.
ALTER INDEX ... SET STATISTICS ...is not forbidden and works the same way as
ALTER TABLE ... SET STATISTICS ...does.
- Indexed expressions are internally named as attributes:
There was some short discussion about that in pgsql-hackers mailing lists. The conclusion was that this should be documented, but it’s not yet done. I also think that we should invent some better syntax for that instead of usage of internal column names.