RuntimeAppend in Pg_pathman: Achievements and New Challenges
Dealing with partitioned tables we can’t always select relevant partitions
during query planning. Naturally, during query planning you can’t know values
which come from subquery or outer part of nested loop join. Nevertheless, it
would be ridiculous to scan all the partitions in such cases.
This is why my Postgres Professional colleague Dmitry Ivanov developed a
new custom executor node for pg_pathman: RuntimeAppend. This node behaves
like regular Append node: it contains set of children Nodes which should be
appended. However, RuntimeAppend have one distinction: each run it selects
only relevant children to append basing on parameter values.
Let’s consider example: join of journal table which contains row per each
30 seconds of year partitioned by day, and q table which refers 1000 random
rows of journal table. Without RuntimeAppend optimizer selects Hash Join
plan.
The Hash Join execution takes 256 milliseconds for execution and 29 milliseconds
for planning. Relatively high planning time is expected because all the
partitions are present in plan. It’s surprising that optimizer didn’t select
Nested Loop join. Let’s force it to do so by enable_hashjoin = off and
enable_mergejoin = off.
The Nested Loop join takes 456 milliseconds to execute. This is even worse.
But this is understandable because we have to scan each partition of journal
for each row of q.
The Nested Loop join with RuntimeAppend takes only about 9 milliseconds
to execute! Such fast execution is possible thanks to RuntimeAppend scans only
one relevant partition of journal for each row of q.
Nevertheless, all the partitions are present in plan and planning time is still
quite high. This relatively high planning time could be not so significant
for prepared statements or long OLAP queries.
However, long planning time appears to be not the only problem. We run a
benchmark when RuntimeAppend node returns just a few rows in prepared statement.
Despite high planning time doesn’t affect prepared statements, TPS was few
time slower than it was without partitioning. After running perf, we got this
flamegraph. This flamegraph shows that
we spend very significant time for locking and unlocking every partition.
Naturally, locking 365 partitions isn’t using fast-path locking and appears to
be significant overhead.
Thus, we see how huge benefit could runtime partition selection have. However,
in current design having all the partitions in plan cause high overhead.
Solution could be found in redesigning partition locking. We are researching
this problem now. It’s likely this problem can’t be solved in the boundaries
of extension and proper solution requires hacking of PostgreSQL core.