RuntimeAppend in Pg_pathman: Achievements and New Challenges

Dealing with partitioned tables we can’t always select relevant partitions during query planning. Naturally, during query planning you can’t know values which come from subquery or outer part of nested loop join. Nevertheless, it would be ridiculous to scan all the partitions in such cases.

This is why my Postgres Professional colleague Dmitry Ivanov developed a new custom executor node for pg_pathman: RuntimeAppend. This node behaves like regular Append node: it contains set of children Nodes which should be appended. However, RuntimeAppend have one distinction: each run it selects only relevant children to append basing on parameter values.

Let’s consider example: join of journal table which contains row per each 30 seconds of year partitioned by day, and q table which refers 1000 random rows of journal table. Without RuntimeAppend optimizer selects Hash Join plan.

Regular Append: Hash Join

# EXPLAIN ANALYZE SELECT * FROM q JOIN journal j ON q.dt = j.dt;
                                                          QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=27.50..25442.51 rows=1000 width=56) (actual time=0.479..252.506 rows=1000 loops=1)
   Hash Cond: (j.dt = q.dt)
   ->  Append  (cost=0.00..21463.01 rows=1051201 width=49) (actual time=0.005..152.258 rows=1051201 loops=1)
         ->  Seq Scan on journal_1 j  (cost=0.00..58.80 rows=2880 width=49) (actual time=0.004..0.247 rows=2880 loops=1)
         ->  Seq Scan on journal_2 j_1  (cost=0.00..58.80 rows=2880 width=49) (actual time=0.001..0.208 rows=2880 loops=1)
         ->  Seq Scan on journal_3 j_2  (cost=0.00..58.80 rows=2880 width=49) (actual time=0.001..0.197 rows=2880 loops=1)
...............................................................................................................................
         ->  Seq Scan on journal_366 j_365  (cost=0.00..1.01 rows=1 width=49) (actual time=0.001..0.001 rows=1 loops=1)
   ->  Hash  (cost=15.00..15.00 rows=1000 width=8) (actual time=0.185..0.185 rows=1000 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 48kB
         ->  Seq Scan on q  (cost=0.00..15.00 rows=1000 width=8) (actual time=0.003..0.074 rows=1000 loops=1)
 Planning time: 29.262 ms
 Execution time: 256.337 ms
(374 rows)

The Hash Join execution takes 256 milliseconds for execution and 29 milliseconds for planning. Relatively high planning time is expected because all the partitions are present in plan. It’s surprising that optimizer didn’t select Nested Loop join. Let’s force it to do so by enable_hashjoin = off and enable_mergejoin = off.

Regular Append: Nested Loop

# EXPLAIN ANALYZE SELECT * FROM q JOIN journal j ON q.dt = j.dt;
                                                                      QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.28..170817.00 rows=1000 width=56) (actual time=1.091..452.658 rows=1000 loops=1)
   ->  Seq Scan on q  (cost=0.00..15.00 rows=1000 width=8) (actual time=0.006..0.158 rows=1000 loops=1)
   ->  Append  (cost=0.28..167.14 rows=366 width=49) (actual time=0.218..0.438 rows=1 loops=1000)
         ->  Index Scan using journal_1_dt_idx on journal_1 j  (cost=0.28..0.46 rows=1 width=49) (actual time=0.001..0.001 rows=0 loops=1000)
               Index Cond: (dt = q.dt)
         ->  Index Scan using journal_2_dt_idx on journal_2 j_1  (cost=0.28..0.46 rows=1 width=49) (actual time=0.001..0.001 rows=0 loops=1000)
               Index Cond: (dt = q.dt)
         ->  Index Scan using journal_3_dt_idx on journal_3 j_2  (cost=0.28..0.46 rows=1 width=49) (actual time=0.001..0.001 rows=0 loops=1000)
               Index Cond: (dt = q.dt)
......................................................................................................................................................
         ->  Index Scan using journal_366_dt_idx on journal_366 j_365  (cost=0.12..0.15 rows=1 width=49) (actual time=0.001..0.001 rows=0 loops=1000)
               Index Cond: (dt = q.dt)
 Planning time: 29.922 ms
 Execution time: 456.140 ms
(737 rows)

The Nested Loop join takes 456 milliseconds to execute. This is even worse. But this is understandable because we have to scan each partition of journal for each row of q.

Finally, let’s enable RuntimeAppend.

RuntimeAppend

# EXPLAIN ANALYZE SELECT * FROM q JOIN journal j ON q.dt = j.dt;
                                                                   QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.28..481.67 rows=1000 width=56) (actual time=0.041..9.911 rows=1000 loops=1)
   ->  Seq Scan on q  (cost=0.00..15.00 rows=1000 width=8) (actual time=0.005..0.079 rows=1000 loops=1)
   ->  Custom Scan (RuntimeAppend)  (cost=0.28..0.46 rows=1 width=49) (actual time=0.003..0.003 rows=1 loops=1000)
         ->  Index Scan using journal_330_dt_idx on journal_330 j  (cost=0.28..0.46 rows=1 width=49) (actual time=0.003..0.003 rows=1 loops=5)
               Index Cond: (dt = q.dt)
         ->  Index Scan using journal_121_dt_idx on journal_121 j  (cost=0.28..0.46 rows=1 width=49) (actual time=0.004..0.004 rows=1 loops=1)
               Index Cond: (dt = q.dt)
         ->  Index Scan using journal_37_dt_idx on journal_37 j  (cost=0.28..0.46 rows=1 width=49) (actual time=0.003..0.003 rows=1 loops=4)
               Index Cond: (dt = q.dt)
................................................................................................................................................
         ->  Index Scan using journal_355_dt_idx on journal_355 j  (cost=0.28..0.46 rows=1 width=49) (actual time=0.003..0.003 rows=1 loops=1)
               Index Cond: (dt = q.dt)
 Planning time: 30.775 ms
 Execution time: 8.615 ms
(687 rows)

The Nested Loop join with RuntimeAppend takes only about 9 milliseconds to execute! Such fast execution is possible thanks to RuntimeAppend scans only one relevant partition of journal for each row of q.

Nevertheless, all the partitions are present in plan and planning time is still quite high. This relatively high planning time could be not so significant for prepared statements or long OLAP queries.

However, long planning time appears to be not the only problem. We run a benchmark when RuntimeAppend node returns just a few rows in prepared statement. Despite high planning time doesn’t affect prepared statements, TPS was few time slower than it was without partitioning. After running perf, we got this flamegraph. This flamegraph shows that we spend very significant time for locking and unlocking every partition. Naturally, locking 365 partitions isn’t using fast-path locking and appears to be significant overhead.

Thus, we see how huge benefit could runtime partition selection have. However, in current design having all the partitions in plan cause high overhead. Solution could be found in redesigning partition locking. We are researching this problem now. It’s likely this problem can’t be solved in the boundaries of extension and proper solution requires hacking of PostgreSQL core.

Comments