This is one of the most common questions asked by developers who write SQL queries against PostgreSQL databases. There are multiple ways in which a sub-select or lookup can be framed in a SQL statement. The PostgreSQL optimizer is very smart at optimizing queries, and many queries can be rewritten/transformed internally for better performance.
Let’s discuss the topic with an example, using the schema created by pgbench.
Note: For those not familiar with pgbench, it is a micro-benchmarking tool shipped with PostgreSQL. A sample pgbench schema can be initialized with some data as follows:
```shell
pgbench -i -s 10
```
For this example, I have updated the branch balance of a couple of branches:
```sql
UPDATE pgbench_branches SET bbalance = 4500000 WHERE bid IN (4, 7);
```
The SQL challenge for this example is: find the number of accounts per branch in pgbench_accounts, for those branches whose branch-level balance is greater than zero. Per the ANSI SQL standard, this query can be written in four different ways.
```sql
SELECT count(aid),bid FROM pgbench_accounts WHERE
bid IN (SELECT bid FROM pgbench_branches WHERE bbalance > 0)
GROUP BY bid;
```
```sql
SELECT count(aid),bid FROM pgbench_accounts WHERE
bid = ANY(SELECT bid FROM pgbench_branches WHERE bbalance > 0)
GROUP BY bid;
```
```sql
SELECT count(aid),bid
FROM pgbench_accounts WHERE EXISTS
(SELECT bid FROM pgbench_branches WHERE bbalance > 0
AND pgbench_accounts.bid = pgbench_branches.bid)
GROUP BY bid;
```
```sql
SELECT count(aid),a.bid
FROM pgbench_accounts a
JOIN pgbench_branches b ON a.bid = b.bid
WHERE b.bbalance > 0
GROUP BY a.bid;
```
While writing the query, one might assume that EXISTS and INNER JOIN would perform better because they can use all the logic and optimizations for joining two tables, while the IN and ANY clauses need to deal with subqueries. However, PostgreSQL (at least PG 10 and above) is smart enough to produce the same execution plan for all four options!
All of the above queries generate the same execution plan:
```
HashAggregate  (cost=31132.65..31132.75 rows=10 width=12) (actual time=279.625..279.626 rows=2 loops=1)
  Group Key: a.bid
  ->  Hash Join  (cost=1.15..30132.65 rows=200000 width=8) (actual time=63.686..242.956 rows=200000 loops=1)
        Hash Cond: (a.bid = b.bid)
        ->  Seq Scan on pgbench_accounts a  (cost=0.00..26394.00 rows=1000000 width=8) (actual time=0.012..86.250 rows=1000000 loops=1)
        ->  Hash  (cost=1.12..1.12 rows=2 width=4) (actual time=0.016..0.016 rows=2 loops=1)
              Buckets: 1024  Batches: 1  Memory Usage: 9kB
              ->  Seq Scan on pgbench_branches b  (cost=0.00..1.12 rows=2 width=4) (actual time=0.010..0.012 rows=2 loops=1)
                    Filter: (bbalance > 0)
                    Rows Removed by Filter: 8
Planning Time: 0.257 ms
Execution Time: 279.703 ms
(12 rows)
```
Note: Parallel execution was suppressed here for better readability and a simpler execution plan. Even with parallel execution enabled, all the queries produce the same plan.
So can we conclude that we can write the query however we are comfortable, and PostgreSQL’s intelligence will take care of the rest? Wait! Things can go differently if we consider the exclusion scenario.
The SQL challenge becomes: find the number of accounts per branch in pgbench_accounts, EXCEPT for those branches whose branch-level balance is greater than zero.
The four ways to write the query become:
```sql
SELECT count(aid),bid FROM pgbench_accounts WHERE
bid NOT IN (SELECT bid FROM pgbench_branches WHERE bbalance > 0)
GROUP BY bid;
```
```sql
SELECT count(aid),bid FROM pgbench_accounts WHERE
bid <> ALL(SELECT bid FROM pgbench_branches WHERE bbalance > 0)
GROUP BY bid;
```
```sql
SELECT count(aid),bid
FROM pgbench_accounts WHERE NOT EXISTS
(SELECT bid FROM pgbench_branches WHERE bbalance > 0
AND pgbench_accounts.bid = pgbench_branches.bid)
GROUP BY bid;
```
```sql
SELECT count(aid),a.bid
FROM pgbench_accounts a
LEFT JOIN pgbench_branches b ON a.bid = b.bid AND b.bbalance > 0
WHERE b.bid IS NULL
GROUP BY a.bid;
```
The “NOT IN” and “<> ALL” versions produce execution plans with subqueries (SubPlan). They are, respectively:
```
HashAggregate  (cost=31395.13..31395.23 rows=10 width=12) (actual time=395.297..395.299 rows=8 loops=1)
  Group Key: pgbench_accounts.bid
  ->  Seq Scan on pgbench_accounts  (cost=1.13..28895.13 rows=500000 width=8) (actual time=0.042..250.086 rows=800000 loops=1)
        Filter: (NOT (hashed SubPlan 1))
        Rows Removed by Filter: 200000
        SubPlan 1
          ->  Seq Scan on pgbench_branches  (cost=0.00..1.12 rows=2 width=4) (actual time=0.010..0.012 rows=2 loops=1)
                Filter: (bbalance > 0)
                Rows Removed by Filter: 8
Planning Time: 0.197 ms
Execution Time: 395.363 ms
(11 rows)
```
and
```
HashAggregate  (cost=601394.00..601394.10 rows=10 width=12) (actual time=731.987..731.989 rows=8 loops=1)
  Group Key: pgbench_accounts.bid
  ->  Seq Scan on pgbench_accounts  (cost=0.00..598894.00 rows=500000 width=8) (actual time=0.041..579.264 rows=800000 loops=1)
        Filter: (SubPlan 1)
        Rows Removed by Filter: 200000
        SubPlan 1
          ->  Materialize  (cost=0.00..1.14 rows=2 width=4) (actual time=0.000..0.000 rows=2 loops=1000000)
                ->  Seq Scan on pgbench_branches  (cost=0.00..1.12 rows=2 width=4) (actual time=0.010..0.012 rows=2 loops=1)
                      Filter: (bbalance > 0)
                      Rows Removed by Filter: 8
Planning Time: 0.203 ms
Execution Time: 732.142 ms
(12 rows)
```
NOT EXISTS and LEFT JOIN, on the other hand, produce the same execution plan, without a sub-plan:
```
HashAggregate  (cost=41245.15..41245.25 rows=10 width=12) (actual time=500.193..500.195 rows=8 loops=1)
  Group Key: a.bid
  ->  Hash Anti Join  (cost=1.15..37245.15 rows=800000 width=8) (actual time=0.041..344.845 rows=800000 loops=1)
        Hash Cond: (a.bid = b.bid)
        ->  Seq Scan on pgbench_accounts a  (cost=0.00..26394.00 rows=1000000 width=8) (actual time=0.013..110.645 rows=1000000 loops=1)
        ->  Hash  (cost=1.12..1.12 rows=2 width=4) (actual time=0.018..0.018 rows=2 loops=1)
              Buckets: 1024  Batches: 1  Memory Usage: 9kB
              ->  Seq Scan on pgbench_branches b  (cost=0.00..1.12 rows=2 width=4) (actual time=0.011..0.012 rows=2 loops=1)
                    Filter: (bbalance > 0)
                    Rows Removed by Filter: 8
Planning Time: 0.248 ms
Execution Time: 500.266 ms
(12 rows)
```
A direct hash (anti) join between the tables is the smartest way to answer this query, which is a strong reason for recommending the EXISTS or JOIN syntax. So far, the general rule of thumb favoring EXISTS/JOINs holds good.
But wait! Do we see a better execution time with the NOT IN clause, even though it uses a sub-plan? Yes. PostgreSQL has done an excellent optimization here, building a hash from the sub-plan result: NOT (hashed SubPlan 1). So PostgreSQL has a good understanding of how to deal with an IN clause, which matters because writing an IN clause is the natural, logical way of thinking for many people. But note that the sub-plan returns very few rows (two) here. The same optimization happens even if the subquery returns a few hundred rows.
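Conceptually, the difference between the two sub-plan strategies seen above can be sketched in a few lines of Python. This is an illustration of the idea, not PostgreSQL internals: the hashed sub-plan builds a hash set once and probes it in O(1) per outer row, while the materialized sub-plan rescans the saved result list for every outer row.

```python
# Stand-ins for pgbench_accounts.bid and the rows returned by the subquery
# "SELECT bid FROM pgbench_branches WHERE bbalance > 0".
outer_bids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] * 3
subquery_result = [4, 7]

# Strategy 1: "hashed SubPlan" -- build a hash set once, probe it per row.
hashed = set(subquery_result)
kept_hashed = [bid for bid in outer_bids if bid not in hashed]

# Strategy 2: "Materialize" -- save the subquery rows, then rescan the whole
# list for every outer row, costing O(len(subquery_result)) per probe.
materialized = list(subquery_result)
kept_materialized = [bid for bid in outer_bids
                     if all(bid != sub for sub in materialized)]

# Both strategies return exactly the same rows; they differ only in cost,
# which is why the plan switch matters so much at scale.
assert kept_hashed == kept_materialized
print(len(kept_hashed))  # 24 of the 30 rows survive the NOT IN filter
```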
But what if the subquery returns a large number of rows (hundreds of thousands)? Let’s try a simple example:
```sql
CREATE TABLE t1 AS
SELECT * FROM generate_series(0, 500000) id;

CREATE TABLE t2 AS
SELECT (random() * 4000000)::integer id
FROM generate_series(0, 4000000);

ANALYZE t1;
ANALYZE t2;

EXPLAIN SELECT id
FROM t1
WHERE id NOT IN (SELECT id FROM t2);
```
In this case, the execution plan is:
```
Seq Scan on t1  (cost=0.00..2583268004.52 rows=250000 width=4)
  Filter: (NOT (SubPlan 1))
  SubPlan 1
    ->  Materialize  (cost=0.00..9333.01 rows=400001 width=4)
          ->  Seq Scan on t2  (cost=0.00..5770.01 rows=400001 width=4)
(5 rows)
```
In this case, the execution plan switches to materializing the result of the sub-plan, and the estimated cost jumps to 2583268004.52! (With PostgreSQL’s default settings, if the number of rows returned from t2 is less than approximately 100k, it uses the hashed sub-plan we discussed.)
This results in a substantial degradation of performance. So the IN clause works great as long as the sub-plan selects a small number of rows. The catch is that during development, tables contain few rows and everything works well; as the number of rows increases, the execution plan drifts, which can cause big performance issues in live production.
There is one more way things can differ: datatype conversions can happen when the query is written in a different way. For example, a statement like:
```sql
EXPLAIN ANALYZE SELECT * FROM emp WHERE gen = ANY(ARRAY['M','F']);
```
results in an implicit datatype conversion of the field’s values to text:
```
Seq Scan on emp  (cost=0.00..1.04 rows=2 width=43) (actual time=0.023..0.026 rows=3 loops=1)
  Filter: ((gen)::text = ANY ('{M,F}'::text[]))
```
Please note the datatype conversion: (gen)::text. On a big table, this kind of conversion has overhead, whereas PostgreSQL does a better job with the IN clause:
```
EXPLAIN ANALYZE SELECT * FROM emp WHERE gen IN ('M','F');

Seq Scan on emp  (cost=0.00..1.04 rows=3 width=43) (actual time=0.030..0.034 rows=3 loops=1)
  Filter: (gen = ANY ('{M,F}'::bpchar[]))
```
Even though the IN clause is internally converted into an ANY clause, there is no datatype conversion of the “gen” field. Instead, the specified values ‘M’ and ‘F’ are converted into bpchar, the internal equivalent of CHAR.
My intention in writing this blog post is not to favor any particular way of writing a query, but to shed some light on where things can go wrong and what should be considered.
In general, I suggest to developers that the key to writing a good SQL statement is to follow a step-by-step process:

- Avoid jumping straight to “how do I break the logic into subqueries.”
- Never assume that a query is fine just because it performs well with a small amount of data in the table.
- Use EXPLAIN to understand what is going on in the background.
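As a minimal illustration of that last habit, the sketch below uses SQLite’s EXPLAIN QUERY PLAN via Python’s sqlite3 module as a stand-in; in PostgreSQL you would run EXPLAIN or EXPLAIN ANALYZE directly in psql, as shown throughout this post.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (id INTEGER)")
conn.executemany("INSERT INTO t1 VALUES (?)", [(i,) for i in range(100)])

# Prefixing a query with EXPLAIN QUERY PLAN shows how it would be executed
# without running it to completion -- the same habit as EXPLAIN in PostgreSQL.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM t1 WHERE id NOT IN (SELECT id FROM t1)"
).fetchall()
for row in plan:
    print(row)
# The plan rows describe table scans; adding an index would change the plan.
```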
In general, EXISTS and a direct JOIN of the tables often give good results. PostgreSQL optimizes the IN clause to a hashed sub-plan in many cases, and IN can produce a better plan and execution in some specific situations. Again, everything depends on how the query is rewritten/transformed internally, so it is worth investing time in rewriting queries for better optimization.
Our white paper “Why Choose PostgreSQL?” looks at the features and benefits of PostgreSQL and presents some practical usage examples. We also examine how PostgreSQL can be useful for companies looking to migrate from Oracle.