This is one of the most common questions asked by developers who write SQL queries against the PostgreSQL database. There are multiple ways in which a sub select or lookup can be framed in a SQL statement. PostgreSQL optimizer is very smart at optimizing queries, and many of the queries can be rewritten/transformed for better performance.
Let’s discuss the topic with an example, for which I am using schema created by pgbench.
Note: For those not familiar with pgbench, it is a micro benchmarking tool shipped with PostgreSQL. A sample pgbench schema can be initialized with some data as follows:
1 |
pgbench -i -s 10 |
For this example, I have updated the branch balance of a couple of branches:
1 |
update pgbench_branches set bbalance=4500000 where bid in (4,7); |
INclusion Queries
The SQL Challenge for this example is: Find out the number of accounts per branch from pgbench_accounts for those branches where branch level balance is greater than zero. This query can be written in four different ways, as per ANSI SQL Standards.
1. Using IN Clause
1 2 3 |
SELECT count(aid),bid FROM pgbench_accounts WHERE bid in (SELECT bid FROM pgbench_branches WHERE bbalance > 0) GROUP BY bid; |
2. Using ANY Clause
1 2 3 |
SELECT count(aid),bid FROM pgbench_accounts WHERE bid = ANY(SELECT bid FROM pgbench_branches WHERE bbalance > 0) GROUP BY bid; |
3. Using EXISTS Clause
1 2 3 4 5 |
SELECT count(aid),bid FROM pgbench_accounts WHERE EXISTS (SELECT bid FROM pgbench_branches WHERE bbalance > 0 AND pgbench_accounts.bid = pgbench_branches.bid) GROUP BY bid; |
4. Using INNER JOIN
1 2 3 4 5 |
SELECT count(aid),a.bid FROM pgbench_accounts a JOIN pgbench_branches b ON a.bid = b.bid WHERE b.bbalance > 0 GROUP BY a.bid; |
While writing the query, one might assume that EXISTS and INNER JOIN might be better because they can use all the logic and optimization for joining two tables, while IN and ANY clauses need to deal with subqueries. However, PostgreSQL (at least PG 10 and above) is smart enough to produce the same execution plan for all four options!.
All of the above queries will be generating the same execution plan as follows:
1 2 3 4 5 6 7 8 9 10 11 12 13 |
HashAggregate (cost=31132.65..31132.75 rows=10 width=12) (actual time=279.625..279.626 rows=2 loops=1) Group Key: a.bid -> Hash Join (cost=1.15..30132.65 rows=200000 width=8) (actual time=63.686..242.956 rows=200000 loops=1) Hash Cond: (a.bid = b.bid) -> Seq Scan on pgbench_accounts a (cost=0.00..26394.00 rows=1000000 width=8) (actual time=0.012..86.250 rows=1000000 loops=1) -> Hash (cost=1.12..1.12 rows=2 width=4) (actual time=0.016..0.016 rows=2 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 9kB -> Seq Scan on pgbench_branches b (cost=0.00..1.12 rows=2 width=4) (actual time=0.010..0.012 rows=2 loops=1) Filter: (bbalance > 0) Rows Removed by Filter: 8 Planning Time: 0.257 ms Execution Time: 279.703 ms (12 rows) |
Note: Suppress the parallel execution for better readability and a simple execution plan. Even with a parallel execution plan, all the queries are producing the same execution plan.
So can we conclude that we can write the query as we are comfortable and PostgreSQL’s intelligence will take care of the rest? Wait! Things can go differently if we take the exclusion scenario.
Exclusion Queries
The SQL challenge becomes: Find out the number of accounts per branch from pgbench_accounts EXCEPT for those branches where branch level balance is greater than zero.
So the four ways to write queries becomes:
1. Using NOT IN
1 2 3 |
SELECT count(aid),bid FROM pgbench_accounts WHERE bid NOT IN (SELECT bid FROM pgbench_branches WHERE bbalance > 0) GROUP BY bid; |
2. Using <> ALL
1 2 3 |
SELECT count(aid),bid FROM pgbench_accounts WHERE bid <> ALL(SELECT bid FROM pgbench_branches WHERE bbalance > 0) GROUP BY bid; |
3. Using NOT EXISTS
1 2 3 4 5 |
SELECT count(aid),bid FROM pgbench_accounts WHERE NOT EXISTS (SELECT bid FROM pgbench_branches WHERE bbalance > 0 AND pgbench_accounts.bid = pgbench_branches.bid) GROUP BY bid; |
4. Using LEFT JOIN and IS NULL
1 2 3 4 5 |
SELECT count(aid),a.bid FROM pgbench_accounts a LEFT JOIN pgbench_branches b ON a.bid = b.bid AND b.bbalance > 0 WHERE b.bid IS NULL GROUP BY a.bid; |
The “NOT IN” and “<> ALL” produces an execution plan with sub-queries (SubPlan). They are respectively:
1 2 3 4 5 6 7 8 9 10 11 12 |
HashAggregate (cost=31395.13..31395.23 rows=10 width=12) (actual time=395.297..395.299 rows=8 loops=1) Group Key: pgbench_accounts.bid -> Seq Scan on pgbench_accounts (cost=1.13..28895.13 rows=500000 width=8) (actual time=0.042..250.086 rows=800000 loops=1) Filter: (NOT (hashed SubPlan 1)) Rows Removed by Filter: 200000 SubPlan 1 -> Seq Scan on pgbench_branches (cost=0.00..1.12 rows=2 width=4) (actual time=0.010..0.012 rows=2 loops=1) Filter: (bbalance > 0) Rows Removed by Filter: |