This is one of the most common questions asked by developers who write SQL queries against PostgreSQL databases. There are multiple ways in which a sub-select or lookup can be framed in a SQL statement. The PostgreSQL optimizer is very smart at optimizing queries, and many queries can be rewritten/transformed internally for better performance.
Let’s discuss the topic with an example, using the schema created by pgbench.
Note: For those not familiar with pgbench, it is a micro-benchmarking tool shipped with PostgreSQL. A sample pgbench schema can be initialized with some data as follows:
```shell
pgbench -i -s 10
```
For this example, I have updated the branch balance of a couple of branches:
```sql
UPDATE pgbench_branches SET bbalance = 4500000 WHERE bid IN (4, 7);
```
The SQL challenge for this example is: find the number of accounts per branch in pgbench_accounts, for those branches whose branch-level balance is greater than zero. Per the ANSI SQL standard, this query can be written in four different ways.
```sql
SELECT count(aid),bid FROM pgbench_accounts WHERE
bid IN (SELECT bid FROM pgbench_branches WHERE bbalance > 0)
GROUP BY bid;
```
```sql
SELECT count(aid),bid FROM pgbench_accounts WHERE
bid = ANY(SELECT bid FROM pgbench_branches WHERE bbalance > 0)
GROUP BY bid;
```
```sql
SELECT count(aid),bid
FROM pgbench_accounts WHERE EXISTS
(SELECT bid FROM pgbench_branches WHERE bbalance > 0
AND pgbench_accounts.bid = pgbench_branches.bid)
GROUP BY bid;
```
```sql
SELECT count(aid),a.bid
FROM pgbench_accounts a
JOIN pgbench_branches b ON a.bid = b.bid
WHERE b.bbalance > 0
GROUP BY a.bid;
```
While writing the query, one might assume that EXISTS and INNER JOIN would perform better because they can use all the logic and optimizations for joining two tables, while the IN and ANY clauses need to deal with subqueries. However, PostgreSQL (at least PG 10 and above) is smart enough to produce the same execution plan for all four options!
All of the above queries generate the same execution plan:
```
HashAggregate  (cost=31132.65..31132.75 rows=10 width=12) (actual time=279.625..279.626 rows=2 loops=1)
  Group Key: a.bid
  ->  Hash Join  (cost=1.15..30132.65 rows=200000 width=8) (actual time=63.686..242.956 rows=200000 loops=1)
        Hash Cond: (a.bid = b.bid)
        ->  Seq Scan on pgbench_accounts a  (cost=0.00..26394.00 rows=1000000 width=8) (actual time=0.012..86.250 rows=1000000 loops=1)
        ->  Hash  (cost=1.12..1.12 rows=2 width=4) (actual time=0.016..0.016 rows=2 loops=1)
              Buckets: 1024  Batches: 1  Memory Usage: 9kB
              ->  Seq Scan on pgbench_branches b  (cost=0.00..1.12 rows=2 width=4) (actual time=0.010..0.012 rows=2 loops=1)
                    Filter: (bbalance > 0)
                    Rows Removed by Filter: 8
Planning Time: 0.257 ms
Execution Time: 279.703 ms
(12 rows)
```
Note: Parallel execution was suppressed here for better readability and a simpler execution plan. Even with parallel execution enabled, all the queries produce the same plan.
So can we conclude that we can write the query however we are comfortable, and PostgreSQL’s intelligence will take care of the rest? Wait! Things can go differently if we consider the exclusion scenario.
The SQL challenge becomes: find the number of accounts per branch in pgbench_accounts, EXCEPT for those branches whose branch-level balance is greater than zero.
The four ways to write the query become:
```sql
SELECT count(aid),bid FROM pgbench_accounts WHERE
bid NOT IN (SELECT bid FROM pgbench_branches WHERE bbalance > 0)
GROUP BY bid;
```
```sql
SELECT count(aid),bid FROM pgbench_accounts WHERE
bid <> ALL(SELECT bid FROM pgbench_branches WHERE bbalance > 0)
GROUP BY bid;
```
```sql
SELECT count(aid),bid
FROM pgbench_accounts WHERE NOT EXISTS
(SELECT bid FROM pgbench_branches WHERE bbalance > 0
AND pgbench_accounts.bid = pgbench_branches.bid)
GROUP BY bid;
```
```sql
SELECT count(aid),a.bid
FROM pgbench_accounts a
LEFT JOIN pgbench_branches b ON a.bid = b.bid AND b.bbalance > 0
WHERE b.bid IS NULL
GROUP BY a.bid;
```
The “NOT IN” and “<> ALL” versions produce execution plans with subqueries (SubPlan). They are, respectively:
```
HashAggregate  (cost=31395.13..31395.23 rows=10 width=12) (actual time=395.297..395.299 rows=8 loops=1)
  Group Key: pgbench_accounts.bid
  ->  Seq Scan on pgbench_accounts  (cost=1.13..28895.13 rows=500000 width=8) (actual time=0.042..250.086 rows=800000 loops=1)
        Filter: (NOT (hashed SubPlan 1))
        Rows Removed by Filter: 200000
        SubPlan 1
          ->  Seq Scan on pgbench_branches  (cost=0.00..1.12 rows=2 width=4) (actual time=0.010..0.012 rows=2 loops=1)
                Filter: (bbalance > 0)
                Rows Removed by Filter: 8
Planning Time: 0.197 ms
Execution Time: 395.363 ms
(11 rows)
```
and
```
HashAggregate  (cost=601394.00..601394.10 rows=10 width=12) (actual time=731.987..731.989 rows=8 loops=1)
  Group Key: pgbench_accounts.bid
  ->  Seq Scan on pgbench_accounts  (cost=0.00..598894.00 rows=500000 width=8) (actual time=0.041..579.264 rows=800000 loops=1)
        Filter: (SubPlan 1)
        Rows Removed by Filter: 200000
        SubPlan 1
          ->  Materialize  (cost=0.00..1.14 rows=2 width=4) (actual time=0.000..0.000 rows=2 loops=1000000)
                ->  Seq Scan on pgbench_branches  (cost=0.00..1.12 rows=2 width=4) (actual time=0.010..0.012 rows=2 loops=1)
                      Filter: (bbalance > 0)
                      Rows Removed by Filter: 8
Planning Time: 0.203 ms
Execution Time: 732.142 ms
(12 rows)
```
NOT EXISTS and LEFT JOIN, on the other hand, produce the same execution plan, without a sub-plan:
```
HashAggregate  (cost=41245.15..41245.25 rows=10 width=12) (actual time=500.193..500.195 rows=8 loops=1)
  Group Key: a.bid
  ->  Hash Anti Join  (cost=1.15..37245.15 rows=800000 width=8) (actual time=0.041..344.845 rows=800000 loops=1)
        Hash Cond: (a.bid = b.bid)
        ->  Seq Scan on pgbench_accounts a  (cost=0.00..26394.00 rows=1000000 width=8) (actual time=0.013..110.645 rows=1000000 loops=1)
        ->  Hash  (cost=1.12..1.12 rows=2 width=4) (actual time=0.018..0.018 rows=2 loops=1)
              Buckets: 1024  Batches: 1  Memory Usage: 9kB
              ->  Seq Scan on pgbench_branches b  (cost=0.00..1.12 rows=2 width=4) (actual time=0.011..0.012 rows=2 loops=1)
                    Filter: (bbalance > 0)
                    Rows Removed by Filter: 8
Planning Time: 0.248 ms
Execution Time: 500.266 ms
(12 rows)
```
A direct hash (anti) join between the tables is the smartest way to answer this query, which is a strong reason for recommending the EXISTS or JOIN syntax. So far, the general rule of thumb favoring EXISTS/JOINs holds good.
But wait! Do we see a better execution time with the NOT IN clause, even though it uses a sub-plan? Yes. PostgreSQL has done an excellent optimization here, building a hash from the sub-plan result: NOT (hashed SubPlan 1). So PostgreSQL has a good understanding of how to deal with an IN clause, which matters because writing an IN clause is the natural, logical way of thinking for many people. But note that the sub-plan returns very few rows (two) here. The same optimization happens even if the subquery returns a few hundred rows.
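Conceptually, the difference between the two sub-plan strategies seen above can be sketched in a few lines of Python. This is an illustration of the idea, not PostgreSQL internals: the hashed sub-plan builds a hash set once and probes it in O(1) per outer row, while the materialized sub-plan rescans the saved result list for every outer row.

```python
# Stand-ins for pgbench_accounts.bid and the rows returned by the subquery
# "SELECT bid FROM pgbench_branches WHERE bbalance > 0".
outer_bids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] * 3
subquery_result = [4, 7]

# Strategy 1: "hashed SubPlan" -- build a hash set once, probe it per row.
hashed = set(subquery_result)
kept_hashed = [bid for bid in outer_bids if bid not in hashed]

# Strategy 2: "Materialize" -- save the subquery rows, then rescan the whole
# list for every outer row, costing O(len(subquery_result)) per probe.
materialized = list(subquery_result)
kept_materialized = [bid for bid in outer_bids
                     if all(bid != sub for sub in materialized)]

# Both strategies return exactly the same rows; they differ only in cost,
# which is why the plan switch matters so much at scale.
assert kept_hashed == kept_materialized
print(len(kept_hashed))  # 24 of the 30 rows survive the NOT IN filter
```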
But what if the subquery returns a large number of rows (hundreds of thousands)? Let’s try a simple example:
```sql
CREATE TABLE t1 AS
SELECT * FROM generate_series(0, 500000) id;

CREATE TABLE t2 AS
SELECT (random() * 4000000)::integer id
FROM generate_series(0, 4000000);

ANALYZE t1;
ANALYZE t2;

EXPLAIN SELECT id
FROM t1
WHERE id NOT IN (SELECT id FROM t2);
```
In this case, the execution plan is:
```
Seq Scan on t1  (cost=0.00..2583268004.52 rows=250000 width=4)
  Filter: (NOT (SubPlan 1))
  SubPlan 1
    ->  Materialize  (cost=0.00..9333.01 rows=400001 width=4)
          ->  Seq Scan on t2  (cost=0.00..5770.01 rows=400001 width=4)
(5 rows)
```
In this case, the execution plan switches to materializing the result of the sub-plan, and the estimated cost jumps to 2583268004.52! (With PostgreSQL’s default settings, if the number of rows returned from t2 is less than approximately 100k, it uses the hashed sub-plan we discussed.)
This results in a substantial degradation of performance. So the IN clause works great as long as the sub-plan selects a small number of rows. The catch is that during development, tables contain few rows and everything works well; as the number of rows increases, the execution plan drifts, which can cause big performance issues in live production.
There is one more way things can differ: datatype conversions can happen when the query is written in a different way. For example, a statement like:
```sql
EXPLAIN ANALYZE SELECT * FROM emp WHERE gen = ANY(ARRAY['M','F']);
```
results in an implicit datatype conversion of the field’s values to text:
```
Seq Scan on emp  (cost=0.00..1.04 rows=2 width=43) (actual time=0.023..0.026 rows=3 loops=1)
  Filter: ((gen)::text = ANY ('{M,F}'::text[]))
```
Please note the datatype conversion: (gen)::text. On a big table, this kind of conversion has overhead, whereas PostgreSQL does a better job with the IN clause:
```
EXPLAIN ANALYZE SELECT * FROM emp WHERE gen IN ('M','F');

Seq Scan on emp  (cost=0.00..1.04 rows=3 width=43) (actual time=0.030..0.034 rows=3 loops=1)
  Filter: (gen = ANY ('{M,F}'::bpchar[]))
```
Even though the IN clause is internally converted into an ANY clause, there is no datatype conversion of the “gen” field. Instead, the specified values ‘M’ and ‘F’ are converted into bpchar, the internal equivalent of CHAR.
My intention in writing this blog post is not to favor any particular way of writing a query, but to shed some light on where things can go wrong and what should be considered.
In general, I suggest to developers that the key to writing a good SQL statement is to follow a step-by-step process:

- Avoid jumping straight to “how do I break the logic into subqueries.”
- Never assume that a query is fine just because it performs well with a small amount of data in the table.
- Use EXPLAIN to understand what is going on in the background.
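As a minimal illustration of that last habit, the sketch below uses SQLite’s EXPLAIN QUERY PLAN via Python’s sqlite3 module as a stand-in; in PostgreSQL you would run EXPLAIN or EXPLAIN ANALYZE directly in psql, as shown throughout this post.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (id INTEGER)")
conn.executemany("INSERT INTO t1 VALUES (?)", [(i,) for i in range(100)])

# Prefixing a query with EXPLAIN QUERY PLAN shows how it would be executed
# without running it to completion -- the same habit as EXPLAIN in PostgreSQL.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM t1 WHERE id NOT IN (SELECT id FROM t1)"
).fetchall()
for row in plan:
    print(row)
# The plan rows describe table scans; adding an index would change the plan.
```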
In general, EXISTS and a direct JOIN of the tables often give good results. PostgreSQL optimizes the IN clause to a hashed sub-plan in many cases, and IN can produce a better plan and execution in some specific situations. Again, everything depends on how the query is rewritten/transformed internally, so it is worth investing time in rewriting queries for better optimization.
Our white paper “Why Choose PostgreSQL?” looks at the features and benefits of PostgreSQL and presents some practical usage examples. We also examine how PostgreSQL can be useful for companies looking to migrate from Oracle.