SQL Optimizations in PostgreSQL: IN vs EXISTS vs ANY/ALL vs JOIN

SQL optimizations in PostgreSQLThis is one of the most common questions asked by developers who write SQL queries against the PostgreSQL database. There are multiple ways in which a sub select or lookup can be framed in a SQL statement. PostgreSQL optimizer is very smart at optimizing queries, and many of the queries can be rewritten/transformed for better performance.

Let’s discuss the topic with an example, for which I am using schema created by pgbench.

Note: For those not familiar with pgbench, it is a micro benchmarking tool shipped with PostgreSQL. A sample pgbench schema can be initialized with some data as follows:

For this example, I have updated the branch balance of a couple of branches:

INclusion Queries

The SQL Challenge for this example is: Find out the number of accounts per branch from pgbench_accounts for those branches where branch level balance is greater than zero. This query can be written in four different ways, as per ANSI SQL Standards.

1.  Using IN Clause

2. Using ANY Clause

3. Using EXISTS Clause

4. Using INNER JOIN

While writing the query, one might assume that EXISTS and INNER JOIN might be better because they can use all the logic and optimization for joining two tables, while IN and ANY clauses need to deal with subqueries. However, PostgreSQL (at least PG 10 and above) is smart enough to produce the same execution plan for all four options!.

All of the above queries will be generating the same execution plan as follows:

Note: Suppress the parallel execution for better readability and a simple execution plan. Even with a parallel execution plan, all the queries are producing the same execution plan.

So can we conclude that we can write the query as we are comfortable and PostgreSQL’s intelligence will take care of the rest? Wait! Things can go differently if we take the exclusion scenario.

Exclusion Queries

The SQL challenge becomes: Find out the number of accounts per branch from pgbench_accounts EXCEPT for those branches where branch level balance is greater than zero.

So the four ways to write queries becomes:

1. Using NOT IN

2. Using <> ALL

3. Using NOT EXISTS

4. Using LEFT JOIN and IS NULL

The “NOT IN” and “<> ALL” produces an execution plan with sub-queries (SubPlan). They are respectively:

and

While NOT EXISTS and LEFT JOIN produces the same execution plan without a sub-plan as follows:

These direct hash (anti) joins between the tables is the smartest way to answer the query. So this stands as a strong reason for recommending the EXISTS syntax or JOIN syntax. So the general rule of thumb favoring EXISTS/JOINs is holding good.

But wait! Do we see a better execution time with the NOT IN clause even with a sub-plan? Yes. PostgreSQL has done excellent optimization, thereby preparing a Hash of sub-plan NOT (hashed SubPlan 1). So PostgreSQL has a better understanding of how to deal with an IN clause, which is the logical way of thinking, as many people tend to write with IN clause. But we have very few rows (two) returned by sub-plan. The same happens even if the subquery returns a few hundred rows.

But what if there is a large number of rows (few hundreds of thousands of rows) returned by subquery? Let’s try a simple example:

In this case, the execution plan is:

In this case, the execution plan switches to the materialization of the result of the sub-plan, and the estimated cost jumps to 25831564501.02! (With PostgreSQL default settings, if the number of rows from t2 is lesser than 100k approximately, it uses the hashed sub-plan as we discussed.)
This will result in substantial degradation of performance. So the IN clause works great if the sub-plan selects a fewer number of rows.

The catch here is when development happens, there will be fewer rows in tables, and it works differently as the number of rows increases, as the execution plan drifts and can result in big performance issues in live production.

Is There More Complexity We Should Be Aware Of?

Yes, there could be datatype conversions happening when we write the query in a different way.

For example, a statement like:

is resulting in implicit datatype conversion of the values of the fields to text.

Please note the datatype conversion: (gen)::text . On a big table, this type of conversion will have overhead, whereas PostgreSQL does a better job in dealing with the IN clause.

Even though the IN clause is converted into the ANY clause, there is no data type conversion of the “gen” field. And the specified values ‘M’,’F’ are converted into bpchar, which is an internal equivalent of CHAR.

Summary

My intention while writing this blog post is not to favor any particular way of writing a query, but to shed some light on where things can go wrong and what should be considered.

In general, I used to suggest to developers that the key to writing a good SQL statement is to follow step by step process.

  1. First, make a list of tables from which the data should be retrieved.
  2. Then think about how to JOIN those tables.
  3. Think about how to have the minimum records participating in the join condition.

Avoid thinking from “How to break the logic” into subqueries.

Never assume that the query is performing well with a small amount of data in the table.

Use an EXPLAIN plan to understand what is going on in the background.

In general, EXISTS and direct JOIN of tables often results in good results. PostgreSQL optimizes the IN clause to a hashed sub-plan in many cases. “IN” can result in a better plan and execution in some specific situations. Again, everything depends on how a query is rewritten/transformed internally. It is worth investing time in rewriting queries for better optimization.

 


Our white paper “Why Choose PostgreSQL?” looks at the features and benefits of PostgreSQL and presents some practical usage examples. We also examine how PostgreSQL can be useful for companies looking to migrate from Oracle.

Download PDF

Share this post

Comments (3)

  • Stanislav Sumariuk Reply

    I will definitely revisit some of my queries. Thank you for this post.

    April 18, 2020 at 4:46 am
  • Scott C. Reply

    The exclusion queries are not equivalent and can produce vastly different results. Just try it where the exclusion set contains 100 values, one of which is null.

    April 18, 2020 at 10:28 am
  • Francisco Puga Reply

    Really nice post.

    I know that is centered on performance but as some people points that NOT IN is a don’t [1], include a note about how NULLs are handled in the different options or at least writing a warn about it will be nice.

    [1] https://wiki.postgresql.org/wiki/Don't_Do_This

    June 23, 2020 at 9:16 am

Leave a Reply