Where the open source database community meets: Use code PERCONA75 and secure your spot for Percona Live.  Register

Introducing PostgreSQL 15: Working with SELECT DISTINCT

July 13, 2022
Author
Robert Bernier
Share this Post:

It is that time of year when we once again look at the newest version of PostgreSQL.

As tradition dictates here at Percona, the team is given a list of features to write about. Mine happened to be about a basic and important function: SELECT DISTINCT.

Before getting into the details, I would like to mention a couple of caveats about how the results were derived for this blog:

  1. The tables are small and use a simple architecture.
  2. Because this demonstration was performed on a relatively low-powered system, the real-world metrics have the potential to be significantly greater than what is demonstrated here.

What is the PostgreSQL SELECT DISTINCT Clause?

For those new to Postgres, and to the ANSI SQL standard, the SELECT DISTINCT statement eliminates duplicate rows from the result set by matching specified expressions. PostgreSQL DISTINCT keeps one row from each group of duplicates.

For example, given the following table:

This SQL statement returns records by filtering unique values found in column “c1” in sorted order:

Notice, as indicated by column “c2”, that uniqueness on c1 returns the first value found in the table:

This SQL statement returns records by filtering unique values found in column “c2”:

And finally, returning uniqueness for the entire row:

So what is the special new enhancement of DISTINCT? The answer is that it has been parallelized!

In the past, only a single CPU or process was used to count the number of distinct records. However, in Postgres 15, you can now break up the task by running multiple workers in parallel, each assigned to a separate CPU process. Several runtime parameters control this behavior, but the one we will focus on is max_parallel_workers_per_gather.

PostgreSQL DISTINCT Clause Examples: Generating Metrics

Let’s generate some metrics.

To demonstrate the improved performance, three tables were created without indexes and populated with approximately 5,000,000 records. Notice the number of columns for each table: one, five, and 10, respectively.

The next step is to copy the aforementioned data dumps into the following versions of Postgres:

The Postgres binaries were compiled from source, and data clusters were created on the same low-powered hardware using the default, untuned runtime configuration values.

Once populated, the following bash script was executed to generate the results:

Here are the results. The larger the tables become, the greater the performance gains that can be achieved.

PG VERSION 1 column (t1), ms 5 column (t5), ms 10 column (t10), ms
pg96 3,382 9,743 20,026
pg10 2,004 5,746 13,241
pg11 1,932 6,062 14,295
pg12 1,876 5,832 13,214
pg13 1,973 2,358 3,135
pg14 1,948 2,316 2,909
pg15 1,439 1,025 1,245

Query Plan

One of the more interesting aspects of the investigation was reviewing the query plans between different versions of Postgres. For example, the query plan for a single-column DISTINCT was quite similar between the Postgres 9.6 and Postgres 15 plans, ignoring the superior execution time.

The real difference appeared when the number of DISTINCT columns increased, as demonstrated by querying table t10. Here, parallelization is visible in the query plan.

Increasing Performance

Performance enhancements were made by updating the Postgres runtime parameter max_parallel_workers_per_gather. The default value in a newly initialized cluster is 2. As the table below indicates, performance quickly reached diminishing returns due to the restricted capabilities of the testing hardware.

POSTGRES VERSION 15

max_parallel_workers_per_gather 1 column (t1) 5 column (t5) 10 column (t10)
2 1,439 1,025 1,245
3 1,464 875 1,013
4 1,391 858 977
6 1,401 846 1,045
8 1,428 856 993

PostgreSQL Distinct

About Indexes

Performance improvements were not realized when indexes were applied as demonstrated in this query plan.

PG15, table t10 with 10 DISTINCT columns and max_parallel_workers_per_gather=4:

Concluding Thoughts

Running DISTINCT across multiple CPUs is a significant performance improvement. However, keep in mind the risk of diminishing returns as you increase max_parallel_workers_per_gather and approach your hardware’s limitations. As the investigation showed, under normal circumstances, the query planner may decide to use indexes instead of running parallel workers. One way to work around this is to consider disabling runtime parameters such as enable_indexonlyscan and enable_indexscan. Finally, run EXPLAIN ANALYZE to understand what is happening.

Get Percona Support for PostgreSQL

Percona supports DBAs and developers seeking help with their PostgreSQL databases.

Percona Support for PostgreSQL

Additional Percona support and resources include:

Percona Distribution for PostgreSQL Software

Percona Platform

Percona Managed Services

Percona Professional Services

0 0 votes
Article Rating
Subscribe
Notify of
guest

6 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
Oliver
3 years ago

Good to see, thank you.

I see that the query is spilling to disk – from the query plan:

Sort Method: external merge Disk: 68432kB

It would be interesting to see the timing results if the machine is configured with enough memory and/or suitable work_mem etc. parameters so that that doesn’t happen, which would give a more representative use. Otherwise the disk activity could swamp pretty much every other factor.

Peter Gram
3 years ago

you write “This SQL statement returns those records filtering out the UNIQUE values found in column “c1” in SORTED order: select distinct on(c1) * from t_ex;”.
The SQL standard is open to if distinct should return data sorted or random since this i implementation dependent. Oracle used to return values from “distinct” and “group by” sorted but from version 9.x this is not true longer since they started to use a hash parallel function and after this values are not sorted.
The official statement is if you need data in sorted order use a “order by”.

Fazal Majid
3 years ago

It’s worth mentioning that at scale no one uses SELECT DISTINCT but instead fast approximate methods like HyperLogLog as implemented in CitusData’s postgresql-hll:

https://github.com/citusdata/postgresql-hll

Far
Enough.

Said no pioneer ever.
MySQL, PostgreSQL, InnoDB, MariaDB, MongoDB and Kubernetes are trademarks for their respective owners.
© 2026 Percona All Rights Reserved