Introducing PostgreSQL 15: Working with SELECT DISTINCT

July 13, 2022
Author
Robert Bernier
Share this Post:

Well, it’s that time of the year when once again we have a look at the newest version of PostgreSQL.


As tradition dictates, here at Percona, the team is given a list of features to write about. Mine happened to be about a very basic and, I might add, important function i.e. SELECT DISTINCT.


Before getting into the details I’d like to mention a couple of caveats regarding how the results were derived for this blog:



  1. The tables are pretty small and of a simple architecture.

  2. Because this demonstration was performed upon a relatively low-powered system, the real metrics have the potential of being significantly greater than what is demonstrated.


What is the PostgreSQL SELECT DISTINCT Clause?


For those new to postgres, and the ANSI SQL standard for that matter, the SELECT DISTINCT statement eliminates duplicate rows from the result set by matching specified expressions. (PostgreSQL DISTINCT keeps one entry row from each group of duplicates.)


For example, given the following table:



This SQL statement returns those records filtering out the UNIQUE values found in column “c1” in SORTED order:



Notice, as indicated by column “c2”, that c1 uniqueness returns the first value found in the table:



This SQL statement returns those records filtering out UNIQUE values found in column “c2”




And finally, of course, returning uniqueness for the entire row:




So what’s this special new enhancement of DISTINCT you ask? The answer is that it’s been parallelized!


In the past, only a single CPU/process was used to count the number of distinct records. However, in postgres version 15 one can now break up the task of counting by running multiple numbers of workers in parallel each assigned to a separate CPU process. There are a number of runtime parameters controlling this behavior but the one we’ll focus on is max_parallel_workers_per_gather.


PostgreSQL DISTINCT Clause examples: generating metrics


Let’s generate some metrics!


In order to demonstrate this improved performance three tables were created, without indexes, and populated with approximately 5,000,000 records. Notice the number of columns for each table i.e. one, five, and 10 respectively:









The next step is to copy the aforementioned data dumps into the following versions of postgres:



The postgres binaries were compiled from the source and data clusters were created on the same low-powered hardware using the default, and untuned, runtime configuration values.


Once populated, the following bash script was executed to generate the results:



And here are the results: One can see that the larger the tables become the greater the performance gains that can be achieved.






















































PG VERSION



1 column (t1), ms



5 column (t5), ms



10 column (t10), ms



pg96



3,382



9,743



20,026



pg10



2,004



5,746



13,241



pg11



1,932



6,062



14,295



pg12



1,876



5,832



13,214



pg13



1,973



2,358



3,135



pg14



1,948



2,316



2,909



pg15



1,439



1,025



1,245



 



Query plan


One of the more interesting aspects of the investigation was reviewing the query plans between the different versions of postgres. For example, the query plan for a single column DISTINCT was actually quite similar, ignoring the superior execution time of course, between the postgres 9.6 and 15 plans respectively.




The real difference showed up when the number of DISTINCT columns were increased, as demonstrated by querying table t10. One can see parallelization in action!




Increasing the performance


Performance enhancements were made by updating the postgres runtime parameter max_parallel_workers_per_gather. The default value in a newly initialized cluster is 2. As the table below indicates, it quickly became an issue of diminishing returns due to the restricted capabilities of the testing hardware itself.


POSTGRES VERSION 15










































max_parallel_workers_per_gather



1 column (t1)



5 column (t5)



10 column (t10)



2



1,439



1,025



1,245



3



1,464



875



1,013



4



1,391



858



977



6



1,401



846



1,045



8



1,428



856



993



PostgreSQL Distinct


About indexes 


Performance improvements were not realized when indexes were applied as demonstrated in this query plan.


PG15, TABLE T10(10 DISTINCT columns), and max_parallel_workers_per_gather=4:



Concluding thoughts


Running DISTINCT across multiple CPUs is a big advance in performance capabilities. But keep in mind the risk of diminishing performance as you increase the number of max_parallel_workers_per_gather and you approach your hardware’s limitations. And as the investigation showed, under normal circumstances, the query planner might decide to use indexes instead of running parallel workers. One way to get around this is to consider disabling runtime parameters such as enable_indexonlyscan and enable_indexscan. Finally, don’t forget to run EXPLAIN ANALYZE in order to understand what’s going on.


Get Percona Support for PostgreSQL


Percona supports DBAs and developers seeking help with their PostgreSQL databases.


Percona Support for PostgreSQL


Additional Percona support and resources include:


Percona Distribution for PostgreSQL Software


Percona Platform


Percona Managed Services


Percona Professional Services


Subscribe
Notify of
guest

0 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments

Far
Enough.

Said no pioneer ever.
MySQL, PostgreSQL, InnoDB, MariaDB, MongoDB and Kubernetes are trademarks for their respective owners.
© 2026 Percona All Rights Reserved