Connection Queuing in pgBouncer: Is it a Magical Remedy?

Yes, this post is about connection queueing, not just pooling, because “connection pooling” – pre-created connections maintained as a pool – is already a much-celebrated feature. Almost every discussion on connection pooling/pgBouncer starts with the overhead of establishing a new connection to PostgreSQL… and how pre-created connections in the pool can save the world.

But there is a less-celebrated feature in pgBouncer (not denying the others) that can address some really big operational challenges: connection queueing. Many new PostgreSQL users don’t even know it exists. In this blog post, I plan to discuss some of the problems related to connection management first, and then explore how connection queueing can address them.

Problem 1: Spike in Load can Jam and Halt the Server

PostgreSQL has a dedicated backend server process for every user connection, so there will be one backend process running on a CPU core for every active query/session. This means there is a one-to-one mapping between active sessions and running processes on the server. If we consider parallel execution of SQL statements, there will be even more running processes than active sessions. In many real-world cases, a sudden spike in load can result in hundreds of active queries starting at once while the server is equipped with only a small number of CPUs (sometimes just virtual CPUs delivering only a fraction of the performance). As the number of active sessions/processes increases, the overhead of scheduling and context switches takes over. Many times, the host server becomes unresponsive, and even opening a bash shell/terminal can take time. This is quite easy to simulate: just 10 active connections on a two-vCPU server with a SELECT-only workload can cause it.
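To get a rough feel for this on a test machine, such a SELECT-only spike can be generated with pgbench (a sketch; it assumes pgbench is installed and a scratch database named postgres can be used):

```
# One-time initialization of the pgbench tables (scale factor 10 is arbitrary)
pgbench -i -s 10 postgres

# Drive 10 concurrent SELECT-only sessions for 5 minutes
pgbench -c 10 -j 2 -T 300 -S postgres
```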

With two active sessions, the server remained responsive.

When there are 10 active sessions on PostgreSQL, just establishing an ssh connection to the server took 15 seconds.

**These are indicative numbers from a very specific system and do not qualify as a benchmark.
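The responsiveness of the host is easy to check for yourself while the workload runs, for example by timing a fresh ssh login (the host and user names are placeholders; the numbers will obviously differ from system to system):

```
# Time how long a new ssh session takes to open and close
time ssh postgres@db-host 'exit'
```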

Generally, we could see that as the number of active sessions approaches double the number of CPU cores, the performance penalty starts increasing heavily.

Many times, the problem won’t end there. Session-level resource allocations (work_mem, temporary tables, etc.) add to the overall server resource consumption. As the host server slows down, each session takes more time to complete while holding on to its resources, which leads to even more active sessions accumulating. It is a spiral of evil. There are many real-world cases where the entire show ended in a complete halt of the host server, or the OOM killer kicked in, terminated a PostgreSQL process, and forced the instance into crash recovery.

Problem 2: “Too Many Clients Already” Errors

A few smart DBAs will prevent this database disaster by setting max_connections to a conservative value that the server can actually handle, which is the right thing to do from a DB server perspective. Allowing an excessive number of connections to the database can lead to different types of abuses, attacks, and disasters. But the flip side is that an abusive application may be greeted with the dreaded “too many clients” message.
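What the client sees looks roughly like this (the exact wording of the psql prefix varies with the client version; the host, user, and database names are placeholders):

```
$ psql -h db-host -U app_user appdb
psql: FATAL:  sorry, too many clients already
```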

The same will be logged in the PostgreSQL log.

Unfortunately, this could lead to an application crash or misbehavior. From the business point of view, we have just shifted the problem from the database to the application.

Problem 3: Big max_connections Value and Overhead

Because of the above-mentioned problem, it is common to see max_connections set to a very high value. The overhead of connections is probably one of the most actively discussed topics these days because PostgreSQL 14 is expected to include some of the connection scalability fixes. Andres Freund blogged about the details of analyzing the connection scalability problem and how it is addressed.

Even idle connections may occupy server resources like memory. The overhead is generally considered very low on a properly configured server; however, the impact can be heavy in reality. Again, a lot depends on the workload. There are at least a few cases that reported up to 50MB of consumption per session; at that rate, 500 idle connections can result in up to 25GB of memory usage.
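A quick way to see how many such idle sessions are being held open at any moment is a simple query against pg_stat_activity:

```
-- Count client backends that are currently sitting idle
SELECT count(*) AS idle_sessions
FROM   pg_stat_activity
WHERE  state = 'idle';
```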

In addition to this, more connections can lead to more lock-management-related overhead. And don’t forget that the system becomes more vulnerable to sudden spikes as max_connections is increased.

Solution: Connection Queueing

Simply put, connection queueing means holding incoming connections in a queue so that a sudden spike in load can be absorbed. Rather than straight away rejecting new connections, or passing them all through to the server and jamming it, they are put into a queue. This streamlines the execution: the PostgreSQL server keeps doing the work it can actually handle instead of dealing with a jam.

Let me demonstrate with an example. For this demonstration, I set max_connections to “2”, assuming that this is the maximum the server can accommodate without too many context switches. Now an excessive number of connections cannot come in and overload my database.
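A minimal sketch of that change (max_connections requires a restart, and superuser_reserved_connections, which defaults to 3, has to stay below max_connections or the server will refuse to start):

```
-- Run as a superuser, then restart PostgreSQL
ALTER SYSTEM SET max_connections = 2;
ALTER SYSTEM SET superuser_reserved_connections = 0;
```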

A third connection to the database will result in an error as expected.

Now let’s use pgBouncer for the connection queue. Many users may not know that this queueing behavior is there by default. I used a simple pgBouncer configuration for testing.
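The sketch below captures the settings that matter for this demonstration; file locations, database/host names, and auth details are placeholders and will differ per environment:

```
; pgbouncer.ini (sketch); names and paths are placeholders
[databases]
pgtest = host=localhost port=5432 dbname=postgres

[pgbouncer]
listen_addr = *
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
admin_users = postgres
; accept far more client connections than the server-side limit
max_client_conn = 100
; at most one server connection per user/database pair, kept established
default_pool_size = 1
min_pool_size = 1
```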

Yes, the pooler will establish only one connection to the database. pgBouncer establishes this one connection when the first client connection arrives, because min_pool_size is 1. The new server connection shows up in the pgBouncer log.

The pgBouncer pool statistics also show the details.
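These statistics come from pgBouncer’s built-in admin console: connect to the special pgbouncer database on the pgBouncer port as one of the admin_users and run SHOW POOLS (using the placeholder names from the sketch above):

```
$ psql -h localhost -p 6432 -U postgres pgbouncer -c "SHOW POOLS;"
```

The columns of interest for this exercise are cl_active, cl_waiting, sv_active, and sv_idle.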

But the beauty is that we won’t get any more “FATAL: sorry, too many clients already” errors. All client connections are accepted and put into the connection queue. For example, with five client connections established, the cl_active value shows all five.

As each client connection becomes active (starts executing a SQL statement), those that cannot get a server connection right away are put into the waiting state.
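One way to watch this happening (a sketch): fire a long-running statement from each of the queued client sessions and re-run SHOW POOLS on the admin console while they execute:

```
-- From each client session connected through pgBouncer (port 6432)
SELECT pg_sleep(30);
```

While the statements are running, only as many clients as there are server connections in the pool show up under cl_active; the rest accumulate under cl_waiting.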

Each client connection’s work is executed over the available database connection, one after another. This is the case with a single database connection. If the connection count and pool size are increased, multiple client connections can hit the server at the same time and the queue size drops. Below is the same exercise with two connections (pool size two) to the database.
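The change itself is just the pool-size settings from the earlier sketch (the PostgreSQL-side max_connections must, of course, allow the extra server connection):

```
; In the [pgbouncer] section of the sketch above
default_pool_size = 2
min_pool_size = 2
```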