Patroni exposes a REST API that lets HAProxy and other load balancers perform HTTP health checks. This blog post explains how HAProxy uses Patroni's health check endpoints and how to debug the “Server is DOWN” status warnings these checks can produce.
HAProxy and Patroni setup:
Sample configuration:
    global
        maxconn 100

    defaults
        log global
        mode tcp
        retries 2
        timeout client 30m
        timeout connect 4s
        timeout server 30m
        timeout check 5s

    listen stats
        mode http
        bind *:7000
        stats enable
        stats uri /

    listen primary
        bind *:5000
        option httpchk OPTIONS /primary
        http-check expect status 200
        default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
        server pg0 172.29.0.2:5432 maxconn 100 check port 8008
        server pg1 172.29.0.3:5432 maxconn 100 check port 8008
        server pg2 172.29.0.4:5432 maxconn 100 check port 8008

    listen standbys
        bind *:5001
        option httpchk OPTIONS /replica
        http-check expect status 200
        default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
        server pg0 172.29.0.2:5432 maxconn 100 check port 8008
        server pg1 172.29.0.3:5432 maxconn 100 check port 8008
        server pg2 172.29.0.4:5432 maxconn 100 check port 8008
    [postgres@node0 sbin]$ patronictl list
    + Cluster: stampede (7453012617485928545) -----------+----+-----------+
    | Member          | Host       | Role    | State     | TL | Lag in MB |
    +-----------------+------------+---------+-----------+----+-----------+
    | cluster1-0      | 172.29.0.2 | Replica | streaming |  2 |         0 |
    | cluster118870-1 | 172.29.0.4 | Replica | streaming |  2 |         0 |
    | cluster128215-1 | 172.29.0.3 | Leader  | running   |  2 |           |
    +-----------------+------------+---------+-----------+----+-----------+
- OPTIONS /primary: This is the primary health check endpoint. The Patroni REST API returns HTTP status code 200 only when the Patroni node is running as the primary with leader lock.
- OPTIONS /replica: This is the replica health check endpoint. The Patroni REST API returns HTTP status code 200 only when the Patroni node is in the state running, its role is replica, and the noloadbalance tag is not set.
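The two rules above can be written as a small decision function. This is only a sketch of the behaviour described in this post, not Patroni's actual implementation; the function name `expected_status` and its parameters are hypothetical, chosen for illustration.

```python
def expected_status(endpoint, role, state, has_leader_lock=False, noloadbalance=False):
    """Return the HTTP status the rules above predict for a health check.

    Illustrative model only: the name and parameters are assumptions,
    not part of Patroni's API.
    """
    if endpoint == "/primary":
        # 200 only for the primary that holds the leader lock.
        ok = role == "primary" and has_leader_lock
    elif endpoint == "/replica":
        # 200 only for a running replica without the noloadbalance tag.
        ok = role == "replica" and state == "running" and not noloadbalance
    else:
        raise ValueError(f"endpoint not modelled in this sketch: {endpoint}")
    # 503 is the failure code that appears in the warnings in this post.
    return 200 if ok else 503
```

With this model, a replica asked for `/primary` and a leader asked for `/replica` both yield 503, which is the situation the rest of this post examines.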
The complete list of health check endpoints is available in the Patroni REST API documentation.
HAProxy and Patroni [WARNING] Server is DOWN, code: 503, info: “Service Unavailable”
In some cases, HAProxy prints warnings like the following at startup.
    [WARNING] (16676) : Server primary/pg0 is DOWN, reason: Layer7 wrong status, code: 503, info: "Service Unavailable", check duration: 1ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
    [WARNING] (16676) : Server standbys/pg1 is DOWN, reason: Layer7 wrong status, code: 503, info: "Service Unavailable", check duration: 1ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
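When there are many such lines, it can help to pull out just the listener, server, and status code. The following is a minimal sketch; the regular expression is written against the warning format shown above (other HAProxy versions may word these messages differently), and `parse_down_warning` is a hypothetical helper name.

```python
import re

# Pattern derived from the warning lines quoted in this post.
DOWN_RE = re.compile(
    r"Server (?P<proxy>[^/\s]+)/(?P<server>\S+) is DOWN, "
    r"reason: (?P<reason>[^,]+), code: (?P<code>\d+)"
)

def parse_down_warning(line):
    """Return (proxy, server, reason, code) for a DOWN warning, else None."""
    match = DOWN_RE.search(line)
    if match is None:
        return None
    return (match["proxy"], match["server"], match["reason"], int(match["code"]))
```

Applied to the first warning above, the function returns the listener `primary`, the server `pg0`, the reason `Layer7 wrong status`, and the code 503.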
Root cause
HAProxy queries the Patroni REST API endpoints to learn each node's status according to its current role. Listing every node under both listeners, as in the configuration below, therefore causes this warning.
PG nodes role:
    [postgres@node0 sbin]$ patronictl list
    + Cluster: stampede (7453012617485928545) -----------+----+-----------+
    | Member          | Host       | Role    | State     | TL | Lag in MB |
    +-----------------+------------+---------+-----------+----+-----------+
    | cluster1-0      | 172.29.0.2 | Replica | streaming |  2 |         0 |
    | cluster118870-1 | 172.29.0.4 | Replica | streaming |  2 |         0 |
    | cluster128215-1 | 172.29.0.3 | Leader  | running   |  2 |           |
    +-----------------+------------+---------+-----------+----+-----------+
HAProxy http-check config:

    listen primary
        bind *:5000
        option httpchk OPTIONS /primary
        http-check expect status 200
        default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
        server pg0 172.29.0.2:5432 maxconn 100 check port 8008
        server pg1 172.29.0.3:5432 maxconn 100 check port 8008
        server pg2 172.29.0.4:5432 maxconn 100 check port 8008

    listen standbys
        bind *:5001
        option httpchk OPTIONS /replica
        http-check expect status 200
        default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
        server pg0 172.29.0.2:5432 maxconn 100 check port 8008
        server pg1 172.29.0.3:5432 maxconn 100 check port 8008
        server pg2 172.29.0.4:5432 maxconn 100 check port 8008
When all PostgreSQL nodes are listed under both the primary and the replica httpchk sections, the Patroni REST API returns a non-200 status code (for example, 503) for any node whose current role (Primary/Replica) does not match the endpoint being checked.
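To see which status code each node returns without going through HAProxy, the endpoints can be queried directly. Below is a minimal sketch using Python's standard `http.client`; it assumes the Patroni REST API listens on port 8008, as in the configuration in this post, and the helper names `probe` and `probe_all` are hypothetical.

```python
import http.client

def probe(host, path, port=8008, timeout=2.0):
    """Send OPTIONS <path> to a Patroni REST API and return the status code."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("OPTIONS", path)
        return conn.getresponse().status
    finally:
        conn.close()

def probe_all(hosts, paths=("/primary", "/replica"), port=8008):
    """Return {host: {path: status}}; nodes that cannot be reached map to None."""
    results = {}
    for host in hosts:
        results[host] = {}
        for path in paths:
            try:
                results[host][path] = probe(host, path, port)
            except (OSError, http.client.HTTPException):
                results[host][path] = None
    return results
```

For the cluster in this post, calling `probe_all(["172.29.0.2", "172.29.0.3", "172.29.0.4"])` would be expected to report 200 for `/primary` only on the leader, 172.29.0.3, and 200 for `/replica` only on the two replicas.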
You can see these status codes in the Patroni logs, as shown in the example below.
Enable DEBUG logging for Patroni (in the Patroni configuration file) to see the API response status messages.
    log:
      level: DEBUG
Patroni debug log:
    INFO: no action. I am (cluster1-0), a secondary, and following a leader (cluster128215-1)
    DEBUG: API thread: 172.29.0.2 - - "OPTIONS /replica HTTP/1.0" 200 - latency: 1.128 ms
    DEBUG: API thread: 172.29.0.2 - - "OPTIONS /leader HTTP/1.0" 503 - latency: 1.160 ms

In such cases, we see the following warnings in HAProxy logs:

    [postgres@node0 sbin]$ ./haproxy -W -f haproxy.cfg
    [NOTICE] (16674) : New worker (16676) forked
    [NOTICE] (16674) : Loading success.
HAProxy Warnings:
    [WARNING] (16676) : Server primary/pg0 is DOWN, reason: Layer7 wrong status, code: 503, info: "Service Unavailable", check duration: 1ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
    [WARNING] (16676) : Server primary/pg2 is DOWN, reason: Layer7 wrong status, code: 503, info: "Service Unavailable", check duration: 1ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
    [WARNING] (16676) : Server standbys/pg1 is DOWN, reason: Layer7 wrong status, code: 503, info: "Service Unavailable", check duration: 1ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
All nodes, except the current primary/leader node, will show “DOWN” status for the /primary endpoint, and the current primary/leader node will show as “DOWN” for the /replica endpoint.
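This rule can be checked against the cluster in this post with a short sketch. The server names and addresses come from the HAProxy configuration and the roles from the `patronictl list` output; the function name `expected_down` is hypothetical.

```python
def expected_down(servers, roles):
    """Map each listener to the servers whose health check should fail.

    servers: {haproxy_server_name: host}; roles: {host: "Leader" | "Replica"}.
    """
    down = {"primary": [], "standbys": []}
    for name, host in servers.items():
        if roles[host] == "Leader":
            down["standbys"].append(name)  # the leader fails the /replica check
        else:
            down["primary"].append(name)   # replicas fail the /primary check
    return down

servers = {"pg0": "172.29.0.2", "pg1": "172.29.0.3", "pg2": "172.29.0.4"}
roles = {"172.29.0.2": "Replica", "172.29.0.3": "Leader", "172.29.0.4": "Replica"}
print(expected_down(servers, roles))
# {'primary': ['pg0', 'pg2'], 'standbys': ['pg1']}
```

The result matches the three warnings in this post: primary/pg0, primary/pg2, and standbys/pg1.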
These [WARNING] messages are harmless and expected; connections through the HAProxy ports should still work for the servers that pass the respective endpoint's check.
Further reading: https://patroni.readthedocs.io/en/latest/rest_api.html#health-check-endpoints