Patroni has a REST API that allows HAProxy and other kinds of load balancers to perform HTTP health checks. This blog post explains how HAProxy uses these health check endpoints with Patroni and how to interpret the server status warnings that can result.
Sample configuration:
```
global
    maxconn 100

defaults
    log global
    mode tcp
    retries 2
    timeout client 30m
    timeout connect 4s
    timeout server 30m
    timeout check 5s

listen stats
    mode http
    bind *:7000
    stats enable
    stats uri /

listen primary
    bind *:5000
    option httpchk OPTIONS /primary
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server pg0 172.29.0.2:5432 maxconn 100 check port 8008
    server pg1 172.29.0.3:5432 maxconn 100 check port 8008
    server pg2 172.29.0.4:5432 maxconn 100 check port 8008

listen standbys
    bind *:5001
    option httpchk OPTIONS /replica
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server pg0 172.29.0.2:5432 maxconn 100 check port 8008
    server pg1 172.29.0.3:5432 maxconn 100 check port 8008
    server pg2 172.29.0.4:5432 maxconn 100 check port 8008
```
```
[postgres@node0 sbin]$ patronictl list
+ Cluster: stampede (7453012617485928545) -----------+----+-----------+
| Member          | Host       | Role    | State     | TL | Lag in MB |
+-----------------+------------+---------+-----------+----+-----------+
| cluster1-0      | 172.29.0.2 | Replica | streaming |  2 |         0 |
| cluster118870-1 | 172.29.0.4 | Replica | streaming |  2 |         0 |
| cluster128215-1 | 172.29.0.3 | Leader  | running   |  2 |           |
+-----------------+------------+---------+-----------+----+-----------+
```
The complete list of health check endpoints is in the Patroni REST API documentation, linked under "Further reading" at the end of this post.
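To see what a given endpoint returns for a node, you can send the same kind of request HAProxy sends. The following is a minimal sketch, not part of Patroni or HAProxy: the helper names `endpoint_url` and `probe` are hypothetical, and it assumes the REST API is reachable over plain HTTP on port 8008 without authentication, as the `check port 8008` setting in the sample configuration suggests.

```python
import urllib.error
import urllib.request


def endpoint_url(host: str, endpoint: str, port: int = 8008) -> str:
    """Build the health check URL (port 8008 is assumed from the sample config)."""
    return f"http://{host}:{port}/{endpoint.lstrip('/')}"


def probe(host: str, endpoint: str, timeout: float = 3.0):
    """Send an OPTIONS request, as the HAProxy check does.

    Returns the HTTP status code, or None if the node cannot be reached.
    """
    request = urllib.request.Request(endpoint_url(host, endpoint), method="OPTIONS")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the code itself is the result we want.
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, and similar.
        return None
```

Calling `probe("172.29.0.2", "/replica")` and `probe("172.29.0.2", "/primary")` against a running cluster shows which endpoint the node currently answers with 200.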
In some cases, HAProxy prints warnings like the following at startup.
```
[WARNING] (16676) : Server primary/pg0 is DOWN, reason: Layer7 wrong status, code: 503, info: "Service Unavailable", check duration: 1ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[WARNING] (16676) : Server standbys/pg1 is DOWN, reason: Layer7 wrong status, code: 503, info: "Service Unavailable", check duration: 1ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
```
Because HAProxy uses the Patroni REST API endpoints to determine each node's status according to its current role, listing all nodes under every listener, as in the configuration below, causes this warning.
PG nodes role:
```
[postgres@node0 sbin]$ patronictl list
+ Cluster: stampede (7453012617485928545) -----------+----+-----------+
| Member          | Host       | Role    | State     | TL | Lag in MB |
+-----------------+------------+---------+-----------+----+-----------+
| cluster1-0      | 172.29.0.2 | Replica | streaming |  2 |         0 |
| cluster118870-1 | 172.29.0.4 | Replica | streaming |  2 |         0 |
| cluster128215-1 | 172.29.0.3 | Leader  | running   |  2 |           |
+-----------------+------------+---------+-----------+----+-----------+
```
HAProxy http-check config:
```
listen primary
    bind *:5000
    option httpchk OPTIONS /primary
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server pg0 172.29.0.2:5432 maxconn 100 check port 8008
    server pg1 172.29.0.3:5432 maxconn 100 check port 8008
    server pg2 172.29.0.4:5432 maxconn 100 check port 8008

listen standbys
    bind *:5001
    option httpchk OPTIONS /replica
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server pg0 172.29.0.2:5432 maxconn 100 check port 8008
    server pg1 172.29.0.3:5432 maxconn 100 check port 8008
    server pg2 172.29.0.4:5432 maxconn 100 check port 8008
```
When all PostgreSQL nodes are listed under both the primary and the replica httpchk listeners, the Patroni REST API returns a non-200 status code (for example, 503) for every node whose current role (primary or replica) does not match the endpoint being checked.
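The relationship between a node's role and the status code it returns can be sketched as a small lookup. This is a simplified model for illustration, not Patroni's actual implementation: it ignores extra conditions such as replication lag parameters or tags, and the names `expected_status`, `servers_up`, and `ROLES` are hypothetical. The role map pairs the server names from the HAProxy configuration with the roles in the `patronictl list` output above.

```python
def expected_status(role: str, endpoint: str) -> int:
    """Simplified model: 200 when the role matches the endpoint, 503 otherwise."""
    role = role.lower()
    if endpoint == "/primary":
        return 200 if role == "leader" else 503
    if endpoint == "/replica":
        return 200 if role == "replica" else 503
    raise ValueError(f"endpoint not covered by this sketch: {endpoint}")


def servers_up(roles: dict, endpoint: str) -> list:
    """Names of the servers HAProxy would keep UP for a listener checking `endpoint`."""
    return sorted(
        name for name, role in roles.items()
        if expected_status(role, endpoint) == 200
    )


# pg0 = 172.29.0.2, pg1 = 172.29.0.3, pg2 = 172.29.0.4 (from the sample config)
ROLES = {"pg0": "Replica", "pg1": "Leader", "pg2": "Replica"}

for endpoint in ("/primary", "/replica"):
    print(endpoint, "UP:", servers_up(ROLES, endpoint))
```

Under this model, only `pg1` stays UP in the `primary` listener, and `pg0` and `pg2` stay UP in the `standbys` listener; every other server is marked DOWN with a 503.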
You can see these status codes in the Patroni logs, as the example below shows.
Enable DEBUG logging for Patroni (in the Patroni configuration file) to see the API response status messages.
```
log:
  level: DEBUG
```
Patroni debug log:
```
INFO: no action. I am (cluster1-0), a secondary, and following a leader (cluster128215-1)
DEBUG: API thread: 172.29.0.2 - - "OPTIONS /replica HTTP/1.0" 200 - latency: 1.128 ms
DEBUG: API thread: 172.29.0.2 - - "OPTIONS /leader HTTP/1.0" 503 - latency: 1.160 ms
```
In such cases, we see the following warnings in HAProxy logs:
```
[postgres@node0 sbin]$ ./haproxy -W -f haproxy.cfg
[NOTICE] (16674) : New worker (16676) forked
[NOTICE] (16674) : Loading success.
```
HAProxy Warnings:
```
[WARNING] (16676) : Server primary/pg0 is DOWN, reason: Layer7 wrong status, code: 503, info: "Service Unavailable", check duration: 1ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[WARNING] (16676) : Server primary/pg2 is DOWN, reason: Layer7 wrong status, code: 503, info: "Service Unavailable", check duration: 1ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[WARNING] (16676) : Server standbys/pg1 is DOWN, reason: Layer7 wrong status, code: 503, info: "Service Unavailable", check duration: 1ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
```
All nodes, except the current primary/leader node, will show “DOWN” status for the /primary endpoint, and the current primary/leader node will show as “DOWN” for the /replica endpoint.
These [WARNING] messages are harmless and expected; connections through the HAProxy ports should work fine for the servers that match each endpoint.
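To confirm that the DOWN entries are only the expected ones, it can help to group the warnings by listener. The sketch below is a hypothetical helper (the names `DOWN_RE`, `down_servers`, and `SAMPLE` are not from HAProxy) that assumes the warning format shown in this post; the sample lines are shortened after the check duration.

```python
import re
from collections import defaultdict

# Matches the "Server <listener>/<server> is DOWN ... code: <n>" warnings shown above.
DOWN_RE = re.compile(
    r"Server (?P<listener>[^/\s]+)/(?P<server>\S+) is DOWN, "
    r"reason: (?P<reason>[^,]+), code: (?P<code>\d+)"
)


def down_servers(log_text: str) -> dict:
    """Group DOWN servers by listener as (server, status_code) pairs."""
    result = defaultdict(list)
    for match in DOWN_RE.finditer(log_text):
        result[match["listener"]].append((match["server"], int(match["code"])))
    return dict(result)


# Warnings from this post, with the trailing session counters omitted.
SAMPLE = """\
[WARNING] (16676) : Server primary/pg0 is DOWN, reason: Layer7 wrong status, code: 503, info: "Service Unavailable", check duration: 1ms.
[WARNING] (16676) : Server primary/pg2 is DOWN, reason: Layer7 wrong status, code: 503, info: "Service Unavailable", check duration: 1ms.
[WARNING] (16676) : Server standbys/pg1 is DOWN, reason: Layer7 wrong status, code: 503, info: "Service Unavailable", check duration: 1ms.
"""

for listener, servers in down_servers(SAMPLE).items():
    print(listener, servers)
```

For the cluster in this post, the two replicas appear under `primary` and the leader appears under `standbys`, which matches the roles reported by `patronictl list`. A DOWN entry outside that pattern would be worth investigating.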
Further reading: https://patroni.readthedocs.io/en/latest/rest_api.html#health-check-endpoints