In a standard multi-node Postgres replication cluster managed by Patroni, failovers within the cluster are executed automatically. This is not the case, however, for inter-datacentre failovers, for instance when a standby datacentre must take over from a failed primary.
The following describes the mechanisms required to perform such a procedure when the need arises.
Herein two mechanisms are described, both of which administer the cluster’s DCS (Distributed Configuration Store):
- Failover execution via patronictl
- Failover execution via Patroni’s REST API
About patronictl
This is the base command permitting changes to the DCS.
```
patronictl -c /etc/patroni/config.yml edit-config --help
```

```
Usage: patronictl edit-config [OPTIONS] [CLUSTER_NAME]

  Edit cluster configuration

Options:
  --group INTEGER  Citus group
  -q, --quiet      Do not show changes
  -s, --set TEXT   Set specific configuration value. Can be specified
                   multiple times
  -p, --pg TEXT    Set specific PostgreSQL parameter value. Shorthand for
                   -s postgresql.parameters. Can be specified multiple times
  --apply TEXT     Apply configuration from file. Use - for stdin.
  --replace TEXT   Apply configuration from file, replacing existing
                   configuration. Use - for stdin.
  --force          Do not ask for confirmation at any point
  --help           Show this message and exit.
```
About REST API
Patroni’s REST API is integral to Patroni’s leader election process for operations like failovers, switchovers, reinitializations, restarts, and reloads. It can also be employed by load balancers such as HAProxy for HTTP health checks, as well as for monitoring.
Typically the command line tools curl and jq are used when performing operations against the REST API.
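For example, the cluster overview endpoint returns JSON that jq can filter. A minimal sketch, assuming a member is reachable on localhost:8008; the sample JSON stands in for a live response so the jq filter can be demonstrated without a running cluster:

```shell
# A live query would be:
#   curl -s http://localhost:8008/cluster | jq -r '.members[] | select(.role=="leader") | .name'
# The sample below mimics a trimmed /cluster response.
sample='{"members":[{"name":"fradb1v","role":"leader"},{"name":"fradb2v","role":"replica"}]}'

# Extract the name of the current leader from the member list.
echo "$sample" | jq -r '.members[] | select(.role=="leader") | .name'
# → fradb1v
```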
Setup
For the purposes of this documentation, assume the following: two datacentres, each comprising a three-node replication cluster. Frankfurt is currently configured as the PRIMARY datacentre and Amsterdam as the STANDBY.
```
+ Cluster: Frankfurt (7546601820334116441) ------+----+-----------+
| Member  | Host           | Role    | State     | TL | Lag in MB |
+---------+----------------+---------+-----------+----+-----------+
| fradb1v | 10.108.134.168 | Leader  | running   |  2 |           |
| fradb2v | 10.108.134.113 | Replica | streaming |  2 |         0 |
| fradb3v | 10.108.134.150 | Replica | streaming |  2 |         0 |
+---------+----------------+---------+-----------+----+-----------+
```
```
+ Cluster: Amsterdam (7546601820334116441) -+-----------+----+-----------+
| Member  | Host           | Role           | State     | TL | Lag in MB |
+---------+----------------+----------------+-----------+----+-----------+
| amsdb1v | 10.108.134.211 | Standby Leader | streaming |  2 |           |
| amsdb2v | 10.108.134.235 | Replica        | streaming |  2 |         0 |
| amsdb3v | 10.108.134.29  | Replica        | streaming |  2 |         0 |
+---------+----------------+----------------+-----------+----+-----------+
```
Standby Leader Promotion
Two steps are required to promote the standby leader:
- Promote the current standby leader
- Drop the slot used by the remotely connected primary
Version 1: Using patronictl
Promote Standby Leader
The only difference between these two variants is that one is interactive and the other is not.
Variation A
```
# interactive; asks before committing
patronictl -c /etc/patroni/config.yml edit-config --set standby_cluster=null
```
Variation B
```
# forced execution
patronictl -c /etc/patroni/config.yml edit-config --set standby_cluster=null --force
```
Create New Slot
This can be considered an optional step: a replication slot is explicitly created with the understanding that a new standby datacentre, i.e. the failed primary datacentre, is to be provisioned.
```
patronictl -c /etc/patroni/config.yml edit-config --set slots.standby_cluster.type=physical --force
```
Attention: this and the previous command can be combined into a single command; they are split into two only for documentation purposes.
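Combined into one invocation, the promotion and slot creation would look as follows; a sketch only, since the split form above is what this post documents:

```
# Promote and create the replication slot in a single edit-config call.
patronictl -c /etc/patroni/config.yml edit-config \
    --set standby_cluster=null \
    --set slots.standby_cluster.type=physical \
    --force
```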
Version 2: Using REST API
The benefit of this method is that it can be executed from any host able to reach port 8008, the port managed by Patroni, on any of the cluster’s members.
Promote Standby Leader
The host is promoted, recovery completes, and the database becomes read-write.
```
curl -s -XPATCH -d '{"standby_cluster":null}' http://localhost:8008/config | jq .
```
Create New Slot
As per the previously demonstrated example, a new slot is created on the new Leader for the purposes of replicating to a Standby Leader.
```
curl -s -XPATCH -d '{
    "slots":{"standby_cluster":{"type": "physical"}}
}' http://localhost:8008/config | jq .
```
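As with patronictl, the two PATCH calls can be merged into one request. A sketch, assuming the same localhost:8008 endpoint; the jq step merely validates the payload locally before it is sent:

```shell
# Combined payload: promote (standby_cluster=null) and create the
# physical slot for the future standby datacentre in one request.
payload='{"standby_cluster": null, "slots": {"standby_cluster": {"type": "physical"}}}'

# Validate the JSON locally before sending it to Patroni; prints "true".
echo "$payload" | jq -e '.slots.standby_cluster.type == "physical"'

# Send both changes in a single PATCH:
# curl -s -XPATCH -d "$payload" http://localhost:8008/config | jq .
```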
Reprovision Deprecated Primary Datacentre As The New Standby
The instructions below are similar to the previous ones. Execute them on the Patroni cluster “Frankfurt”.
Version 1: patronictl
Provision New Standby Leader
```
patronictl -c /etc/patroni/config.yml edit-config \
    --set standby_cluster.host='amsdb1v,amsdb2v,amsdb3v' \
    --set standby_cluster.port=5432 \
    --set standby_cluster.primary_slot_name=standby_cluster \
    --set standby_cluster.create_replica_methods=basebackup \
    --force
```
Drop Old Slot
Unless there’s an acute reason requiring the slot to remain on the newly promoted Leader it should be removed.
```
patronictl -c /etc/patroni/config.yml edit-config --set slots=null --force
```
Attention: attempting to remove the slot using the PostgreSQL function pg_drop_replication_slot will fail, because Patroni will simply recreate it.
Version 2: REST API
Provision New Standby Leader
```
curl -s -XPATCH -d '{
    "standby_cluster": {
        "host": "amsdb1v,amsdb2v,amsdb3v",
        "port": 5432,
        "primary_slot_name": "standby_cluster",
        "create_replica_methods": [
            "basebackup"
        ]
    }
}' http://localhost:8008/config | jq .
```
Drop Old Slot
```
curl -s -XPATCH -d '{"slots":null}' http://localhost:8008/config | jq .
```
Caveat
- Remember, you are administering two distinct replication clusters i.e. Primary and Standby.
- Using patronictl requires access to one of the hosts in each respective cluster.
- Using the REST API
- presumes Patroni’s port 8008 can be reached by command line tools such as curl.
- because access to the port is required, it presents a potential security risk; therefore either TLS with authentication should be used, or firewall rules must be configured accordingly.
- It goes without saying … watch out for split brain scenarios
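When the REST API is protected as suggested above, the earlier calls need the corresponding transport and credentials. A sketch, assuming Patroni has been configured with HTTPS and basic authentication via its restapi settings; the username, password, and CA certificate path are placeholders:

```
# PATCH over HTTPS with basic auth; --cacert points at the CA that
# signed Patroni's REST API certificate (placeholder path).
curl -s --cacert /etc/patroni/ca.crt \
     -u patroni:secret \
     -XPATCH -d '{"standby_cluster":null}' \
     https://localhost:8008/config | jq .
```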
Finally: here’s a previous blog on Patroni disaster recovery which also explains the basic thinking behind the technology.