In a standard multi-node Postgres replication cluster managed by Patroni, failovers within the cluster are executed automatically. This is not the case for inter-datacentre failovers, for instance when a standby datacentre must take over from a failed primary.

The following describes the mechanisms required to perform such a procedure when the need arises.

Herein, two mechanisms are described, both of which administer the clusters’ DCS:

  1. Failover execution via patronictl
  2. Failover execution via Patroni’s REST API

About patronictl

This is the base command permitting changes to the DCS (Distributed Configuration Store).

patronictl -c /etc/patroni/config.yml edit-config --help
Usage: patronictl edit-config [OPTIONS] [CLUSTER_NAME]

  Edit cluster configuration

Options:
  --group INTEGER  Citus group
  -q, --quiet      Do not show changes
  -s, --set TEXT   Set specific configuration value. Can be specified multiple
                   times
  -p, --pg TEXT    Set specific PostgreSQL parameter value. Shorthand for -s
                   postgresql.parameters. Can be specified multiple times
  --apply TEXT     Apply configuration from file. Use - for stdin.
  --replace TEXT   Apply configuration from file, replacing existing
                   configuration. Use - for stdin.
  --force          Do not ask for confirmation at any point
  --help           Show this message and exit.
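
Before changing anything, the current dynamic configuration held in the DCS can be inspected with show-config, using the same configuration file as throughout this post:

patronictl -c /etc/patroni/config.yml show-config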

About REST API

Patroni’s REST API is integral to Patroni’s leader election process and to operations like failovers, switchovers, reinitializations, restarts, and reloads. It can also be employed by load balancers such as HAProxy for HTTP health checks, as well as for monitoring.

Typically, the command line tools curl and jq are used when performing operations with the REST API.
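
For example, the cluster topology can be read without making any changes (assuming Patroni’s default port 8008 and a recent Patroni version providing the /cluster endpoint):

curl -s http://localhost:8008/cluster | jq .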

Setup

For the purposes of documentation, assume the following: two datacentres, each comprising a three-node replication cluster. Frankfurt is currently configured as the PRIMARY datacentre and Amsterdam as the STANDBY.

+ Cluster: Frankfurt (7546601820334116441) ------+----+-----------+
| Member  | Host           | Role    | State     | TL | Lag in MB |
+---------+----------------+---------+-----------+----+-----------+
| fradb1v | 10.108.134.168 | Leader  | running   |  2 |           |
| fradb2v | 10.108.134.113 | Replica | streaming |  2 |         0 |
| fradb3v | 10.108.134.150 | Replica | streaming |  2 |         0 |
+---------+----------------+---------+-----------+----+-----------+
+ Cluster: Amsterdam (7546601820334116441) -+-----------+----+-----------+
| Member  | Host           | Role           | State     | TL | Lag in MB |
+---------+----------------+----------------+-----------+----+-----------+
| amsdb1v | 10.108.134.211 | Standby Leader | streaming |  2 |           |
| amsdb2v | 10.108.134.235 | Replica        | streaming |  2 |         0 |
| amsdb3v | 10.108.134.29  | Replica        | streaming |  2 |         0 |
+---------+----------------+----------------+-----------+----+-----------+
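
Output of this kind is produced per datacentre with:

patronictl -c /etc/patroni/config.yml list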

Standby Leader Promotion

Two steps are required to promote the standby leader:

  1. Promote the current standby leader
  2. Create a replication slot for the datacentre that will become the new standby

Version 1: Using patronictl

Promote Standby Leader

The only difference between the two variations is that one is interactive and the other is not.

Variation A

# interactive; asks before committing
patronictl -c /etc/patroni/config.yml edit-config --set standby_cluster=null

Variation B

# forced execution
patronictl -c /etc/patroni/config.yml edit-config --set standby_cluster=null --force

Create New Slot

This can be considered optional: a replication slot is explicitly created on the understanding that a new Standby datacentre, i.e. the failed primary datacentre, is to be provisioned.

patronictl -c /etc/patroni/config.yml edit-config \
           --set slots.standby_cluster.type=physical --force

Attention: This and the previous command can be combined into a single command; they were split in two for documentation purposes only. The combined form is shown below.
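
patronictl -c /etc/patroni/config.yml edit-config \
           --set standby_cluster=null \
           --set slots.standby_cluster.type=physical \
           --force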

Version 2: Using REST API

The benefit of this method is that it can be executed from any host that can reach port 8008, which Patroni manages, on any of the cluster’s members.

Promote Standby Leader

The host is promoted: recovery completes and the instance becomes read-write.

curl -s -XPATCH -d '{"standby_cluster":null}' http://localhost:8008/config | jq .
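
A quick way to confirm the promotion is Patroni’s /primary health-check endpoint, which returns HTTP 200 once the node is running as a read-write primary:

curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8008/primary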

Create New Slot

As in the previously demonstrated example, a new slot is created on the new Leader for the purpose of replicating to a Standby Leader.

curl -s -XPATCH -d \
'{
  "slots":{"standby_cluster":{"type": "physical"}}
}' http://localhost:8008/config | jq .
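
The new slot should then appear in pg_replication_slots on the new Leader; a simple check (connection parameters are illustrative):

psql -c "SELECT slot_name, slot_type, active FROM pg_replication_slots;"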

Reprovision Deprecated Primary Datacentre As The New Standby

The following instructions are similar to the previous ones. Execute them in the Patroni cluster “Frankfurt”.

Version 1: patronictl

Provision New Standby Leader

patronictl -c /etc/patroni/config.yml edit-config \
    --set standby_cluster.host='amsdb1v,amsdb2v,amsdb3v' \
    --set standby_cluster.port=5432 \
    --set standby_cluster.primary_slot_name=standby_cluster \
    --set standby_cluster.create_replica_methods=basebackup \
    --force

Drop Old Slot

Unless there’s an acute reason for the slot to remain, the old slot definition should be removed from the demoted Frankfurt cluster’s configuration.

patronictl -c /etc/patroni/config.yml edit-config --set slots=null --force

Attention: attempting to remove the slot with the PostgreSQL function pg_drop_replication_slot() will not succeed, because Patroni will simply recreate it; the definition must be removed from the DCS as shown above.

Version 2: REST API

Provision New Standby Leader

curl -s -XPATCH -d \
'{
  "standby_cluster": {
    "host": "amsdb1v,amsdb2v,amsdb3v",
    "port": 5432,
    "primary_slot_name": "standby_cluster",
    "create_replica_methods": [
      "basebackup"
    ]
  }
}' http://localhost:8008/config | jq .
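
Whether the change was accepted can be confirmed by reading the configuration back:

curl -s http://localhost:8008/config | jq .standby_cluster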

Drop Old Slot

curl -s -XPATCH -d '{"slots":null}' http://localhost:8008/config | jq .

Caveat

  • Remember, you are administering two distinct replication clusters, i.e. Primary and Standby.
  • Using patronictl requires access to one of the hosts in each respective cluster.
  • Using the REST API
    • presumes Patroni’s port 8008 can be reached by command line tools such as curl.
    • Because access to the port is required, it presents a potential security risk: protect it with TLS and password authentication, or configure the firewall rules accordingly (see the sketch after this list).
  • It goes without saying … watch out for split-brain scenarios
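
As a sketch of the security point above: the REST API can be protected via the restapi section of each node’s Patroni configuration. The paths and credentials below are placeholders, not values taken from this setup:

restapi:
  listen: 0.0.0.0:8008
  connect_address: 10.108.134.168:8008   # this node's address
  certfile: /etc/patroni/server.pem      # placeholder path; enables TLS
  keyfile: /etc/patroni/server.key       # placeholder path
  authentication:
    username: patroni                    # placeholder credentials
    password: changeme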

Finally: here’s a previous blog on Patroni disaster recovery which also explains the basic thinking behind the technology.
