In a standard multi-node Postgres replication cluster managed by Patroni, all failovers are executed automatically. However, this is not the case for inter-datacentre failovers, when, for instance, a standby datacentre must take over from a failed primary.

The following describes the mechanisms required to perform such a procedure when the case arises.

Herein, two mechanisms are described, both of which administer the cluster’s DCS:

  1. Failover execution via patronictl
  2. Failover execution via Patroni’s REST API

About patronictl

patronictl is the base command-line tool permitting changes to the DCS.
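As a quick sketch of its use, the following lists a cluster’s members and their roles. The configuration file path and the cluster name are assumptions for illustration:

```shell
# Show the members, roles, states and timelines of the "amsterdam" cluster.
# The path to the Patroni configuration file is an assumed example.
patronictl -c /etc/patroni/patroni.yml list amsterdam
```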

About REST API

Patroni’s REST API is integral to Patroni’s leader election process for operations like failovers, switchovers, reinitializations, restarts, and reloads. It can also be employed by load balancers such as HAProxy for HTTP health checks, as well as for monitoring.

Typically, the command-line tools curl and jq are used when performing operations against the REST API.
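For example, the cluster topology can be queried from any node; the host name below is an assumed example:

```shell
# Query the /cluster endpoint on any node and pretty-print the JSON response,
# which lists all members, their roles and replication state.
curl -s http://pg-ams-1:8008/cluster | jq .
```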

Setup

For the purposes of this documentation, assume the following: two datacentres, each comprising a three-node replication cluster. Frankfurt is currently configured as the PRIMARY datacentre and Amsterdam as the STANDBY.

Standby Leader Promotion

Two steps are required to promote the standby leader:

  1. Promote the current standby leader
  2. Drop the slot used by the remotely connected primary

Version 1: Using patronictl

Promote Standby Leader

The only difference between these two variants is that one is interactive and the other is not.

Variation A
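A sketch of the interactive variant, assuming the Amsterdam cluster is named “amsterdam” and the configuration file path shown (both assumptions): editing the DCS configuration and deleting the standby_cluster section causes Patroni to promote the standby leader.

```shell
# Opens $EDITOR on the cluster's dynamic (DCS) configuration.
# Delete the entire standby_cluster section, save, and confirm when prompted.
patronictl -c /etc/patroni/patroni.yml edit-config amsterdam
```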

Variation B
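The non-interactive variant achieves the same result from a script; cluster name and config path are again assumptions. Setting a key to null removes it from the DCS configuration:

```shell
# Removing the standby_cluster section promotes the standby leader;
# --force suppresses the interactive confirmation prompt.
patronictl -c /etc/patroni/patroni.yml edit-config amsterdam \
    --set standby_cluster=null --force
```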

Create New Slot

This can be considered an optional activity: a replication slot is explicitly created in anticipation of a new Standby datacentre being provisioned, i.e. the failed former primary datacentre.
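One way to declare such a slot is via the slots section of the DCS configuration, whereupon Patroni creates it on the leader. The slot name “frankfurt_standby” is an assumed example:

```shell
# Declare a permanent physical replication slot in the DCS configuration;
# Patroni will create it on the current leader.
patronictl -c /etc/patroni/patroni.yml edit-config amsterdam \
    --set slots.frankfurt_standby.type=physical --force
```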

Attention: This and the previous command can be combined into a single command. They were split into two separate commands for documentation purposes only.

Version 2: Using REST API

The benefit of this method is that it can be executed from any host that can reach port 8008, which is managed by Patroni, on any of the cluster’s nodes.

Promote Standby Leader

The host is promoted: recovery completes and the instance becomes read-write.
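A sketch of the promotion via the REST API, with an assumed host name: patching the dynamic configuration with a null standby_cluster removes that section, which triggers the promotion.

```shell
# PATCH the dynamic configuration on any reachable node of the Amsterdam
# cluster; removing the standby_cluster section promotes the standby leader.
curl -s -XPATCH -H 'Content-Type: application/json' \
    -d '{"standby_cluster": null}' \
    http://pg-ams-1:8008/config | jq .
```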

Create New Slot

As per the previous example, a new slot is created on the new Leader for the purpose of replicating to a Standby Leader.
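The same slot declaration can be patched into the DCS configuration over HTTP; the host and slot names are assumed examples:

```shell
# Declare a permanent physical replication slot via the /config endpoint;
# Patroni creates the slot on the new leader.
curl -s -XPATCH -H 'Content-Type: application/json' \
    -d '{"slots": {"frankfurt_standby": {"type": "physical"}}}' \
    http://pg-ams-1:8008/config | jq .
```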

Reprovision Deprecated Primary Datacentre As The New Standby

The following instructions are somewhat similar to the previous ones. Execute the following in Patroni cluster “Frankfurt”.

Version 1: patronictl

Provision New Standby Leader
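A sketch of the demotion, under the assumption that the Frankfurt cluster is named “frankfurt” and that the new Amsterdam primary is reachable at the host and slot names shown: adding a standby_cluster section to Frankfurt’s DCS configuration turns it into a standby cluster replicating from Amsterdam.

```shell
# Point the Frankfurt cluster at the new Amsterdam primary; host, port and
# slot name are assumed examples and must match your environment.
patronictl -c /etc/patroni/patroni.yml edit-config frankfurt \
    --set standby_cluster.host=pg-ams-1 \
    --set standby_cluster.port=5432 \
    --set standby_cluster.primary_slot_name=frankfurt_standby \
    --force
```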

Drop Old Slot

Unless there’s a compelling reason for the slot to remain on the newly promoted Leader, it should be removed.

Attention: attempting to remove the slot using the Postgres function pg_drop_replication_slot() will fail, because Patroni will simply recreate it.
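Because Patroni recreates slots it finds in the DCS, the removal must go through the configuration instead. A sketch, assuming the obsolete slot is named “amsterdam_standby” (the actual name depends on how the old replication was configured):

```shell
# Remove the obsolete slot definition from the DCS configuration;
# Patroni then drops the slot itself instead of recreating it.
patronictl -c /etc/patroni/patroni.yml edit-config frankfurt \
    --set slots.amsterdam_standby=null --force
```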

Version 2: REST API

Provision New Standby Leader
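The equivalent demotion over the REST API, with assumed host and slot names, patches a standby_cluster section into the Frankfurt cluster’s dynamic configuration:

```shell
# Demote the Frankfurt cluster to a standby cluster replicating from the
# new Amsterdam primary; all names shown are assumed examples.
curl -s -XPATCH -H 'Content-Type: application/json' \
    -d '{"standby_cluster": {"host": "pg-ams-1", "port": 5432, "primary_slot_name": "frankfurt_standby"}}' \
    http://pg-fra-1:8008/config | jq .
```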

Drop Old Slot
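As with the patronictl variant, patching the slot entry to null removes it from the DCS so that Patroni drops the slot rather than recreating it. The host and slot names are assumed examples:

```shell
# Remove the obsolete slot definition via the /config endpoint.
curl -s -XPATCH -H 'Content-Type: application/json' \
    -d '{"slots": {"amsterdam_standby": null}}' \
    http://pg-fra-1:8008/config | jq .
```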

Caveat

  • Remember, you are administering two distinct replication clusters i.e. Primary and Standby.
  • Using patronictl requires access to one of the hosts in each respective cluster.
  • Using the REST API
    • presumes Patroni’s port 8008 can be reached by command-line tools such as curl;
    • because access to the port is required, it presents a potential security risk; therefore either TLS with authentication should be used, or the firewall rules must be configured accordingly.
  • It goes without saying: watch out for split-brain scenarios.

Finally: here’s a previous blog on Patroni disaster recovery which also explains the basic thinking behind the technology.
