This post was originally published in 2023, and we’ve updated it in 2025 for clarity and relevance.

Downtime is more than an inconvenience. For many organizations, even a short outage can mean lost revenue, broken customer trust, or compliance issues. PostgreSQL is a cornerstone for critical applications, and disaster recovery (DR) is essential when running it across regions or clouds.

But here’s the challenge: Kubernetes adds complexity. Managing DR across multi-cloud or hybrid environments can become overwhelming fast. Manual processes don’t scale, and traditional approaches often fall short. This is where Percona Operator for PostgreSQL comes in. It automates repeatable DR processes and reduces the operational burden, giving you a reliable way to keep PostgreSQL clusters resilient.

In this guide, you’ll learn how to set up DR with Percona Operator for PostgreSQL (v2) using a repo-based standby, the simplest approach to getting started.

Overview of the solution

Operators are built to handle routine database management tasks, so your team doesn’t have to. For DR, the Operator supports three standby options:

  • pgBackRest repo-based standby 
  • Streaming replication 
  • A combination of both

We’ll walk through the repo-based standby, which is straightforward and widely used. Here’s the setup at a high level:

  1. Two Kubernetes clusters in different regions, clouds, or hybrid mode (on-premises + cloud). One is the Main site, the other is Disaster Recovery (DR). 
  2. Each cluster runs: 
    • Percona Operator 
    • PostgreSQL cluster 
    • pgBackRest 
    • pgBouncer 
  3. pgBackRest on the Main site streams backups and Write-Ahead Logs (WALs) to object storage. 
  4. pgBackRest on the DR site pulls these backups and streams them to the standby cluster.

Setting up the main site backups

Start by deploying the Operator using your preferred method from the documentation. Once it’s installed, configure the Custom Resource manifest so pgBackRest can use your chosen object storage. (If you’ve already configured this, you can skip ahead.)

Here’s an example setup for Google Cloud Storage (GCS):
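The fragment below is a minimal sketch of the backup-related part of a PerconaPGCluster manifest; the cluster name, bucket, schedule, and Secret name are placeholders, and the rest of the spec (instances, proxy, and so on) is omitted:

```yaml
apiVersion: pgv2.percona.com/v2
kind: PerconaPGCluster
metadata:
  name: main
spec:
  backups:
    pgbackrest:
      configuration:
        # Secret holding the GCS credentials (see below)
        - secret:
            name: main-pgbackrest-secrets
      repos:
        - name: repo1
          gcs:
            # Placeholder bucket name; replace with your own
            bucket: MY-BUCKET
          schedules:
            # Optional: take a full backup every day at midnight
            full: "0 0 * * *"
```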

The main-pgbackrest-secrets Secret holds your GCS keys. You can find more details in the backup and restore tutorial.
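If you haven’t created that Secret yet, one way to do it is shown below; the key file path and the gcs-key.json entry name are assumptions, so check the backup and restore tutorial for the exact keys your Operator version expects:

```bash
# Create the Secret holding the GCS service account key
# (namespace and file paths are placeholders)
kubectl create secret generic main-pgbackrest-secrets \
  --from-file=gcs-key.json=/path/to/service-account-key.json \
  -n postgres-operator
```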

Once your configuration is ready, apply the custom resource:
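For example, if the manifest above is saved as main-cr.yaml (a placeholder file name):

```bash
# Apply the Main site cluster manifest (namespace is a placeholder)
kubectl apply -f main-cr.yaml -n postgres-operator
```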

At this point, your backups should appear in your object storage bucket. By default, pgBackRest saves them under the pgbackrest folder.
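You can verify this from the GCS side, for example with gsutil (the bucket name is a placeholder):

```bash
# List the backup repository written by pgBackRest
gsutil ls gs://MY-BUCKET/pgbackrest/
```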

Configuring the disaster recovery site

The DR site is nearly identical to the Main site, except for the standby settings. Below, you’ll see that standby.enabled is set to true, and the repoName points to the same backup repository (GCS in this example):
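Here is a minimal sketch of the standby cluster’s manifest; only the standby- and backup-related parts of the spec are shown, and the cluster, Secret, and bucket names are placeholders:

```yaml
apiVersion: pgv2.percona.com/v2
kind: PerconaPGCluster
metadata:
  name: standby
spec:
  # Run this cluster as a standby that restores from the shared repo
  standby:
    enabled: true
    repoName: repo1
  backups:
    pgbackrest:
      configuration:
        - secret:
            name: standby-pgbackrest-secrets
      repos:
        - name: repo1
          gcs:
            # Same bucket the Main site writes to
            bucket: MY-BUCKET
```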

Deploy the standby cluster with:
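Assuming the manifest above is saved as standby-cr.yaml (again, a placeholder name):

```bash
kubectl apply -f standby-cr.yaml -n postgres-operator
```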

Now your standby cluster is continuously synced with the Main site’s backups.

Promoting the standby cluster during failover

When the Main site goes down, or in other failover scenarios, you can promote the standby cluster. Promotion makes the standby writable, effectively turning it into the new primary.

One critical safeguard: if both clusters write to the same repository, you risk a split-brain scenario. Always delete or shut down the original primary before promoting the standby.
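One way to do that, assuming the Main site was created from main-cr.yaml and its Kubernetes API is still reachable, is simply to delete its custom resource:

```bash
# Remove the Main site cluster so it can no longer push WAL to the shared repo
kubectl delete -f main-cr.yaml -n postgres-operator
```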

To promote, update the manifest to disable standby mode:
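In the standby cluster’s manifest, that means flipping the flag you set earlier:

```yaml
spec:
  standby:
    enabled: false
```

Then re-apply the manifest:

```bash
kubectl apply -f standby-cr.yaml -n postgres-operator
```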

Once this is applied, your standby cluster is promoted and ready to handle writes.

Avoiding split-brain scenarios

If your old primary accidentally comes back online and starts writing, you’ll need to recover carefully. Here’s the safe sequence:

  • Keep only the newest primary running with the latest data.
  • Stop writes on the other cluster immediately.
  • Take a fresh full backup from the active primary and upload it to the repo (see the example below).

This ensures your environment is consistent and avoids corruption.
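For that last step, here is a sketch of an on-demand full backup using the PerconaPGBackup custom resource; the resource names and the --type option are assumptions, so check your Operator version’s documentation:

```yaml
apiVersion: pgv2.percona.com/v2
kind: PerconaPGBackup
metadata:
  name: fresh-full-backup
spec:
  # Name of the currently active primary cluster (placeholder)
  pgCluster: standby
  repoName: repo1
  options:
    # Request a full backup rather than an incremental one
    - --type=full
```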

Automating failover for faster recovery

Manual failover works, but for mission-critical workloads, automation keeps your Recovery Time Objective (RTO) low.

Full automation is beyond the Operator’s scope, but here are steps you can take:

  • Add a monitoring site: A third cluster can monitor both Main and DR, reducing the risk of confusing a network split with a true outage. 
  • Automate traffic routing: Shift application traffic to the standby after promotion. Common approaches include: 
    • Global Load Balancers (offered by most cloud providers) 
    • Multi-Cluster Services (MCS) 
    • Federation or other multi-cluster networking solutions

The right choice depends on how your application and networking are designed, but building automation upfront pays off in faster recovery.
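As a purely illustrative sketch (not a production failover controller), the promotion itself can be scripted. Everything below, from kubeconfig paths to manifest names, is a placeholder, and a real setup needs fencing and quorum checks, such as the third monitoring site mentioned above, before acting on a failed health check:

```bash
#!/usr/bin/env bash
# Naive watchdog: promote the DR cluster if the Main site stops responding.
# Illustrative only; do not use as-is for production failover.
MAIN_KUBECONFIG=$HOME/.kube/main   # placeholder kubeconfig for the Main site
DR_KUBECONFIG=$HOME/.kube/dr       # placeholder kubeconfig for the DR site

while true; do
  if ! kubectl --kubeconfig "$MAIN_KUBECONFIG" get nodes >/dev/null 2>&1; then
    echo "Main site unreachable, promoting the standby cluster"
    # standby-cr-promoted.yaml is the standby manifest with standby.enabled: false
    kubectl --kubeconfig "$DR_KUBECONFIG" apply -f standby-cr-promoted.yaml -n postgres-operator
    break
  fi
  sleep 30
done
```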

Final thoughts: Making PostgreSQL disaster recovery Kubernetes-ready

Disaster recovery is how you keep your business running through outages. Running PostgreSQL on Kubernetes adds layers of complexity, but with Percona Operator, you get a repeatable way to protect data across regions and clouds.

We’ve walked through a simple repo-based standby setup, but the bigger picture is clear: with the right tools and planning, DR becomes a strength, not a stress point.

If you want to go deeper into building a production-ready PostgreSQL environment on Kubernetes, one that balances high availability, automation, and cost control, check out our guide on becoming Kubernetes ready with Percona.

 
