A Nightmare no DBA wants
One day, I had a nightmare, the kind every DBA fears.
My manager called me and said,
“Hey… Mr. Clever. We don’t need you on the team anymore.”
I panicked.
“What did I do wrong?!”
He replied,
“You tell me. Think about your ‘great’ work yesterday.”
Trying to defend myself, I said proudly:
“Sir, one node was down. PostgreSQL service was inactive, so I started it… and the node came back up!
Pretty smart, right?”
He sighed, paused for dramatic effect, and said,
“Get well soon.”
Then he hung up.
I woke up instantly, my heart racing.
Not because of the boss,
But because the real horror hit me:
We are running a Patroni cluster…
…and I started PostgreSQL manually in my dream.
That’s not “clever work.”
That’s a database-level crime, punishable by
“Why-is-my-cluster-broken-now?”
and
“Why-are-my-replicas-on-a-different-timeline?”
and possibly
“Why-is-ETCD-looking-at-me-like-that?”
Thankfully, it was only a nightmare.
This blog aims to help prevent anyone from living that nightmare: it explains what happened, why it’s dangerous, and how to avoid it.
Understanding Patroni’s Role in the Cluster
Patroni is responsible for managing:
- Startup and shutdown of PostgreSQL
- Replication configuration
- Leader election and failover/fencing decisions
- WAL (Write-Ahead Logging) timeline coordination
- Recovery decisions and cluster consistency
- Coordinating cluster state with the DCS (e.g., ETCD)
- Preventing divergence and corruption
What Actually Happens When PostgreSQL Starts Outside Patroni
In a Patroni-managed high-availability (HA) cluster, you should never start PostgreSQL manually in a running cluster, whether through the postgresql OS service or pg_ctl.
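For clarity, here is a minimal sketch contrasting the wrong and the right way to operate the instance (unit and cluster names match the test environment used later in this post):

# Wrong: starts PostgreSQL behind Patroni's back and creates a rogue instance
systemctl start postgresql-17.service

# Right: let Patroni start and supervise PostgreSQL
systemctl start patroni.service

# Right: restart a single member in a controlled way, through Patroni
patronictl -c /etc/patroni/patroni.yml restart postgres node1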
If you start PostgreSQL manually while Patroni is down, the node is no longer under Patroni’s control and begins operating as a standalone instance.
However, this does not immediately create a conflicting timeline.
- A conflicting (divergent) timeline only occurs if a failover or switchover happens while the node is standalone.
- If the manually started node was previously a replica and no failover occurred, it may continue replicating from the primary, but it is invisible to Patroni and cannot participate in coordinated HA.
- If the manually started node was the former leader, and Patroni elects a new leader elsewhere, only then does a true timeline divergence occur.
- Standalone replica + no failover → replication may continue, no new timeline
- Standalone leader + failover occurs → timeline divergence, requiring pg_rewind or full reinit
Takeaway: A manually started primary is the most dangerous scenario; it can immediately lead to a divergent timeline if a new leader is elected. A manually started replica is still dangerous, but divergence occurs only if a leader change happens while it runs standalone.
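If you suspect a node has been started outside Patroni, two quick checks help: compare the timeline recorded in pg_control with what the rest of the cluster reports, and ask Patroni’s REST API (port 8008 by default) what it thinks the member is doing. A minimal sketch:

# Timeline of the local instance, as recorded in pg_control
sudo -u postgres psql -Atc "SELECT timeline_id FROM pg_control_checkpoint();"

# Member status according to Patroni's REST API (default port 8008);
# if PostgreSQL answers on 5432 while Patroni does not answer here, you have a rogue instance
curl -s http://192.168.183.131:8008/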
Before diving deeper, I’d like to share a test with you to provide a better understanding of potential outcomes.
Detailed Test Case:
Operating System: Red Hat Enterprise Linux release 9.6 (Plow)

Cluster Topology:

Node     IP Address        Role (Initial)
node1    192.168.183.131   Leader
node2    192.168.183.129   Replica
node3    192.168.183.130   Replica

PostgreSQL Version: 17.7
Patroni: Active on all nodes
DCS: etcd 3-node cluster
Note: Verify ETCD major version compatibility. Older Linux distributions may still ship legacy ETCD releases that are incompatible with modern Patroni versions.
Replication: Streaming
pg_rewind: Enabled
Capture Baseline State
[root@node1 ~]# patronictl -c /etc/patroni/patroni.yml list
+ Cluster: postgres (7575009857136431951) -------+----+-------------+-----+------------+-----+
| Member | Host            | Role    | State     | TL | Receive LSN | Lag | Replay LSN | Lag |
+--------+-----------------+---------+-----------+----+-------------+-----+------------+-----+
| node1  | 192.168.183.131 | Leader  | running   |  4 |             |     |            |     |
| node2  | 192.168.183.129 | Replica | streaming |  4 | 0/5055240   |   0 | 0/5055240  |   0 |
| node3  | 192.168.183.130 | Replica | streaming |  4 | 0/5055240   |   0 | 0/5055240  |   0 |
+--------+-----------------+---------+-----------+----+-------------+-----+------------+-----+
This confirms cluster health: replicas have received and replayed WAL up to the leader’s current LSN and all nodes are on the same timeline.
Crash Test & Failure Simulation:
Now, let’s simulate the crash on node1:
1] Kill Patroni on node1 (the current leader)
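One way to simulate the Patroni crash (an illustrative command, not necessarily the exact one used in this test):

[root@node1 ~]# systemctl kill --signal=SIGKILL patroni.service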
2] Start PostgreSQL manually:
[root@node1 ~]# systemctl start postgresql-17.service
3] Node1 is now a standalone primary on timeline 4.
4] Meanwhile, Patroni promotes node3 → timeline 5.
Patroni now shows the changed cluster state:
[root@node1 ~]# patronictl -c /etc/patroni/patroni.yml list
+ Cluster: postgres (7575009857136431951) -------+----+-------------+-----+------------+-----+
| Member | Host            | Role    | State     | TL | Receive LSN | Lag | Replay LSN | Lag |
+--------+-----------------+---------+-----------+----+-------------+-----+------------+-----+
| node2  | 192.168.183.129 | Replica | streaming |  5 | 0/5055380   |   0 | 0/5055380  |   0 |
| node3  | 192.168.183.130 | Leader  | running   |  5 |             |     |            |     |
+--------+-----------------+---------+-----------+----+-------------+-----+------------+-----+
From the output above, node1 is missing from the cluster view: Patroni is not running on it because, rather than starting the Patroni service, I started the PostgreSQL service directly, so Patroni has no knowledge of that instance. Node3 is now the leader and node2 is its replica, both on timeline 5, while node1 continues as a standalone primary on timeline 4; the timelines have diverged.
Divergence Demonstration
Writes on standalone node1
postgres=# SELECT timeline_id FROM pg_control_checkpoint();
 timeline_id
-------------
           4
(1 row)

postgres=# SELECT pg_current_wal_lsn();
 pg_current_wal_lsn
--------------------
 0/50552F0
(1 row)
Now perform transactions on the standalone node1:
postgres=# create table crash_patroni (id int);
CREATE TABLE

postgres=# SELECT pg_current_wal_lsn();
 pg_current_wal_lsn
--------------------
 0/506EDF8
(1 row)

postgres=# insert into crash_patroni values(1), (2), (3);
INSERT 0 3

postgres=# SELECT pg_current_wal_lsn();
 pg_current_wal_lsn
--------------------
 0/5070E68
(1 row)
Meanwhile, on the new leader (node3):
postgres=# create table crash_new_leader (id int);
CREATE TABLE

postgres=# SELECT pg_current_wal_lsn();
 pg_current_wal_lsn
--------------------
 0/5070B80
(1 row)
Node2 replicates node3’s changes normally (the three rows shown below were inserted into crash_new_leader on node3; that INSERT is not captured above):
postgres=# select * from crash_new_leader;
 id
----
  1
  2
  3
(3 rows)
Patroni listing on node3:
[root@node3 ~]# patronictl -c /etc/patroni/patroni.yml list
+ Cluster: postgres (7575009857136431951) -------+----+-------------+-----+------------+-----+
| Member | Host            | Role    | State     | TL | Receive LSN | Lag | Replay LSN | Lag |
+--------+-----------------+---------+-----------+----+-------------+-----+------------+-----+
| node2  | 192.168.183.129 | Replica | streaming |  5 | 0/5070B80   |   0 | 0/5070B80  |   0 |
| node3  | 192.168.183.130 | Leader  | running   |  5 |             |     |            |     |
+--------+-----------------+---------+-----------+----+-------------+-----+------------+-----+
Rejoining node1 (start Patroni)
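To bring node1 back under Patroni’s control, stop the manually started instance and start the Patroni service again (a sketch; unit names as used elsewhere in this post):

[root@node1 ~]# systemctl stop postgresql-17.service
[root@node1 ~]# systemctl start patroni.service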
When Patroni is restarted on node1, it detects the divergence and runs pg_rewind to synchronize the datadir with the new leader:
[root@node1 ~]# patronictl list
node1 | Replica | TL 5
Node1 is rewound and joins as a replica.
[root@node1 ~]# patronictl -c /etc/patroni/patroni.yml list
+ Cluster: postgres (7575009857136431951) -------+----+-------------+-----+------------+-----+
| Member | Host            | Role    | State     | TL | Receive LSN | Lag | Replay LSN | Lag |
+--------+-----------------+---------+-----------+----+-------------+-----+------------+-----+
| node1  | 192.168.183.131 | Replica | streaming |  5 | 0/5070CC0   |   0 | 0/5070CC0  |   0 |
| node2  | 192.168.183.129 | Replica | streaming |  5 | 0/5070CC0   |   0 | 0/5070CC0  |   0 |
| node3  | 192.168.183.130 | Leader  | running   |  5 |             |     |            |     |
+--------+-----------------+---------+-----------+----+-------------+-----+------------+-----+
Now check the table created earlier on node1:
postgres=# select * from crash_patroni;
ERROR:  relation "crash_patroni" does not exist
The table created on node3 (crash_new_leader) is now present on node1:
postgres=# select * from crash_new_leader;
 id
----
  1
  2
  3
(3 rows)
What happened? Let me explain.
Patroni detected:
- Node1 was on timeline 4
- The cluster progressed to timeline 5 (via promotion of node3)
- Divergence occurred
Patroni invoked pg_rewind:
pg_rewind: servers diverged at WAL location...
pg_rewind: rewinding...
pg_rewind: Done!
As a result:
- Node1’s data directory was rewritten to match the new leader (node3)
- The table crash_patroni created while node1 was standalone was lost (data loss)
- Node1 successfully rejoined as a replica on timeline 5
- Writes performed during standalone mode were erased
Pg_rewind: prerequisites and failure modes
pg_rewind is usually the preferred way to repair a diverged node because it:
- Resets the diverging timeline
- Reconstructs the datadir to match the new leader
- Rejoins the node without requiring a full base backup
Important prerequisites:
- wal_log_hints = on OR data checksums must be enabled.
Patroni typically ensures wal_log_hints = on for you, but always verify the setting, especially on manually built clusters or upgraded nodes.
If pg_rewind:
- Is disabled
- Is misconfigured
- Fails due to missing WAL segments or a catalog mismatch
- Crashes unexpectedly
then reinitialization (a full base backup + restore) is the only recovery option.
Key Risks and Considerations
1. Patroni must be the sole controller
Manual PostgreSQL start = Rogue instance.
2. Standalone writes WILL be lost
Every operation performed in standalone mode disappears once the node is rewound to rejoin the cluster.
3. Split-brain–like symptoms
Even if not a true split-brain, symptoms include:
- Diverged timelines
- Orphaned tables
- Inconsistent LSNs
- Confusing recovery decisions
4. Operational confusion
Different nodes may disagree on whether a table exists.
Precautions: practical steps
Mask the OS service
Prevent accidental systemctl start postgresql-XX.service by masking the unit:
sudo systemctl mask postgresql-XX.service
Monitor both Patroni and PostgreSQL services
Create alerts for:
- Unexpected postgresql-XX.service = active while patroni.service is down (see the sketch after this list)
- Missing Patroni heartbeat
- Timeline divergence detection
- Replicas not receiving WAL
- Unexpected PID or process-name mismatches
- pg_rewind events in Patroni logs
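A minimal sketch of the rogue-instance alert from the first bullet (unit names as used elsewhere in this post; wire the output into your alerting tool of choice):

#!/bin/bash
# Minimal rogue-instance check: PostgreSQL active while Patroni is not.
if systemctl is-active --quiet postgresql-17.service && \
   ! systemctl is-active --quiet patroni.service; then
    echo "CRITICAL: PostgreSQL is running outside Patroni on $(hostname)"
    exit 2
fi
echo "OK: no rogue PostgreSQL instance detected"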
Validate pg_rewind prerequisites
Confirm wal_log_hints = on or data checksums are enabled before taking cluster nodes into production.
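A quick way to verify both settings on each node (a sketch, run as the postgres OS user via psql):

sudo -u postgres psql -Atc "SHOW wal_log_hints;"     # expected: on (unless data checksums are enabled)
sudo -u postgres psql -Atc "SHOW data_checksums;"    # on if the cluster was initialized with --data-checksums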
Check ETCD compatibility
Ensure your ETCD major version is compatible with your Patroni release; older distributions may contain legacy ETCD packages.
Use port-based health checks
Check for a listener on port 5432 (with netstat or ss) rather than matching process names, which differ across heterogeneous OS families.
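For example, a simple liveness probe based on the listening port rather than the process name (assuming the default port 5432):

# Succeeds if anything is listening on the PostgreSQL port, regardless of how the process is named
# (netstat -ltn works the same way where ss is unavailable)
ss -ltn | grep -q ':5432 ' && echo "OK: port 5432 is listening" || echo "CRITICAL: nothing listening on 5432"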
Backup when in doubt
If you suspect standalone writes occurred before restarting Patroni, take a filesystem-level backup of the datadir if possible — this can help forensic investigation.
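A minimal sketch (the data directory path is the default for the PGDG RHEL packages and may differ in your environment):

# Stop the manually started instance first so the copy is consistent,
# then preserve the standalone data directory before Patroni rewinds it.
systemctl stop postgresql-17.service
cp -a /var/lib/pgsql/17/data /var/lib/pgsql/17/data.standalone.$(date +%Y%m%d%H%M%S)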
Early detection is essential to prevent data loss.
Summary
Starting PostgreSQL manually in a Patroni-managed HA cluster is not a harmless mistake — it is a data-consistency risk. It can cause:
- Timeline divergence
- Split-brain–like behavior
- Lost transactions and DDL
- Cluster confusion and administrative overhead
- Forced pg_rewind or full reinitialization
Follow the precautions above, make sure the pg_rewind prerequisites are met, monitor both Patroni and PostgreSQL, and always start PostgreSQL via Patroni. Do that, and you’ll greatly reduce the risk of waking up to a DBA nightmare.
If you’re using a Patroni cluster, always start PostgreSQL through the Patroni service. Your database will appreciate it. Let’s keep everything running smoothly!