A Nightmare no DBA wants
One day, I had a nightmare, the kind every DBA fears.
My manager called me and said,
“Hey… Mr. Clever. We don’t need you on the team anymore.”
I panicked.
“What did I do wrong?!”
He replied,
“You tell me. Think about your ‘great’ work yesterday.”
Trying to defend myself, I said proudly:
“Sir, one node was down. PostgreSQL service was inactive, so I started it… and the node came back up!
Pretty smart, right?”
He sighed, paused for dramatic effect, and said,
“Get well soon.”
Then he hung up.
I woke up instantly, my heart racing.
Not because of the boss,
But because the real horror hit me:
We are running a Patroni cluster…
…and I started PostgreSQL manually in my dream.
That’s not “clever work.”
That’s a database-level crime, punishable by
“Why-is-my-cluster-broken-now?”
and
“Why-are-my-replicas-on-a-different-timeline?”
and possibly
“Why-is-ETCD-looking-at-me-like-that?”
Thankfully, it was only a nightmare.
This blog aims to help prevent anyone from living that nightmare: it explains what happened, why it’s dangerous, and how to avoid it.
Understanding Patroni’s Role in the Cluster
Patroni is responsible for managing:
- Startup and shutdown of PostgreSQL
- Replication configuration
- Leader election and failover/fencing decisions
- WAL (Write-Ahead Logging) timeline coordination
- Recovery decisions and cluster consistency
- Coordinating cluster state with the DCS (e.g., ETCD)
- Preventing divergence and corruption
What Actually Happens When PostgreSQL Starts Outside Patroni
In a Patroni-managed high-availability (HA) cluster, you should never start PostgreSQL manually in a running cluster, whether through the postgresql OS service or pg_ctl.
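For clarity, here is a minimal sketch contrasting the wrong and the right way to operate the instance (unit and cluster names match the test environment used later in this post):

# Wrong: starts PostgreSQL behind Patroni's back and creates a rogue instance
systemctl start postgresql-17.service

# Right: let Patroni start and supervise PostgreSQL
systemctl start patroni.service

# Right: restart a single member in a controlled way, through Patroni
patronictl -c /etc/patroni/patroni.yml restart postgres node1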
If you start PostgreSQL manually while Patroni is down, the node is no longer under Patroni’s control and begins operating as a standalone instance.
However, this does not immediately create a conflicting timeline.
- A conflicting (divergent) timeline only occurs if a failover or switchover happens while the node is standalone.
- If the manually started node was previously a replica and no failover occurred, it may continue replicating from the primary, but it is invisible to Patroni and cannot participate in coordinated HA.
- If the manually started node was the former leader, and Patroni elects a new leader elsewhere, only then does a true timeline divergence occur.
- Standalone replica + no failover → replication may continue, no new timeline
- Standalone leader + failover occurs → timeline divergence, requiring pg_rewind or full reinit
Takeaway: A manually started primary is the most dangerous scenario; it can immediately lead to a divergent timeline if a new leader is elected. A manually started replica is still dangerous, but divergence occurs only if a leader change happens while it runs standalone.
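If you suspect a node has been started outside Patroni, two quick checks help: compare the timeline recorded in pg_control with what the rest of the cluster reports, and ask Patroni’s REST API (port 8008 by default) what it thinks the member is doing. A minimal sketch:

# Timeline of the local instance, as recorded in pg_control
sudo -u postgres psql -Atc "SELECT timeline_id FROM pg_control_checkpoint();"

# Member status according to Patroni's REST API (default port 8008);
# if PostgreSQL answers on 5432 while Patroni does not answer here, you have a rogue instance
curl -s http://192.168.183.131:8008/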
Before diving deeper, I’d like to share a test with you to provide a better understanding of potential outcomes.
Detailed Test Case:
Operating System: Red Hat Enterprise Linux release 9.6 (Plow)

Cluster Topology:

Node     IP Address        Role (Initial)
node1    192.168.183.131   Leader
node2    192.168.183.129   Replica
node3    192.168.183.130   Replica

PostgreSQL Version: 17.7
Patroni: Active on all nodes
DCS: etcd 3-node cluster
Note: Verify ETCD major version compatibility. Older Linux distributions may still ship legacy ETCD releases that are incompatible with modern Patroni versions.
Replication: Streaming
pg_rewind: Enabled
Capture Baseline State
[root@node1 ~]# patronictl -c /etc/patroni/patroni.yml list
+ Cluster: postgres (7575009857136431951) -------+----+-------------+-----+------------+-----+
| Member | Host            | Role    | State     | TL | Receive LSN | Lag | Replay LSN | Lag |
+--------+-----------------+---------+-----------+----+-------------+-----+------------+-----+
| node1  | 192.168.183.131 | Leader  | running   |  4 |             |     |            |     |
| node2  | 192.168.183.129 | Replica | streaming |  4 | 0/5055240   |   0 | 0/5055240  |   0 |
| node3  | 192.168.183.130 | Replica | streaming |  4 | 0/5055240   |   0 | 0/5055240  |   0 |
+--------+-----------------+---------+-----------+----+-------------+-----+------------+-----+
This confirms cluster health: replicas have received and replayed WAL up to the leader’s current LSN and all nodes are on the same timeline.
Crash Test & Failure Simulation:
Now, let’s simulate the crash on node1:
1] Kill Patroni on node1 (the current leader)
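One way to simulate the Patroni crash (an illustrative command, not necessarily the exact one used in this test):

[root@node1 ~]# systemctl kill --signal=SIGKILL patroni.service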
2] Start PostgreSQL manually:
[root@node1 ~]# systemctl start postgresql-17.service
3] Node1 is now a standalone primary on timeline 4.
4] Meanwhile, Patroni promotes node3 → timeline 5.
Patroni now shows the changed cluster state:
[root@node1 ~]# patronictl -c /etc/patroni/patroni.yml list
+ Cluster: postgres (7575009857136431951) -------+----+-------------+-----+------------+-----+
| Member | Host            | Role    | State     | TL | Receive LSN | Lag | Replay LSN | Lag |
+--------+-----------------+---------+-----------+----+-------------+-----+------------+-----+
| node2  | 192.168.183.129 | Replica | streaming |  5 | 0/5055380   |   0 | 0/5055380  |   0 |
| node3  | 192.168.183.130 | Leader  | running   |  5 |             |     |            |     |
+--------+-----------------+---------+-----------+----+-------------+-----+------------+-----+
From the output above, node1 is missing from the cluster view: Patroni is not running on it because, rather than starting the Patroni service, I started the PostgreSQL service directly, so Patroni has no knowledge of that instance. Node3 is now the leader and node2 is its replica, both on timeline 5, while node1 continues as a standalone primary on timeline 4; the timelines have diverged.
Divergence Demonstration
Writes on standalone node1
postgres=# SELECT timeline_id FROM pg_control_checkpoint();
 timeline_id
-------------
           4
(1 row)

postgres=# SELECT pg_current_wal_lsn();
 pg_current_wal_lsn
--------------------
 0/50552F0
(1 row)
Now perform transactions on the standalone node1:
postgres=# create table crash_patroni (id int);
CREATE TABLE

postgres=# SELECT pg_current_wal_lsn();
 pg_current_wal_lsn
--------------------
 0/506EDF8
(1 row)

postgres=# insert into crash_patroni values(1), (2), (3);
INSERT 0 3

postgres=# SELECT pg_current_wal_lsn();
 pg_current_wal_lsn
--------------------
 0/5070E68
(1 row)
Meanwhile, on the new leader (node3):
postgres=# create table crash_new_leader (id int);
CREATE TABLE

postgres=# SELECT pg_current_wal_lsn();
 pg_current_wal_lsn
--------------------
 0/5070B80
(1 row)
Node2 replicates node3’s changes normally (the three rows shown below were inserted into crash_new_leader on node3; that INSERT is not captured above):
postgres=# select * from crash_new_leader;
 id
----
  1
  2
  3
(3 rows)
Patroni listing on node3:
[root@node3 ~]# patronictl -c /etc/patroni/patroni.yml list
+ Cluster: postgres (7575009857136431951) -------+----+-------------+-----+------------+-----+
| Member | Host            | Role    | State     | TL | Receive LSN | Lag | Replay LSN | Lag |
+--------+-----------------+---------+-----------+----+-------------+-----+------------+-----+
| node2  | 192.168.183.129 | Replica | streaming |  5 | 0/5070B80   |   0 | 0/5070B80  |   0 |
| node3  | 192.168.183.130 | Leader  | running   |  5 |             |     |            |     |
+--------+-----------------+---------+-----------+----+-------------+-----+------------+-----+
Rejoining node1 (start Patroni)
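To bring node1 back under Patroni’s control, stop the manually started instance and start the Patroni service again (a sketch; unit names as used elsewhere in this post):

[root@node1 ~]# systemctl stop postgresql-17.service
[root@node1 ~]# systemctl start patroni.service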
When Patroni is restarted on node1, it detects the divergence and runs pg_rewind to synchronize the datadir with the new leader:
[root@node1 ~]# patronictl list
node1 | Replica | TL 5
Node1 is rewound and joins as a replica.
[root@node1 ~]# patronictl -c /etc/patroni/patroni.yml list
+ Cluster: postgres (7575009857136431951) -------+----+-------------+-----+------------+-----+
| Member | Host            | Role    | State     | TL | Receive LSN | Lag | Replay LSN | Lag |
+--------+-----------------+---------+-----------+----+-------------+-----+------------+-----+
| node1  | 192.168.183.131 | Replica | streaming |  5 | 0/5070CC0   |   0 | 0/5070CC0  |   0 |
| node2  | 192.168.183.129 | Replica | streaming |  5 | 0/5070CC0   |   0 | 0/5070CC0  |   0 |
| node3  | 192.168.183.130 | Leader  | running   |  5 |             |     |            |     |
+--------+-----------------+---------+-----------+----+-------------+-----+------------+-----+
Now check the table created earlier on node1:
postgres=# select * from crash_patroni;
ERROR:  relation "crash_patroni" does not exist
The table created on node3 (crash_new_leader) is now present on node1:
postgres=# select * from crash_new_leader;
 id
----
  1
  2
  3
(3 rows)
What happened? Let me explain.
Patroni detected:
- Node1 was on timeline 4
- The cluster progressed to timeline 5 (via promotion of node3)
- Divergence occurred
Patroni invoked pg_rewind:
pg_rewind: servers diverged at WAL location...
pg_rewind: rewinding...
pg_rewind: Done!
As a result:
- Node1’s data directory was rewritten to match the new leader (node3)
- The table crash_patroni created while node1 was standalone was lost (data loss)
- Node1 successfully rejoined as a replica on timeline 5
- Writes performed during standalone mode were erased
Pg_rewind: prerequisites and failure modes
pg_rewind is usually the preferred way to repair a diverged node because it:
- Resets the diverging timeline
- Reconstructs the datadir to match the new leader
- Rejoins the node without requiring a full base backup
Important prerequisites:
- wal_log_hints = on OR data checksums must be enabled.
Patroni typically ensures wal_log_hints = on for you, but always verify the setting, especially on manually built clusters or upgraded nodes.
If pg_rewind:
- Is disabled
- Is misconfigured
- Fails due to missing WAL segments or a catalog mismatch
- Crashes unexpectedly
then reinitialization (a full base backup + restore) is the only recovery option.
Key Risks and Considerations
1. Patroni must be the sole controller
Manual PostgreSQL start = Rogue instance.
2. Standalone writes WILL be lost
Every operation performed in standalone mode disappears once the node is rewound to rejoin the cluster.
3. Split-brain–like symptoms
Even if not a true split-brain, symptoms include:
- Diverged timelines
- Orphaned tables
- Inconsistent LSNs
- Confusing recovery decisions
4. Operational confusion
Different nodes may disagree on whether a table exists.
Precautions: practical steps
Mask the OS service
Prevent accidental systemctl start postgresql-XX.service by masking the unit:
sudo systemctl mask postgresql-XX.service
Monitor both Patroni and PostgreSQL services
Create alerts for:
- Unexpected postgresql-XX.service = active while patroni.service is down (see the sketch after this list)
- Missing Patroni heartbeat
- Timeline divergence detection
- Replicas not receiving WAL
- Unexpected PID or process-name mismatches
- pg_rewind events in Patroni logs
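A minimal sketch of the rogue-instance alert from the first bullet (unit names as used elsewhere in this post; wire the output into your alerting tool of choice):

#!/bin/bash
# Minimal rogue-instance check: PostgreSQL active while Patroni is not.
if systemctl is-active --quiet postgresql-17.service && \
   ! systemctl is-active --quiet patroni.service; then
    echo "CRITICAL: PostgreSQL is running outside Patroni on $(hostname)"
    exit 2
fi
echo "OK: no rogue PostgreSQL instance detected"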
Validate pg_rewind prerequisites
Confirm wal_log_hints = on or data checksums are enabled before taking cluster nodes into production.
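A quick way to verify both settings on each node (a sketch, run as the postgres OS user via psql):

sudo -u postgres psql -Atc "SHOW wal_log_hints;"     # expected: on (unless data checksums are enabled)
sudo -u postgres psql -Atc "SHOW data_checksums;"    # on if the cluster was initialized with --data-checksums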
Check ETCD compatibility
Ensure your ETCD major version is compatible with your Patroni release; older distributions may contain legacy ETCD packages.
Use port-based health checks
Check for a listener on port 5432 (with netstat or ss) rather than matching process names, which differ across heterogeneous OS families.
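For example, a simple liveness probe based on the listening port rather than the process name (assuming the default port 5432):

# Succeeds if anything is listening on the PostgreSQL port, regardless of how the process is named
# (netstat -ltn works the same way where ss is unavailable)
ss -ltn | grep -q ':5432 ' && echo "OK: port 5432 is listening" || echo "CRITICAL: nothing listening on 5432"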
Backup when in doubt
If you suspect standalone writes occurred before restarting Patroni, take a filesystem-level backup of the datadir if possible — this can help forensic investigation.
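A minimal sketch (the data directory path is the default for the PGDG RHEL packages and may differ in your environment):

# Stop the manually started instance first so the copy is consistent,
# then preserve the standalone data directory before Patroni rewinds it.
systemctl stop postgresql-17.service
cp -a /var/lib/pgsql/17/data /var/lib/pgsql/17/data.standalone.$(date +%Y%m%d%H%M%S)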
Early detection is essential to prevent data loss.
Summary
Starting PostgreSQL manually in a Patroni-managed HA cluster is not a harmless mistake — it is a data-consistency risk. It can cause:
- Timeline divergence
- Split-brain–like behavior
- Lost transactions and DDL
- Cluster confusion and administrative overhead
- Forced pg_rewind or full reinitialization
Follow the precautions above, make sure the pg_rewind prerequisites are met, monitor both Patroni and PostgreSQL, and always start PostgreSQL via Patroni. Do that, and you’ll greatly reduce the risk of waking up to a DBA nightmare.
If you’re using a Patroni cluster, always start PostgreSQL through the Patroni service. Your database will appreciate it. Let’s keep everything running smoothly!