PostgreSQL logical replication adoption is becoming more popular as significant advances continue to expand its range of capabilities.
While quite a few blogs have described features, there seems to be a lack of simple and straightforward advice on restoring stalled replication.
This blog demonstrates an extremely powerful approach to resolving replication problems using the Log Sequence Number, LSN.
For the purposes of this blog we’ll confine ourselves to INSERT operations that fail to logically replicate from the PUBLICATION to the SUBSCRIBED table. The technique described herein demonstrates the manipulation of these records by “skipping” them via function calls using the LSN. Once understood, this method can be used for more complex scenarios.
Stalled logical replication can occur under any number of circumstances:
- At Creation/Subscription time
- When performing DML operations
- When performing DDL operations
In a manner of speaking, a logical replication stall is basically the poster child for Murphy’s Law.
Monitoring logical replication
The following SQL statements can detect logical decoding failures. Keep in mind that most modern PostgreSQL monitoring solutions already use one or more of these metrics:
PUBLICATION Side
| 1 2 3 4 5 6 | select * from pg_get_replication_slots(); select * from pg_publication_tables; select * from pg_stat_activity; select * from pg_stat_replication; select * from pg_publication; select * from pg_publication_tables; |
SUBSCRIPTION Side
| 1 2 | select * from pg_subscription; select * from pg_stat_subscription; |
About slots
Recognizing the significance of the replication slot and its current state is crucial:
- Logical replication slots can use different plugins, but these are the most common:
- pgoutput: used for virtually all logical decoding purposes
- test_decoding: an example of a logical decoding output plugin, used for understanding logical decoding
- decoderbufs: delivers data as Protocol Buffers, used by Debezium for CDC purposes
- For every slot, only one connection can be made from a SUBSCRIBED connection.
- Querying the slot state, true/false, is ascertained via the function pg_get_replication_slots().
- There are many function calls capable of administering logically replicated slots.
- Slots cannot be deleted while its state is “true”
- Upon logical replication failures, the state becomes inactive and its activity flag is set to “false”
- Depending upon the version and its runtime parameter settings, inactive slots can be limited to retaining WALs, ex: max_slot_wal_keep_size, thus SUBSCRIBED tables can lose synchronization.
A single, inconsistent record in just one table is sufficient to stall replication for an entire collection of subscribed tables.
About Log Sequence Numbers (LSN)
The LSN points to specific locations within the WAL stream. This cluster-wide identity plays a fundamental role in replication, recovery, and data consistency by tracking all DML and DDL operations over the database’s lifetime.
LSNs are used in logical replication by the subscriber by tracking how far they have processed the WAL stream. When properly synchronized, the LSN guarantees that all changes to the PUBLICATION are consistent with the PUBLICATION
In the case of stalled replication, Postgres functions can use LSNs as arguments in order to restore logical replication.
Scenario
By adding a constraint to the subscribed table, a replication stall is caused when inserting multiple records in the published table with the same value that violates the unique constraint in the subscribed table.
In this example, resolution is achieved by “skipping” the conflicted record(s).
Set up Logical Replication
PG1: PUBLICATION, table t1
PG2: SUBSCRIPTION, table t1 (a UNIQUE constraint is added on column “c2”)
PG1
PG2

Replication stalls
PG1: Add conflicting records
PG2: logical replication is stalled due to UNIQUE constraint on column “c2”
PG2: Conflicting LSN flagged in postgres logs
![]()
Mitigation
ID conflicting records
The function pg_logical_slot_peek_binary_changes is used to query the state of replication.
Function pg_replication_slot_advance advances the current position of the replication slot after the oldest blocking INSERT, permitting replication to resume.
PG1: Slot activity is FALSE in order for this query to work

Skip conflicting records
PG1: Executing this query identifies the blocking records both by transaction ID and LSN.



Attention: Note that in a typical production environment, unlike this simple scenario, both logically correct and stalling records ordered by LSN will result in mixed-order records. A nuanced approach to skipping records will, therefore, be required.
Validation
Querying table t1 in both publication and subscription demonstrates that replication has been restored, but the problem records, while still present in PG1, have not been replicated to PG2.

Caveat
- Reading the Postgres logs on both servers that PUBLISH and SUBSCRIBE offer the most information identifying the root cause.
- When conducted properly, successful mitigation virtually resolves most issues without any data loss.
- Beware: modern versions of Postgres include runtime parameters that limit WALs retention. There can be a time limit to how many or how long the server can hold onto WALs being held by inactive replication slots.