PostgreSQL logical replication adoption is becoming more popular as significant advances continue to expand its range of capabilities. 

While quite a few blogs have described features, there seems to be a lack of simple and straightforward advice on restoring stalled replication.

This blog demonstrates an extremely powerful approach to resolving replication problems using the Log Sequence Number, LSN.

For the purposes of this blog we’ll confine ourselves to INSERT operations that fail to logically replicate from the PUBLICATION to the SUBSCRIBED table. The technique described herein demonstrates the manipulation of these records by “skipping” them via function calls using the LSN. Once understood, this method can be used for more complex scenarios.

Stalled logical replication can occur under any number of circumstances:

  • At Creation/Subscription time
  • When performing DML operations
  • When performing DDL operations

In a manner of speaking, a logical replication stall is basically the poster child for Murphy’s Law.

Monitoring logical replication

The following SQL statements can detect logical decoding failures. Keep in mind that most modern PostgreSQL monitoring solutions already use one or more of these metrics:

PUBLICATION Side

 

SUBSCRIPTION Side

About slots

Recognizing the significance of the replication slot and its current state is crucial:

  • Logical replication slots can use different plugins, but these are the most common:
    • pgoutput: used for virtually all logical decoding purposes
    • test_decoding: an example of a logical decoding output plugin, used for understanding logical decoding
    • decoderbufs: delivers data as Protocol Buffers, used by Debezium for CDC purposes
  • For every slot, only one connection can be made from a SUBSCRIBED connection.
  • Querying the slot state, true/false, is ascertained via the function pg_get_replication_slots().
  • There are many function calls capable of administering logically replicated slots.
  • Slots cannot be deleted while its state is “true”
  • Upon logical replication failures, the state becomes inactive and its activity flag is set to “false”
  • Depending upon the version and its runtime parameter settings, inactive slots can be limited to retaining WALs, ex: max_slot_wal_keep_size, thus SUBSCRIBED tables can lose synchronization.

A single, inconsistent record in just one table is sufficient to stall replication for an entire collection of subscribed tables.

About Log Sequence Numbers (LSN)

The LSN points to specific locations within the WAL stream. This cluster-wide identity plays a fundamental role in replication, recovery, and data consistency by tracking all DML and DDL operations over the database’s lifetime.

LSNs are used in logical replication by the subscriber by tracking how far they have processed the WAL stream. When properly synchronized, the LSN guarantees that all changes to the PUBLICATION are consistent with the PUBLICATION

In the case of stalled replication, Postgres functions can use LSNs as arguments in order to restore logical replication.

Scenario

By adding a constraint to the subscribed table, a replication stall is caused when inserting multiple records in the published table with the same value that violates the unique constraint in the subscribed table.

In this example, resolution is achieved by “skipping” the conflicted record(s).

Set up Logical Replication

PG1: PUBLICATION, table t1
PG1: PUBLICATION, table t1

PG2: SUBSCRIPTION, table t1 (a UNIQUE constraint is added on column “c2”)
PG2: SUBSCRIPTION, table t1

PG1

PG2

Replication stalls

PG1: Add conflicting records


PG2: logical replication is stalled due to UNIQUE constraint on column “c2”



PG2: Conflicting LSN flagged in postgres logs



Mitigation

ID conflicting records

The function pg_logical_slot_peek_binary_changes is used to query the state of replication.

Function pg_replication_slot_advance advances the current position of the replication slot after the oldest blocking INSERT, permitting replication to resume.

PG1: Slot activity is FALSE in order for this query to work




Skip conflicting records

PG1: Executing this query identifies the blocking records both by transaction ID and LSN.






Attention: Note that in a typical production environment, unlike this simple scenario, both logically correct and stalling records ordered by LSN will result in mixed-order records. A nuanced approach to skipping records will, therefore, be required.

Validation

Querying table t1 in both publication and subscription demonstrates that replication has been restored, but the problem records, while still present in PG1, have not been replicated to PG2.


Caveat

  • Reading the Postgres logs on both servers that PUBLISH and SUBSCRIBE offer the most information identifying the root cause.
  • When conducted properly, successful mitigation virtually resolves most issues without any data loss.
  • Beware: modern versions of Postgres include runtime parameters that limit WALs retention. There can be a time limit to how many or how long the server can hold onto WALs being held by inactive replication slots.

Subscribe
Notify of
guest

0 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments