Percona ClusterSync for MongoDB (PCSM) replicates data between MongoDB clusters — whether replica sets or sharded — handling both the initial bulk data clone and continuous change replication via MongoDB’s change streams. It’s designed for migrations, failover preparation, and keeping a secondary cluster in sync with production.
Version 0.8.0 brings two architectural changes to the change replication pipeline that significantly improve throughput and reduce replication lag under high write workloads. Both changes target the continuous replication phase — the long-running process that tails the source cluster’s change stream and applies events to the target.
In this post, we’ll walk through what changed, why it matters, and how the new architecture works.
The original change replication pipeline in PCSM was intentionally simple. A single goroutine handled the entire flow: reading events from the MongoDB change stream, parsing the BSON payloads, and writing bulk operations to the target cluster — all sequentially, in a single thread.
This design was a deliberate choice. Replicating data between MongoDB clusters involves subtle correctness challenges — preserving document-level operation ordering, handling DDL events that affect schema, and maintaining checkpoint consistency so replication can resume safely after a failure. Getting all of this right was the priority, and a single-threaded pipeline made the logic easier to reason about, test, and verify.
The trade-off was performance. Under high write throughput, the sequential pipeline became a bottleneck. While the worker waited for MongoDB to acknowledge a bulk write, it couldn’t read new events from the change stream. Write latency on the target directly blocked read throughput from the source. Replication lag would grow under sustained load, even when both clusters had spare capacity.
With the correctness foundation proven across months of testing and production use, and with support for sharded clusters added in 0.7.0, 0.8.0 was the right time to unlock performance without compromising those guarantees.
The first major change replaces the single-goroutine pipeline with a worker pool architecture that processes events in parallel across multiple workers.
The pipeline is now split into three stages:

1. A reader goroutine that consumes raw events from the source cluster's change stream.
2. A pool of worker goroutines that parse the BSON payloads and build bulk operations in parallel.
3. Bulk writes that apply the batched operations to the target cluster.
Parallelism introduces risk if you're not careful. PCSM 0.8.0 handles the edge cases: document-level operation ordering is preserved, DDL events act as barriers that drain in-flight work before proceeding, and checkpoints stay consistent so replication can resume safely after a failure.
A secondary but meaningful optimization: BSON parsing now happens inside the workers rather than in the reader. In the old pipeline, the single goroutine parsed every event before doing anything else. Now, raw BSON is passed to N workers, which parse it in parallel. This spreads CPU-bound work across cores and keeps the reader focused on consuming events from the change stream as fast as possible.
The second change goes one level deeper. Even with multiple workers, each worker still had the original problem: when it called flush() to write a bulk operation to MongoDB, it blocked until the write was acknowledged. During that time, the worker couldn’t read new events from its queue.
The async bulk-write pipeline decouples bulk building from bulk writing within each worker.
Each worker now has two goroutines:

- A main loop that reads events from the worker's queue, builds the current bulk, and seals it when it is ready.
- An async writer that dequeues sealed bulks and writes them to MongoDB.
This means workers are never idle waiting for MongoDB. They’re always reading and batching the next set of operations while the previous write completes.
The bulk queue per worker is bounded (default size: 3). When the queue is full, the main loop blocks on submit() until the writer dequeues a bulk. This provides natural backpressure — if MongoDB can’t keep up, workers slow down rather than consuming unbounded memory.
Maximum memory per worker is capped at: the worker's event queue size + the bulk queue size + 1 bulk being written + 1 bulk being built. This is predictable and configurable.
Each sealed bulk captures the checkpoint timestamp at seal time — the timestamp of the latest event included in that bulk. The committed checkpoint is only advanced after the async writer confirms the bulk was persisted to MongoDB. This ensures that if PCSM crashes or restarts, it resumes from a timestamp that reflects what was actually written, never skipping events.
When a barrier event arrives (DDL or stop request), the main loop closes the bulk queue and waits for the async writer to finish all pending bulks. Only after everything is flushed does the barrier proceed. On resume, the writer goroutine is restarted with fresh state.
Together, these two features transform the pipeline from a single-threaded sequential process into a fully pipelined, parallel architecture.
The improvement is multiplicative: parallel workers spread parsing and batching across cores, while async writes overlap bulk building with write latency inside each worker. The result is a pipeline that can sustain substantially higher event throughput before replication lag begins to grow.
All replication tuning parameters (worker count, queue sizes, batch sizes, flush intervals) are now configurable via CLI flags, environment variables, and the HTTP API. The defaults are auto-tuned and work well for most deployments. For the full list of options and guidance on when to adjust them, see the PCSM Configuration Reference.
We tested change replication throughput across document sizes ranging from 500 bytes to 200 KB, with the source client's compressors configured to maximize change stream read throughput. The metric is maximum sustained QPS (queries per second on the source) before replication lag begins to accumulate.
PCSM was running on an AWS i3en.3xlarge instance with 12 vCPUs and 96 GiB of RAM.
Replica Set Clusters Test Results

From the graph, we can see a 4.5x improvement for small documents, 4.2x for regular documents, and 3.1x for large documents.
Sharded Cluster Test Results

On sharded clusters, the improvements are much larger: 14.5x for small documents, 18.5x for regular documents, and 6x for large documents.
PCSM 0.8.0 removes the major bottleneck that existed during the apply phase — parsing events and writing them to the target. With parallel workers and async writes, the apply pipeline can now saturate the target cluster’s write capacity much more effectively.
Interestingly, this shifts the bottleneck. In our testing, the primary limiting factor is now the rate at which events can be read from MongoDB’s change stream, particularly when documents are large. Change stream throughput is ultimately governed by MongoDB server-side performance and network bandwidth, which are outside PCSM’s control. This is an area we’re actively investigating for future releases.
For sharded clusters, we plan to add support for multi-instance PCSM deployment, where each PCSM instance handles a single shard independently. This will allow replication throughput to scale linearly with the number of shards.
If you’re evaluating PCSM for a migration or considering an upgrade, 0.8.0 is a significant step forward for write-heavy workloads. The defaults are designed to work well without tuning, but the new configuration options give you full control when you need it.
Percona thrives on community collaboration. As we continue to refine PCSM and celebrate its production-ready GA release, we invite you to get involved. We welcome bug reports, feature suggestions, and code contributions. Your input helps us build the tools the MongoDB community actually needs. We encourage you to deploy PCSM in your production environments today! Check out the PCSM repository and join the journey with us!
Ready to break free from vendor lock-in? Check out the PCSM Documentation to get started with your first sync today.