Extending pt-archiver with a Partition-Aware Plug-in for Fast Retention Policy Enforcement

June 16, 2026
Author
Corrado Pandiani
Share this Post:

Managing data retention policies is one of the most common operational tasks in MySQL.

Applications continuously generate transactional, audit, logging, telemetry, and event data. Over time, these tables can grow to billions of rows, causing:

  • Larger backups
  • Longer recovery times
  • Reduced buffer pool efficiency
  • Slower index maintenance
  • Increased storage costs
  • Degraded query performance

To address these problems, organizations typically implement retention policies based on dates or timestamps. Examples include deleting events older than 90 days or purging session data older than 30 days and so forth. The deleted data can then eventually be archived somewhere else, like in another DBMS or on external files.

One of the most widely used tools for implementing these policies in MySQL ecosystems is pt-archiver, part of the Percona Toolkit.

This article provides a review of what pt-archiver is and how to use it, but in particular it focuses on the fact this tool is not partitioning aware, and this can make the deletion phase more costly. The article shows how to extend pt-archiver with a Perl plugin to make it aware of partitioning.

 

What is pt-archiver?

pt-archiver is a command-line utility from Percona Toolkit designed to:

  • Archive rows from MySQL tables
  • Purge rows from MySQL tables
  • Move data between tables into the local database or a remote one
  • Export rows into files

In a few words: implementing retention policies safely.

The tool processes rows incrementally in chunks, avoiding massive transactions and reducing impact on production systems.

Example:

This command:

  • Scans rows matching the WHERE condition
  • Processes them in chunks of 1000 rows
  • Commits every chunk
  • Deletes matching rows from the source table

pt-archiver provides several advantages compared to ad-hoc DELETE statements.

Instead of running:

which may:

  • Lock rows for a long time
  • Generate massive undo/redo logs
  • Create replication lag
  • Exhaust transaction logs

pt-archiver processes rows incrementally to make the process overhead less impactful for the database performance.

pt-archiver implementation permits flexible archival strategies

Rows can be copied to another table on a remote host, exported to files or removed completely

More details: ps://docs.percona.com/percona-toolkit/pt-archiver.html

Example: Copy rows to a remote archive table

The following example archives rows older than 90 days from a local table into an archive table hosted on a remote MySQL server:

In this example:

  • –source defines the source table
  • –dest defines the remote archive destination
  • –where selects rows eligible for archival
  • –limit controls batch size
  • –commit-each commits every batch independently to reduce transaction overhead

-progress reports progress every 10,000 rows

If rows should be removed from the source table after being copied, add –purge

Example: Export rows to a file

The following example exports rows older than one year into a text file:

In this example:

  • –file specifies the output file
  • -output-format csv exports rows in CSV format
  • Date placeholders in the filename are expanded automatically

Rows can optionally be deleted from the source table by adding –purge

This allows pt-archiver to be used both for data retention and for offline archival workflows.

The Hidden Cost of DELETE Statements

Although pt-archiver is much safer than massive DELETE operations, it still fundamentally relies on DELETE statements.

This is a critical point.

Even when there are proper indexes, the rows are processed in chunks, and transactions are small; the large-scale DELETE operations remain expensive.

Deleting rows is expensive in InnoDB because it involves:

  • Locating rows via indexes
  • Modifying clustered indexes
  • Modifying secondary indexes
  • Generating undo logs
  • Generating redo logs
  • Purge thread processing
  • Replication event generation
  • Page fragmentation

When deleting billions of rows, the overhead becomes enormous.

Indexes help for sure, but only partially.

Consider:

If created_at is indexed, MySQL can efficiently locate rows.

However, locating rows efficiently is only part of the cost. The actual delete operations still require all those things we mentioned above.

At considerable scale, this becomes expensive.

Why RANGE Partitioning is Superior for Retention Policies

For time-based retention policies, partitioning is often dramatically more efficient. In particular, RANGE partitioning is very useful for these cases.

Example:

With partitioning, dropping old data becomes:

This operation is dramatically faster than running a DELETE.

Dropping a partition:

  • Removes an entire physical partition
  • Avoids row-by-row DELETE
  • Avoids undo generation for each row
  • Avoids secondary index maintenance per row
  • Minimizes redo generation
  • Is nearly metadata-only

This can remove millions or billions of rows in a matter of seconds without the same large cost of DELETE.

The Problem: pt-archiver is Not Partition-Aware

Unfortunately, pt-archiver does not automatically understand partitioning strategies.

Even if the table is partitioned or the retention policy perfectly matches partition boundaries, pt-archiver still executes DELETE statements.

Example:

Internally, this still produces DELETE … instead of ALTER TABLE … DROP PARTITION …

This means organizations may lose the major operational benefits of partitioning, or they need to implement custom scripts for managing the selection of rows to copy using pt-archiver and then use DROP PARTITION separately from the tool. That is doable, and to be honest, not too complicated, but why not make pt-archiver aware of partitioning for some specific use cases?

Extending pt-archiver with Pulg-ins

Fortunately, pt-archiver supports Perl plug-ins.

A plug-in can do plenty of things. Like: inspect runtime conditions, interact with MySQL, override behaviors, and execute custom logic

This gives us an opportunity to implement partition-aware retention handling.

The plug-in can:

  1. Inspect partition definitions
  2. Analyze the WHERE condition
  3. Determine which partitions are fully expired
  4. Execute ALTER TABLE DROP PARTITION
  5. Prevent row-by-row DELETE processing

This approach combines the scheduling/orchestration power of pt-archiver with the efficiency of partition pruning.

Plug-in Design

Our plug-in will:

  • Connect using the pt-archiver DB handle
  • Inspect INFORMATION_SCHEMA.PARTITIONS
  • Identify partitions older than the retention cutoff
  • Issue DROP PARTITION statements
  • Log actions
  • Skip DELETE processing

Assumptions:

  • The table is RANGE partitioned
  • Partitions are DATETIME based using the TO_DAYS() function to define ranges
  • Partition naming convention contains dates
  • Retention policy aligns with partition boundaries; if the plugin cannot determine a specific boundary, pt-archiver does nothing

Full Perl Plug-in for pt-archiver

Create the file named  pt_archiver_partition_drop.pm into the /usr/local/share/perl5 path.

Also set the environment variable PERL5LIB to let pt-archiver where to find the Perl package

Example Usage

First, create the partitioned table events and insert some fake data.

 

Now you can run the following command to delete all rows before the 1st of May, which, by the way, matches the entire first partition in the table.

 

Notice the Perl plugin must be indicated with the m option in the DSN string.

In practice:

  • pt-archiver initializes
  • The plug-in runs
  • Partitions are dropped
  • No DELETE statements are executed

Here is what you get from the execution of the above command:

You can simply verify the table has been managed correctly:

SELECT * FROM mydb.events;

SHOW CREATE TABLE mydb.events;

 

Now TRUNCATE the table and recreate the data and try now to specify the where conditions that match a RANGE that is not the first in the list of the boundaries.

You should get:

In this case, two partitions have been identified and dropped.

 

Truncate the table and recreate the data again. Try now to provide a WHERE condition that does not match any of the boundaries in the RANGE.

 

You get the following:

As expected, the tool now refuses to execute anything if it doesn’t find an exact match.

 

Operational Benefits

This approach provides major advantages.

Dropping partitions is vastly faster than deleting rows, and minimal binary logging is needed, compared to billions of row deletes. There is no massive transactional overhead for managing undo logs and purging. You get then a better InnoDB Buffer Pool stability because of less page churn.

In the end, retention jobs are completed quickly and consistently in a predictable way and at the minimal cost.

 

Important Caveats

Partition Boundaries Must Match Retention Policy

If partitions contain mixed retention windows, DROP PARTITION may remove too much data. For this reason, ensure correct partition design.

Recommended:

  • daily partitions
  • weekly partitions
  • monthly partitions

aligned with business retention requirements.

Metadata Locks

ALTER TABLE DROP PARTITION still acquires metadata locks.

Test carefully in production.

Backup Awareness

Ensure dropped partitions are no longer needed before removal or use pt-archiver to also copy the data into a remote server or dump the data into a CSV file before running the DROP PARTITION.

 

Possible Enhancements

The plug-in can be extended further.

Potential improvements:

  • Support for daily partitions
  • Support for UNIX timestamp partitions
  • Dry-run reporting
  • Automatic partition creation
  • Push Slack notifications
  • Export Prometheus metrics
  • Safety checks for replicas
  • GTID-aware orchestration
  • Integration with pt-online-schema-change workflows

These are just some ideas I had meanwhile doing my tests. What you can do by implementing a Perl plugin is only limited by your imagination and your real needs.

Conclusion

pt-archiver remains an excellent tool for implementing retention policies and archival workflows.

However, DELETE-based purging becomes increasingly expensive at scale, even with proper indexing and chunked processing.

For large time-series or historical datasets, RANGE partitioning is often a dramatically superior strategy.

The challenge is that pt-archiver does not natively leverage partition-level operations.

Fortunately, its Perl plug-in architecture allows advanced users to extend its behavior and implement partition-aware cleanup logic.

By combining:

  • pt-archiver orchestration
  • MySQL RANGE partitioning
  • Custom Perl plug-ins

Organizations can achieve:

  • Faster retention enforcement
  • Lower operational overhead
  • Smaller replication impact
  • Dramatically improved scalability

For large MySQL deployments, this hybrid approach can turn multi-hour purge operations into near-instant metadata operations.

The use case presented in this article is limited to a specific scenario, but you can reuse it or customize it if you have a different kind of RANGE partitioning, for example, not using TO_DAYS().

Take this as just an example of how you can extend pt-archiver. What you can do for real is driven by your needs and/or only limited by your imagination.

More info about extending pt-archiver:
https://docs.percona.com/percona-toolkit/pt-archiver.html#extending

 

0 0 votes
Article Rating
Subscribe
Notify of
guest

0 Comments
Oldest
Newest Most Voted

Far
Enough.

Said no pioneer ever.
MySQL, PostgreSQL, InnoDB, MariaDB, MongoDB and Kubernetes are trademarks for their respective owners.
© 2026 Percona All Rights Reserved