Managing data retention policies is one of the most common operational tasks in MySQL.
Applications continuously generate transactional, audit, logging, telemetry, and event data. Over time, these tables can grow to billions of rows, causing:
To address these problems, organizations typically implement retention policies based on dates or timestamps. Examples include deleting events older than 90 days or purging session data older than 30 days and so forth. The deleted data can then eventually be archived somewhere else, like in another DBMS or on external files.
One of the most widely used tools for implementing these policies in MySQL ecosystems is pt-archiver, part of the Percona Toolkit.
This article provides a review of what pt-archiver is and how to use it, but in particular it focuses on the fact this tool is not partitioning aware, and this can make the deletion phase more costly. The article shows how to extend pt-archiver with a Perl plugin to make it aware of partitioning.
pt-archiver is a command-line utility from Percona Toolkit designed to:
In a few words: implementing retention policies safely.
The tool processes rows incrementally in chunks, avoiding massive transactions and reducing impact on production systems.
Example:
|
1 2 3 4 5 6 |
pt-archiver \ --source h=localhost,D=mydb,t=events \ --where "created_at < '2026-05-01'" \ --purge \ --limit 1000 \ --commit-each |
This command:
pt-archiver provides several advantages compared to ad-hoc DELETE statements.
Instead of running:
|
1 2 |
DELETE FROM events WHERE created_at < '2026-05-01'; |
which may:
pt-archiver processes rows incrementally to make the process overhead less impactful for the database performance.
pt-archiver implementation permits flexible archival strategies
Rows can be copied to another table on a remote host, exported to files or removed completely
More details: ps://docs.percona.com/percona-toolkit/pt-archiver.html
The following example archives rows older than 90 days from a local table into an archive table hosted on a remote MySQL server:
|
1 2 3 4 5 6 7 8 |
pt-archiver \ --source h=localhost,D=sales,t=orders,u=archiver,p=secret \ --dest h=archive-server,D=archive,t=orders_archive,u=archiver,p=secret \ --where "created_at < '2026-05-01'" \ --limit 1000 \ --commit-each \ --progress 10000 \ --statistics |
In this example:
–-progress reports progress every 10,000 rows
If rows should be removed from the source table after being copied, add –purge
The following example exports rows older than one year into a text file:
|
1 2 3 4 5 6 7 8 9 |
pt-archiver \ --source h=localhost,D=sales,t=orders,u=archiver,p=secret \ --where "created_at < NOW() - INTERVAL 1 YEAR" \ --file '/tmp/orders_archive_%Y-%m-%d.txt' \ --output-format csv \ --limit 1000 \ --commit-each \ --progress 10000 \ --statistics |
In this example:
Rows can optionally be deleted from the source table by adding –purge
This allows pt-archiver to be used both for data retention and for offline archival workflows.
Although pt-archiver is much safer than massive DELETE operations, it still fundamentally relies on DELETE statements.
This is a critical point.
Even when there are proper indexes, the rows are processed in chunks, and transactions are small; the large-scale DELETE operations remain expensive.
Deleting rows is expensive in InnoDB because it involves:
When deleting billions of rows, the overhead becomes enormous.
Indexes help for sure, but only partially.
Consider:
|
1 2 |
DELETE FROM events WHERE created_at < '2024-01-01'; |
If created_at is indexed, MySQL can efficiently locate rows.
However, locating rows efficiently is only part of the cost. The actual delete operations still require all those things we mentioned above.
At considerable scale, this becomes expensive.
For time-based retention policies, partitioning is often dramatically more efficient. In particular, RANGE partitioning is very useful for these cases.
Example:
|
1 2 3 4 5 6 7 8 9 10 11 12 |
CREATE TABLE events ( id BIGINT NOT NULL, created_at DATETIME NOT NULL, payload JSON, PRIMARY KEY(id, created_at) ) PARTITION BY RANGE (TO_DAYS(created_at)) ( PARTITION p202604 VALUES LESS THAN (TO_DAYS('2026-05-01')), PARTITION p202605 VALUES LESS THAN (TO_DAYS('2026-06-01')), PARTITION p202606 VALUES LESS THAN (TO_DAYS('2026-07-01')) ); |
With partitioning, dropping old data becomes:
|
1 |
ALTER TABLE events DROP PARTITION p202604; |
This operation is dramatically faster than running a DELETE.
Dropping a partition:
This can remove millions or billions of rows in a matter of seconds without the same large cost of DELETE.
Unfortunately, pt-archiver does not automatically understand partitioning strategies.
Even if the table is partitioned or the retention policy perfectly matches partition boundaries, pt-archiver still executes DELETE statements.
Example:
|
1 2 3 |
pt-archiver \ --where "created_at < NOW() - INTERVAL 90 DAY" \ --purge |
Internally, this still produces DELETE … instead of ALTER TABLE … DROP PARTITION …
This means organizations may lose the major operational benefits of partitioning, or they need to implement custom scripts for managing the selection of rows to copy using pt-archiver and then use DROP PARTITION separately from the tool. That is doable, and to be honest, not too complicated, but why not make pt-archiver aware of partitioning for some specific use cases?
Fortunately, pt-archiver supports Perl plug-ins.
A plug-in can do plenty of things. Like: inspect runtime conditions, interact with MySQL, override behaviors, and execute custom logic
This gives us an opportunity to implement partition-aware retention handling.
The plug-in can:
This approach combines the scheduling/orchestration power of pt-archiver with the efficiency of partition pruning.
Our plug-in will:
Assumptions:
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 |
package pt_archiver_partition_drop; use strict; use warnings; sub new { my ($class, %args) = @_; my $self = { dbh => $args{dbh}, db => $args{db}, tbl => $args{tbl}, statistics => {}, }; bless $self, $class; return $self; } sub statistics { my ($self) = @_; return $self->{statistics}; } sub before_begin { my ($self) = @_; my $dbh = $self->{dbh} or die "Missing dbh from pt-archiver\n"; my $db = $self->{db} or die "Missing db from pt-archiver plugin args\n"; my $tbl = $self->{tbl} or die "Missing tbl from pt-archiver plugin args\n"; my $where = _get_cmdline_option('where'); my $dryrun = $ENV{PT_PARTITION_DROP_DRY_RUN} ? 1 : 0; die "Missing --where from original command line\n" unless $where; print "PLUGIN before_begin called\n"; print "DB=$db TABLE=$tbl\n"; print "WHERE=$where\n"; print "PLUGIN_DRY_RUN=$dryrun\n"; my ($column, $cutoff_date) = _parse_where($where); my $partitions = _get_partitions($dbh, $db, $tbl); if (!@$partitions) { print "Table `$db`.`$tbl` is not partitioned. Refusing DELETE.\n"; exit(0); } my $partition_expr = $partitions->[0]->{expression}; die "Missing PARTITION_EXPRESSION\n" unless defined $partition_expr && length $partition_expr; print "Partition expression: $partition_expr\n"; my $cutoff_value = _evaluate_cutoff( $dbh, $partition_expr, $column, $cutoff_date, ); print "Cutoff date: $cutoff_date\n"; print "Cutoff boundary value: $cutoff_value\n"; my $matched; for my $p (@$partitions) { next if !defined $p->{description}; next if uc($p->{description}) eq 'MAXVALUE'; if ($p->{description} == $cutoff_value) { $matched = $p; last; } } if (!$matched) { print "No exact partition boundary matches cutoff $cutoff_value. Refusing DELETE.\n"; exit(0); } print "Matched boundary partition: $matched->{name}, position $matched->{position}\n"; my @drop; for my $p (@$partitions) { next if !defined $p->{description}; next if uc($p->{description}) eq 'MAXVALUE'; if ($p->{position} <= $matched->{position}) { push @drop, $p->{name}; print "Eligible for DROP: $p->{name}, boundary $p->{description}\n"; } } if (!@drop) { print "No partitions eligible for DROP. Refusing DELETE.\n"; exit(0); } my $sql = sprintf( "ALTER TABLE %s.%s DROP PARTITION %s", _quote_ident($db), _quote_ident($tbl), join(", ", map { _quote_ident($_) } @drop), ); print "SQL: $sql\n"; if ($dryrun) { print "PT_PARTITION_DROP_DRY_RUN enabled. Not executing DROP PARTITION.\n"; } else { $dbh->do($sql); print "Dropped partitions: " . join(", ", @drop) . "\n"; } $self->{statistics}->{partitions_dropped} = scalar @drop; exit(0); } sub _parse_where { my ($where) = @_; $where =~ s/^\s+|\s+$//g; die "Only WHERE format supported: created_at < 'YYYY-MM-DD'\n" unless $where =~ /^`?([A-Za-z0-9_]+)`?\s*<\s*'(\d{4}-\d{2}-\d{2})'\s*$/; return ($1, $2); } sub _evaluate_cutoff { my ($dbh, $partition_expr, $column, $cutoff_date) = @_; my $expr = $partition_expr; $expr =~ s/`//g; die "Partition expression does not reference column `$column`: $partition_expr\n" unless $expr =~ /\b\Q$column\E\b/i; $expr =~ s/\b\Q$column\E\b/'$cutoff_date'/ig; die "Unsafe generated expression: $expr\n" unless $expr =~ /^[A-Za-z0-9_\s\(\)\+\-\*\/,\.'":]+$/; my $sql = "SELECT $expr"; print "Boundary evaluation SQL: $sql\n"; my ($value) = $dbh->selectrow_array($sql); die "Cannot evaluate cutoff expression: $sql\n" unless defined $value; return $value; } sub _get_partitions { my ($dbh, $db, $tbl) = @_; my $sql = q{ SELECT PARTITION_NAME, PARTITION_DESCRIPTION, PARTITION_EXPRESSION, PARTITION_ORDINAL_POSITION FROM INFORMATION_SCHEMA.PARTITIONS WHERE TABLE_SCHEMA = ? AND TABLE_NAME = ? AND PARTITION_NAME IS NOT NULL ORDER BY PARTITION_ORDINAL_POSITION }; my $sth = $dbh->prepare($sql); $sth->execute($db, $tbl); my @partitions; while (my $row = $sth->fetchrow_hashref()) { push @partitions, { name => $row->{PARTITION_NAME}, description => $row->{PARTITION_DESCRIPTION}, expression => $row->{PARTITION_EXPRESSION}, position => $row->{PARTITION_ORDINAL_POSITION}, }; } return \@partitions; } sub _get_cmdline_option { my ($name) = @_; my $opt = "--$name"; for (my $i = 0; $i < @ARGV; $i++) { if ($ARGV[$i] eq $opt && defined $ARGV[$i + 1]) { return $ARGV[$i + 1]; } if ($ARGV[$i] =~ /^\Q$opt\E=(.*)$/) { return $1; } } if (open my $fh, '<', "/proc/$$/cmdline") { local $/; my $raw = <$fh>; close $fh; my @cmd = split /\0/, $raw; for (my $i = 0; $i < @cmd; $i++) { if ($cmd[$i] eq $opt && defined $cmd[$i + 1]) { return $cmd[$i + 1]; } if ($cmd[$i] =~ /^\Q$opt\E=(.*)$/) { return $1; } } } return undef; } sub _quote_ident { my ($ident) = @_; die "Invalid identifier: $ident\n" unless defined $ident && $ident =~ /^[A-Za-z0-9_]+$/; return "`$ident`"; } 1; |
Create the file named pt_archiver_partition_drop.pm into the /usr/local/share/perl5 path.
Also set the environment variable PERL5LIB to let pt-archiver where to find the Perl package
|
1 |
export PERL5LIB=/usr/local/share/perl5 |
First, create the partitioned table events and insert some fake data.
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 |
DROP TABLE IF EXISTS events; CREATE TABLE events ( id BIGINT NOT NULL, created_at DATETIME NOT NULL, payload JSON DEFAULT NULL, PRIMARY KEY (id, created_at) ) PARTITION BY RANGE (TO_DAYS(created_at)) ( PARTITION p202604 VALUES LESS THAN (TO_DAYS('2026-05-01')), PARTITION p202605 VALUES LESS THAN (TO_DAYS('2026-06-01')), PARTITION p202606 VALUES LESS THAN (TO_DAYS('2026-07-01')), PARTITION pmax VALUES LESS THAN MAXVALUE ); INSERT INTO events (id, created_at, payload) VALUES -- p202604 (1, '2026-04-01 08:00:00', JSON_OBJECT('event', 'login', 'user', 'alice')), (2, '2026-04-03 09:15:00', JSON_OBJECT('event', 'view', 'page', 'home')), (3, '2026-04-05 10:30:00', JSON_OBJECT('event', 'click', 'button', 'signup')), (4, '2026-04-08 11:45:00', JSON_OBJECT('event', 'search', 'term', 'mysql')), (5, '2026-04-10 12:00:00', JSON_OBJECT('event', 'purchase', 'amount', 100)), (6, '2026-04-14 13:20:00', JSON_OBJECT('event', 'logout', 'user', 'alice')), (7, '2026-04-18 14:35:00', JSON_OBJECT('event', 'download', 'file', 'report.pdf')), (8, '2026-04-22 15:50:00', JSON_OBJECT('event', 'upload', 'file', 'image.png')), (9, '2026-04-26 16:05:00', JSON_OBJECT('event', 'click', 'button', 'buy')), (10, '2026-04-30 23:59:59', JSON_OBJECT('event', 'month_end')), -- p202605 (11, '2026-05-01 00:00:00', JSON_OBJECT('event', 'login', 'user', 'bob')), (12, '2026-05-03 08:10:00', JSON_OBJECT('event', 'view', 'page', 'pricing')), (13, '2026-05-06 09:20:00', JSON_OBJECT('event', 'search', 'term', 'percona')), (14, '2026-05-09 10:30:00', JSON_OBJECT('event', 'purchase', 'amount', 250)), (15, '2026-05-12 11:40:00', JSON_OBJECT('event', 'logout', 'user', 'bob')), (16, '2026-05-16 12:50:00', JSON_OBJECT('event', 'download', 'file', 'backup.sql')), (17, '2026-05-20 13:00:00', JSON_OBJECT('event', 'upload', 'file', 'data.csv')), (18, '2026-05-24 14:10:00', JSON_OBJECT('event', 'click', 'button', 'subscribe')), (19, '2026-05-28 15:20:00', JSON_OBJECT('event', 'view', 'page', 'docs')), (20, '2026-05-31 23:59:59', JSON_OBJECT('event', 'month_end')), -- p202606 (21, '2026-06-01 00:00:00', JSON_OBJECT('event', 'login', 'user', 'carol')), (22, '2026-06-03 08:05:00', JSON_OBJECT('event', 'search', 'term', 'partitioning')), (23, '2026-06-06 09:15:00', JSON_OBJECT('event', 'view', 'page', 'dashboard')), (24, '2026-06-09 10:25:00', JSON_OBJECT('event', 'purchase', 'amount', 500)), (25, '2026-06-12 11:35:00', JSON_OBJECT('event', 'logout', 'user', 'carol')), (26, '2026-06-16 12:45:00', JSON_OBJECT('event', 'login', 'user', 'dave')), (27, '2026-06-20 13:55:00', JSON_OBJECT('event', 'download', 'file', 'archive.zip')), (28, '2026-06-24 14:05:00', JSON_OBJECT('event', 'upload', 'file', 'video.mp4')), (29, '2026-06-28 15:15:00', JSON_OBJECT('event', 'click', 'button', 'checkout')), (30, '2026-06-30 23:59:59', JSON_OBJECT('event', 'month_end')), -- pmax (31, '2026-07-01 00:00:00', JSON_OBJECT('event', 'login', 'user', 'eve')), (32, '2026-07-05 08:30:00', JSON_OBJECT('event', 'view', 'page', 'future')), (33, '2026-07-10 09:45:00', JSON_OBJECT('event', 'search', 'term', 'maxvalue')), (34, '2026-08-01 10:00:00', JSON_OBJECT('event', 'purchase', 'amount', 750)), (35, '2026-09-01 11:15:00', JSON_OBJECT('event', 'retained_future')); |
Now you can run the following command to delete all rows before the 1st of May, which, by the way, matches the entire first partition in the table.
|
1 2 3 4 |
pt-archiver \ --source h=localhost,D=mydb,t=events,m=pt_archiver_partition_drop \ --where "created_at < '2026-05-01'" \ --purge |
Notice the Perl plugin must be indicated with the m option in the DSN string.
In practice:
Here is what you get from the execution of the above command:
|
1 2 3 4 5 6 7 8 9 10 11 12 |
PLUGIN before_begin called DB=mydb TABLE=events WHERE=created_at < '2026-05-01' PLUGIN_DRY_RUN=0 Partition expression: to_days(`created_at`) Boundary evaluation SQL: SELECT to_days('2026-05-01') Cutoff date: 2026-05-01 Cutoff boundary value: 740102 Matched boundary partition: p202604, position 1 Eligible for DROP: p202604, boundary 740102 SQL: ALTER TABLE `mydb`.`events` DROP PARTITION `p202604` Dropped partitions: p202604 |
You can simply verify the table has been managed correctly:
SELECT * FROM mydb.events;
SHOW CREATE TABLE mydb.events;
Now TRUNCATE the table and recreate the data and try now to specify the where conditions that match a RANGE that is not the first in the list of the boundaries.
|
1 2 3 4 |
pt-archiver \ --source h=localhost,D=mydb,t=events,m=pt_archiver_partition_drop \ --where "created_at < '2026-06-01'" \ --purge |
You should get:
|
1 2 3 4 5 6 7 8 9 10 11 12 13 |
PLUGIN before_begin called DB=mydb TABLE=events WHERE=created_at < '2026-06-01' PLUGIN_DRY_RUN=0 Partition expression: to_days(`created_at`) Boundary evaluation SQL: SELECT to_days('2026-06-01') Cutoff date: 2026-06-01 Cutoff boundary value: 740133 Matched boundary partition: p202605, position 2 Eligible for DROP: p202604, boundary 740102 Eligible for DROP: p202605, boundary 740133 SQL: ALTER TABLE `mydb`.`events` DROP PARTITION `p202604`, `p202605` Dropped partitions: p202604, p202605 |
In this case, two partitions have been identified and dropped.
Truncate the table and recreate the data again. Try now to provide a WHERE condition that does not match any of the boundaries in the RANGE.
|
1 2 3 4 |
pt-archiver \ --source h=localhost,D=mydb,t=events,m=pt_archiver_partition_drop \ --where "created_at < '2026-04-25'" \ --purge |
You get the following:
|
1 2 3 4 5 6 7 8 9 |
PLUGIN before_begin called DB=mydb TABLE=events WHERE=created_at < '2026-04-25' PLUGIN_DRY_RUN=0 Partition expression: to_days(`created_at`) Boundary evaluation SQL: SELECT to_days('2026-04-25') Cutoff date: 2026-04-25 Cutoff boundary value: 740096 No exact partition boundary matches cutoff 740096. Refusing DELETE. |
As expected, the tool now refuses to execute anything if it doesn’t find an exact match.
This approach provides major advantages.
Dropping partitions is vastly faster than deleting rows, and minimal binary logging is needed, compared to billions of row deletes. There is no massive transactional overhead for managing undo logs and purging. You get then a better InnoDB Buffer Pool stability because of less page churn.
In the end, retention jobs are completed quickly and consistently in a predictable way and at the minimal cost.
If partitions contain mixed retention windows, DROP PARTITION may remove too much data. For this reason, ensure correct partition design.
Recommended:
aligned with business retention requirements.
ALTER TABLE DROP PARTITION still acquires metadata locks.
Test carefully in production.
Ensure dropped partitions are no longer needed before removal or use pt-archiver to also copy the data into a remote server or dump the data into a CSV file before running the DROP PARTITION.
The plug-in can be extended further.
Potential improvements:
These are just some ideas I had meanwhile doing my tests. What you can do by implementing a Perl plugin is only limited by your imagination and your real needs.
pt-archiver remains an excellent tool for implementing retention policies and archival workflows.
However, DELETE-based purging becomes increasingly expensive at scale, even with proper indexing and chunked processing.
For large time-series or historical datasets, RANGE partitioning is often a dramatically superior strategy.
The challenge is that pt-archiver does not natively leverage partition-level operations.
Fortunately, its Perl plug-in architecture allows advanced users to extend its behavior and implement partition-aware cleanup logic.
By combining:
Organizations can achieve:
For large MySQL deployments, this hybrid approach can turn multi-hour purge operations into near-instant metadata operations.
The use case presented in this article is limited to a specific scenario, but you can reuse it or customize it if you have a different kind of RANGE partitioning, for example, not using TO_DAYS().
Take this as just an example of how you can extend pt-archiver. What you can do for real is driven by your needs and/or only limited by your imagination.
More info about extending pt-archiver:
https://docs.percona.com/percona-toolkit/pt-archiver.html#extending
Resources
RELATED POSTS