WAL retention is a very important topic for PostgreSQL database management. But very often we come across DBAs getting into surprise situations such as:
1. Several TBs of WALs piled up in archive destination
2. WALs filling up pg_wal/pg_xlog directory due to failing archive
3. Necessary WALs are no longer preserved
External backup projects for PostgreSQL are good in addressing retention policies. But there is a simple program named pg_archivecleanup which comes along with PostgreSQL binaries which might be useful in both maintaining the WAL retention as well as handling an emergency situation. pg_archivecleanup is probably the least talked about utility among standard PostgreSQL binaries. It is extremely simple and useful, and it can work in 2 scenarios:
- Can be used as a standalone program to clean up old WAL files from any file system location.
- It can also be used in Standby side for cleaning up WAL files which are no longer required for Standby.
This program can operate in dryrun mode (-n option) or delete mode (-d option).
This feature is useful to list all WALs older than a specific WAL. In the following demonstration, I am listing all WALs which are still in the default wal location (pg_wal) inside my data directory and older than a specific WAL:
$ ./pg_archivecleanup -n ~/bigsql/data/pg11/pg_wal 00000001000000000000001E
Delete Mode (-d)
In this mode, pg_archivecleanup does the cleanup by removing all the candidate WALs:
$ pg_archivecleanup -d /home/postgres/archive 00000002000000000000006B
pg_archivecleanup: keeping WAL file "/home/postgres/archive/00000002000000000000006B" and later
pg_archivecleanup: removing file "/home/postgres/archive/000000020000000000000069"
pg_archivecleanup: removing file "/home/postgres/archive/00000002000000000000006A"
Understanding WAL retention criteria
In order to do a clean up of WALs, we need to keep in mind the WAL retention requirements and criteria, as losing essential WALs can be disastrous.
Criteria 1: All WALs from the time of the oldest backup, which still falls in the backup retention policy and Point-in-time requirement, need to be preserved.
For example, a user may ask for restore like: A table X was accidentally deleted as part of last change rolled out on Sunday night 12. Restore the database to another location and recover the table as of latest time possible. In such a case, the DBA needs to get an old backup and associated WAL files.
Criteria 2: All WALs from the start position of a backup until the end of the backup is required to consistently restore any backup. Many times, backup tools make a copy of WAL and include it as part of Backup. Without it, the backups will be invalid. If the backup tool/script is not taking care of WALs, the responsibility falls on us as a DBA.
Criteria 3: All WALs from the restart point of Standby is required by the Standby database. A restart point is a point from where standby can restart the recovery operation. This is similar to Checkpoint on the Primary side.
Criteria 4: All the WALs from the last checkpoint is required by Primary for any crash recovery.
pg_archivecleanup doesn’t have all the intelligence built-in, but it can be utilized inside custom scripts as an intelligent way to achieve the goals of WAL retention. Now think about how to implement all these criteria for WAL retention in archive location.
Criteria No.1 is simple to implement. Any file older than the backup retention period is a candidate for deletion.
Criteria No. 2 will be a little more complex. But luckily PostgreSQL helps us by generating a backup history file in the archive location with a .backup extension. For example:
Any WALs older than this is not required for backup. This backup history file can be directly specified pg_archivelogcleanup as follows:
$ pg_archivecleanup -d /home/postgres/archive 000000020000000000000069.000003E0.backup
pg_archivecleanup: keeping WAL file "/home/postgres/archive/000000020000000000000069" and later
pg_archivecleanup: removing file "/home/postgres/archive/00000002000000000000005D"
pg_archivecleanup: removing file "/home/postgres/archive/00000002000000000000005E"
Criteria No. 3 needs to be taken care of by Standby because only Standby knows what the current restart point is. Again, a built-in feature in PostgreSQL helps us for calling a cleanup shell script. A shell script can be specified for parameter archive_cleanup_command in recovery.conf. This feature can be used for calling pg_archivecleanup like:
archive_cleanup_command = 'pg_archivecleanup archivelocation %r'
However, things will get complicated if there is more than one standby server referring to the same archive location. One standby might be lagging behind and wants a WAL which is already cleaned by another standby. In such cases, we may have to use custom scripts/program which embeds pg_archivecleanup.
Criteria No. 4 requires information from the control file about the WAL file location of the checkpoint.
$ pg_controldata -D /var/lib/pgsql/11/data/
pg_control version number: 1100
Catalog version number: 201809051
Database system identifier: 6676014631308187830
Database cluster state: in production
pg_control last modified: Friday 05 July 2019 12:59:32 PM UTC
Latest checkpoint location: 3/430324A0
Latest checkpoint's REDO location: 3/37752C70
Latest checkpoint's REDO WAL file: 000000020000000300000037
Latest checkpoint's TimeLineID: 2
Latest checkpoint's PrevTimeLineID: 2
All these different criteria create a set of requirements which can be addressed by custom scripts or are already addressed by different backup solutions for PostgreSQL. If we want to implement WAL retention policies with its own logic, we may end up executing pg_archivecleanup from multiple wrapper scripts. We may also think about adding more capabilities to pg_archivecleanup for addressing custom requirements.
**pg_archivecleanup can also take .partial file which gets generated as part of standby promotion as a reference WAL.
pg_archivecleanup code walkthrough for the novice user
Luckily this is a very small, standalone, single file, (no need of postgresql server), easy to read source code with less than 400 lines. It is designed for user customization, including extensive comments. The pg_archivecleanup.c file can be located in the /src/bin/pg_archivecleanup directory. Since it is a standalone program, users with minimal C knowledge can easily customize the program. So this writeup is about the walkthrough of the code and its normal functionality.
A couple of important variables which this program uses are:
Initialize(void): As the name indicates, this function will be called only once as part of the initialization of the program. It checks whether the directory location provided in the command line (value of archiveLocation ) is a directory. If not, it will error out.
TrimExtension(char *filename, char *extension): This function will be called for every file in the directory for the removal of extension. The filename will be trimmed for removing the extension.
CleanupPriorWALFiles(void): This function opens the directory and iterates over all the files in that directory. exclusiveCleanupFileName is used as a reference for comparison. Any filename smaller than this file will be considered for deletion.
Now the question remains is how to get the exclusiveCleanupFileName, which is the reference for comparison.
SetWALFileNameForCleanup(void): Exactly addresses the above requirement; setting the exclusiveCleanupFileName. The user might be giving a WAL segment file without extensions – sometimes .partial WAL segment file, sometimes .backup file. In this function, the timeline number, logical wal/log number, and wal segment number are extracted from the file name specified and reconstitutes to form a valid WAL segment file using XLogFileNameById. m
Overall flow is : Initialize, SetWALFileNameForCleanup, and then CleanupPriorWALFiles.
Enjoy the freedom offered by PostgreSQL to modify and customize the program as a user. Happy Coding!