Transaction ID Wraparound occurs when the VACUUM process cannot keep up with database activity and the PostgreSQL service is forced to shut down.
In more technical parlance: Transaction ID Wraparound occurs when the semantics of Multi-Version Concurrency Control (MVCC) fail and when the number of unique transaction ids reaches its maximum which numbers about two billion.
What leads up to this situation is when the VACUUM process managed by either the autovacuum workers or user-interaction (manual) does not keep up with the DML operations.
Transaction ID Wraparound can be caused by a combination of one or more of the following circumstances:
Transaction WRAPAROUND can cause a spontaneous shutdown of the Postgres database server in order to protect the integrity of the data.
PostgreSQL at any one time has a number of transactions that are tracked by a unique ID. Every so often that number reaches the upper limit that can be registered, for example, 200 million transactions which is the default and is then renumbered. But if the number of unique transaction IDs goes to its maximum transactions limit, known as TXID Wraparound, Postgres will force a shutdown in order to protect the data.
Here’s how it works:
In the case of the autovacuum daemon falling behind across the entire data cluster, review your monitoring solution in order to identify the trend of these metrics:
Mitigation steps include:
Here are example error messages that you can find in the Postgres logs when threatened by a shutdown due to WRAPAROUND:
|
1 |
#<br># When less than 10 million transactions remain before shutdown<br>#<br>WARNING: database "mydb" must be vacuumed within 177009986 transactions<br>HINT: To avoid a database shutdown, execute a database-wide VACUUM in "mydb". |
Here’s a set of queries that will help you determine if WRAPAROUND is a risk:
|
1 |
--<br> -- Database query for transaction age per database<br> -- and as a percentage of maximum permitted transactions<br> --<br>SELECT datname,<br> age(datfrozenxid),<br> (age(datfrozenxid)::numeric/1000000000*100)::numeric(4,2) as "% WRAPAROUND RISK"<br>FROM pg_database ORDER BY 2 DESC; |
|
1 |
--<br>-- Database query for transaction age per table<br>--<br>SELECT <br>c.oid::regclass as table_name,<br>greatest(age(c.relfrozenxid),age(t.relfrozenxid)) as "TXID age",<br>(greatest(age(c.relfrozenxid),age(t.relfrozenxid))::numeric/1000000000*100)::numeric(4,2) as "% WRAPAROUND RISK"<br>FROM pg_class c<br>LEFT JOIN pg_class t ON c.reltoastrelid = t.oid<br>WHERE c.relkind IN ('r', 'm')<br>ORDER BY 2 DESC; |
|
1 |
--<br>-- Other vacuum runtime parameters of interest<br>-- returning TXID age<br>--<br>SELECT name, setting<br>FROM pg_settings<br>WHERE name ~ 'vacuum'<br>AND name ~'_age$'<br>ORDER BY 1 ASC; |
First and foremost, make certain all tables are regularly vacuumed. A correctly configured autovacuum process takes care of this without it ever becoming an issue. Otherwise, you will need to consider a manual VACUUM strategy.
The following are merely suggestions since each situation is highly subjective.
If you have the time, run the following invocation of vacuumdb. The value for option ‘-j’ can vary from a couple to a value equal to the number of CPUs on the host. The option ‘-a’ will process each database in alphabetical order.
|
1 |
vacuumdb -F -a -z -j 10 -v |
Consider a bash script targeting individual databases if you see one is more urgent than another:
|
1 |
vacuumdb -z -j 10 -v <mydatabase> |
The following is the set of actions to take when TRANSACTION WRAPAROUND is imminent. Remember, you are in a race against time.
You must vacuum the entire data cluster before the remaining available transaction id drops to 1 million transactions.
Here’s a pair of example scripts that you can use as a starting point when developing your own mitigation protocol.
The secret sauce is xargs which enables one to utilize as many CPUs as reasonably possible. The following pair of bash scripts invoke vacuumdb against a series of tables. Of course, there’s more than one way to do this.
Script one generates a list of tables in a selected database and calls script two which executes the VACUUM on each of those tables individually.
SCRIPT ONE (go1_highspeed_vacuum.sh)
|
1 |
#!/bin/bash<br>#<br># INVOCATION<br># EX: ./go1_highspeed_vacuum.sh<br>#<br><br>########################################################<br># EDIT AS REQUIRED<br>export CPU=4<br>export PAGER=less PGUSER=postgres PGPASSWORD=mypassword PGDATABASE=db01 PGOPTIONS='-c statement_timeout=0'<br>########################################################<br><br>SQL1="<br>with a as (select c.oid::regclass as table_name,<br> greatest(age(c.relfrozenxid),age(t.relfrozenxid))<br> from pg_class c<br> left join pg_class t on c.reltoastrelid = t.oid<br> where c.relkind in ('r', 'm')<br> order by 2 desc)<br>select table_name from a<br>"<br><br><br>LIST="$(echo "$SQL1" | psql -t)"<br><br># the 'P' sets the number of CPU to use simultaneously<br>xargs -t -n 1 -P $CPU ./go2_highspeed_vacuum.sh $PGDATABASE<<<$LIST<br><br>echo "$(date): DONE" |
SCRIPT TWO (go2_highspeed_vacuum.sh)
|
1 |
#!/bin/bash<br><br>########################################################<br># EDIT AS REQUIRED<br>export PAGER=less PGUSER=postgres PGPASSWORD=mypassword PGOPTIONS='-c statement_timeout=0'<br>export DB=$1<br><br>########################################################<br><br>vacuumdb --verbose ${DB} > ${DB}.log 2>&1 |
One recovers from a forced shutdown due to transaction id wraparound by performing a cluster-wide vacuum in single-user mode:
Log in to the host and as UNIX user “postgres” execute an invocation that is something similar:
|
1 |
# it is understood that environment <br> # variable PGDATA points to the data cluster<br> #<br> postgres --single -D $PGDATA postgres <<< 'vacuum analyze' |
I would suggest scripting the vacuum process because you’ll need to log in to each database to perform the VACUUM.
Generate and edit a list of all the databases:
|
1 |
postgres --single -D $PGDATA postgres <<< 'select datname from pg_database' <br> | grep '"' | cut -d '"' -f 2 > list_db |
Here is an example using the aforementioned list “list_db”:
|
1 |
#<br># it is understood the database list has <br><br># been edited before invoking this code snippet<br>#<br>for u in $(cat list_db)<br>do<br> postgres --single -D $PGDATA $u <<< 'vacuum analyze'<br>done |
TXID Wraparound is one of the scariest scenarios that can occur. Thankfully, this is an extremely rare incident and only occurs when systems are either under extremely heavy load or have been neglected.
Don’t get caught!
Remember: the best DBA is the one that’s never noticed. 🙂
Resources
RELATED POSTS