Where the open source community meets: Secure your spot for Percona Live Amsterdam! - Register

Downloads

Blog

Faster Point In Time Recovery (PITR) in PostgreSQL Using a Delayed Standby

June 28, 2018

Author

Avinash Vallarapu

Insight for DBAs

PostgreSQL

Share this Post:

PostgreSQL Point in Time Recovery The need to recover a database back to a certain point in time can be a nerve-racking task for DBAs and for businesses. Can this be simplified? Could it be made to work faster? Can we recover to a given point in time with zero loss of transactions/records? Fortunately, the answer to these questions is yes. PostgreSQL Point in Time Recovery (PITR) is an important facility. It offers DBAs the ability to restore a PostgreSQL database simply, quickly and without the loss of transactions or data.

In this post, we’ll help you to understand how this can be achieved, and reduce the potential for pain in the event of panic situations where you need to perform a PITR.

Before proceeding further, let us understand what could force us to perform a PITR.

1. Someone has accidentally dropped or truncated a table.

1. A failed deployment has made changes to the database that are difficult to reverse.

1. You accidentally deleted or modified a lot of data, and as a consequence, you cannot run your applications.

In such scenarios, you would immediately look for the latest full backup and the relevant transaction logs (aka WALs in PostgreSQL) to recover up to a known point in the past, before the error occurred. But what if your backup is corrupt and not valid?

Well, it is very important to perform a backup and recovery validation to ensure that the backups are always recoverable—we will address this in a future post. But, if the backup that you are looking at is corrupt, that can be a nightmare. One such unlucky incident for GitLab, where there was a backup restoration failure, caused a major outage followed by a data loss after recovery.

https://about.gitlab.com/2017/02/01/gitlab-dot-com-database-incident/

Even the best of plans can be hard to realize in practice.

It may be that our backups are intact and recoverable. Can we afford to wait until we copy/download the backup and recover it to another disk or server? What if the database size is several hundreds of GBs or several TBs like GitLab’s?

The solution to the problem is: add another standby that is always delayed by a few hours or a day.

This is one of the great features available in PostgreSQL. If you have migrated from Oracle RDBMS to PostgreSQL, you can think of it as an equivalent to FLASHBACK DATABASE in Oracle. Flashback database helps you to rewind data back in time. However, the technique does not work if you have dropped a data file. In fact, this is the case for both Oracle RDBMS and PostgreSQL PITR. 🙁

https://docs.oracle.com/cd/E11882_01/backup.112/e10642/flashdb.htm#BRADV71000

Adding a Delayed Standby in PostgreSQL

It is important that we use features like streaming replication to achieve high availability in PostgreSQL. Most of the environments have 1 master with 1 or more slaves (standby), either in the same data center or geographically distributed. To save the time needed for PITR, you can add another slave that can always be delayed by a certain amount of time—this could be hours or days.

For example, if I know that my deployment is determined to be successful when no issues are observed in the first 12 hours, then I might delay one of the standbys by 12 hours.

To delay a standby, once you have set up streaming replication between your PostgreSQL master and slave, the following parameter needs to be added to the recovery.conf file of the slave, followed by a restart.

recovery_min_apply_delay = '12h' # or '1min' or 1d'

1	recovery_min_apply_delay = '12h' # or '1min' or 1d'

Now, let’s consider an example where you have inserted 10000 records at 10:27:34 AM and you have accidentally deleted 5000 records at 10:28:43 AM. Let’s say that you have a standby that is delayed by 1 hour. The steps to perform PITR using the delayed standby through until 10:27:34 AM look like this:

Steps to perform PostgreSQL Point in Time Recovery using a delayed standby

Step 1

Stop the slave (delayed standby) immediately, as soon as you have noticed that an accidental change has happened. If you know that the change has been already applied on the slave, then you cannot perform the point in time recovery using this method.

$ pg_ctl -D $PGDATA stop -mf

1	$ pg_ctl -D $PGDATA stop -mf

Step 2

Rename the recovery.conf file in your standby to another name.

$ mv $PGDATA/recovery.conf $PGDATA/recovery.conf.old

1	$ mv $PGDATA/recovery.conf $PGDATA/recovery.conf.old

Step 3

Create a new recovery.conf file with the required parameters for PITR.

# recovery.conf file always exists in the Data Directory of Slave
recovery_target_time = '2018-06-07 10:27:34 EDT'
restore_command = 'sh /var/lib/pgsql/scripts/restore_command_script.sh %p %f'
recovery_target_action = 'pause'
recovery_target_inclusive = 'false'

# recovery.conf file always exists in the Data Directory of Slave

recovery_target_time = '2018-06-07 10:27:34 EDT'

restore_command = 'sh /var/lib/pgsql/scripts/restore_command_script.sh %p %f'

recovery_target_action = 'pause'

recovery_target_inclusive = 'false'

recovery_target_time

Specifies the timestamp up to which you wish to recover your database.

restore_command

Shell command that can be used by PostgreSQL to fetch the required Transaction Logs (WALs) for recovery.
PostgreSQL sends the arguments %p (path to WAL file) and % f (WAL file name) to this shell command. These arguments can be used in the script you use to copy your WALs.

Here is an example script for your reference. This example relies on rsync. The script connects to the backup server to fetch the WALs requested by PostgreSQL. (We’ll cover the procedure to archive these WALs in another blog post soon: this could be a good time to subscribe to the Percona blog mailing list!)

$ cat /var/lib/pgsql/scripts/restore_command_script.sh
#!/bin/bash
# Enable passwordless ssh to Backup Server
# $1 is %p substituted by postgres as the path to WAL File
# $2 is the %f substituted by postgres as the WAL File Name
LOG=/var/lib/pgsql/scripts/restore_command.log
Backup_Server=192.168.0.12
ArchiveDir='/archives'
#
wal=$2
wal_with_path=$1
rsync --no-motd -ave ssh ${Backup_Server}:${ArchiveDir}/${wal} ${wal_with_path} >>$LOG 2>&1
if [ "$?" -ne "0" ]
then
echo "Restore Failed for WAL : $wal" >> $LOG
exit 1
fi
#

$ cat /var/lib/pgsql/scripts/restore_command_script.sh

#!/bin/bash

# Enable passwordless ssh to Backup Server

# $1 is %p substituted by postgres as the path to WAL File

# $2 is the %f substituted by postgres as the WAL File Name

LOG=/var/lib/pgsql/scripts/restore_command.log

Backup_Server=192.168.0.12

ArchiveDir='/archives'

wal=$2

wal_with_path=$1

rsync --no-motd -ave ssh ${Backup_Server}:${ArchiveDir}/${wal} ${wal_with_path} >>$LOG 2>&1

if [ "$?" -ne "0" ]

then

echo "Restore Failed for WAL : $wal" >> $LOG

exit 1

recovery_target_action

This is the action that needs to be performed after recovering the instance up to the recovery_target_time. Setting this to pause would let you modify the recovery_target_time after recovery if you need to. You can then replay the transactions at a slow pace until your desired recovery target is reached. For example, you can recover until 2018-06-07 10:26:34 EDT and then modify recovery_target_time to 2018-06-07 10:27:34 EDT when using pause.

When you know that all the data you are looking for has been recovered, you can issue the following command to stop the recovery process, change the timeline and open the database for writes.

select pg_wal_replay_resume();

1	select pg_wal_replay_resume();

Other possible settings for this parameter are promoted and shutdown. These do not allow you to replay a few more future transactions after the recovery, as you can with a pause.

recovery_target_inclusive

Whether to stop recovery just after the specified recovery_target_time(true) or before(false).

Step 4

Start PostgreSQL using pg_ctl. Now, it should read the parameters in recovery.conf and perform the recovery until the time you set in the recovery_target_time.

$ pg_ctl -D $PGDATA start

1	$ pg_ctl -D $PGDATA start

Step 5

Here is how the log appears. It says that has performed point-in-time-recovery and has reached a consistent state as requested.

2018-06-07 10:43:22.303 EDT [1423] LOG: starting point-in-time recovery to 2018-06-07 10:27:34-04
2018-06-07 10:43:22.607 EDT [1423] LOG: redo starts at 0/40005B8
2018-06-07 10:43:22.607 EDT [1423] LOG: consistent recovery state reached at 0/40156B0
2018-06-07 10:43:22.608 EDT [1421] LOG: database system is ready to accept read only connections
2018-06-07 10:43:22.626 EDT [1423] LOG: recovery stopping before commit of transaction 570, time 2018-06-07 10:28:59.645685-04
2018-06-07 10:43:22.626 EDT [1423] LOG: recovery has paused
2018-06-07 10:43:22.626 EDT [1423] HINT: Execute pg_wal_replay_resume() to continue.

2018-06-07 10:43:22.303 EDT [1423] LOG: starting point-in-time recovery to 2018-06-07 10:27:34-04

2018-06-07 10:43:22.607 EDT [1423] LOG: redo starts at 0/40005B8

2018-06-07 10:43:22.607 EDT [1423] LOG: consistent recovery state reached at 0/40156B0

2018-06-07 10:43:22.608 EDT [1421] LOG: database system is ready to accept read only connections

2018-06-07 10:43:22.626 EDT [1423] LOG: recovery stopping before commit of transaction 570, time 2018-06-07 10:28:59.645685-04

2018-06-07 10:43:22.626 EDT [1423] LOG: recovery has paused

2018-06-07 10:43:22.626 EDT [1423] HINT: Execute pg_wal_replay_resume() to continue.

Step 6

You can now stop recovery and open the database for writes after PITR.

Before executing the next command, you may want to verify that you have got all the desired data by connecting to the database and executing some SQL’s. You can still perform reads before you stop recovery. If you notice that you need another few minutes (or hours) of transactions, then modify the parameter recovery_target_time and go back to step 4. Otherwise, you can stop the recovery by running the following command.

$ psql
select pg_wal_replay_resume();

1 2	$ psql select pg_wal_replay_resume();

Summing up

Using PostgreSQL Point in time Recovery is the most simple of procedures that does not involve any effort in identifying the latest backups, transaction logs, and space or server to restore in a database emergency. These things happen! Also, it could save a lot of time because the replay of WALs is much faster than rebuilding an entire instance using backups, especially when you have a huge database.

Important post script: I tested and recorded these steps using PostgreSQL 10.4. It is possible with PostgreSQL 9.x versions, however, the parameters could change slightly and you should refer to the PostgreSQL documentation for the correct syntax.

With so many open-source database options available, businesses often wonder how they will know which option is most suited for their needs. We examine these various options through the lens of four general categories: Support, Consulting, Managed DBA Services, and Training. Our experts detail the information you need to make an informed choice for your database environment.

When your application goes down, it impacts your business’s bottom line. Fortunately for PostgreSQL databases, Point-In-Time-Recovery (PITR) is available. With PITR, you can undo an accidental change, so you don’t have to worry about losing the data you’ve added to your database since your last full backup. PostgreSQL can also be configured to manage PITR strategies more effectively. To learn more, read our white paper: Efficient Point-in-Time Recovery in PostgreSQL. For insight on how to configure PostgreSQL for high availability, see: PostgreSQL High Availability.

0 0 votes

Article Rating

9 Comments

Oldest

Newest Most Voted

Riyaz Uddin

8 years ago

Excellent Post

Konstantin Kondakov

8 years ago

That’s very important update and safety measure. Must read for all DevOps. Bookmarked..

8 years ago

is there any way to flashback table level? suppose I have dropped one table at 9AM and notices at 9:15. If I do the above steps I can recover till 9AM(befor the drop command).. but I will loose 15 min data right?

Author

avivallarapu

8 years ago

Reply to kc

Yes, this solution works good for recovering an Individual Table. Considering that you have a Delayed Standby(at least delayed by 1 hour) :
1. Recover the delayed Standby until 9:00am(using these steps)
2. Take a Dump of this Table from Delayed Standby and restore it on Master.
3. This way, you get your Table back and you should not loose data.

Much much faster when compared to a full Cluster PITR and lowest possible downtime.

saz

8 years ago

Or just use https://www.pgbarman.org/

Author

avivallarapu

8 years ago

Reply to saz

Absolutely. There are several backup solutions available to help us with PITR.
1. pg_basebackup + Continuous Archiving
2. pgBarman
3. pgBackRest

However, this is a completely different approach that allows us to recover to a certain point in time by just replaying a few WALs and can be considered much faster than all the 3 options mentioned above.

Cervisia

7 years ago

Why the restore_command? Is there any case where the standby server would not already have all the required WAL entires?

Author

avivallarapu

7 years ago

Reply to Cervisia

Sometimes, depending on the requirement, we may choose a larger delay time. For example, let us say, a Standby is delayed by a day or several hours. If you see huge amount of WALs that cannot be stored on the disk, you may have your retention policies on the number of WALs on your Disk. So you rather use restore_command to fetch those WALs.