We recently helped migrate a production PostgreSQL database running on a given DBaaS platform… to another DBaaS platform. When it comes to vendor “lock-in”, some providers are easier (and even friendlier) to deal with than others, but it is never a straightforward process. While in a traditional environment we would usually approach this problem by creating a replica of the server and switching over database requests at the right moment, the situation becomes more complicated when it is not possible to create a direct replication flow outside the provider’s platform.
In this post, we’ll share the approach we used for this particular case. It is not intended to provide a step-by-step, one-size-fits-all solution, but we hope it might serve you as “inspiration” when dealing with a similar situation. You may find some parts of our approach more useful than others; consider the ones that suit your unique needs and be creative. As always, regardless of the plan you design, make sure you test all involved steps in a testing/staging environment first, as in practice we often stumble over yet another stone we didn’t know was lying in our path.
Intermediating DBaaS Instances
When we cannot have direct replication between DBaaS instances, we can still attempt to “chain it” somehow through an intermediate player (read: server):
source DBaaS → Intermediate server → target DBaaS
In this particular case, the source DBaaS was producing daily base backups and streaming WAL segments to externally accessible storage, an S3 bucket, for archiving purposes. We took advantage of this to create an intermediate PostgreSQL server from the latest base backup available and configured it to consume the WAL segments available on this storage. Be mindful of where to install this intermediate server: you may want to install it either closer to the source DBaaS or closer to the target DBaaS instead. When making your choice, consider network bandwidth as well as the availability of common tools by the infrastructure provider.
However, this intermediate server was not an exact PostgreSQL replica: even though it had to operate with hot_standby enabled it would only be as up-to-date as the latest WAL file it had available to consume. And since it had to operate as a standby server we couldn’t have the target DBaaS replicate directly from it using logical replication; we would have to wait and only set it as a replica once the intermediate PostgreSQL server operating in standby mode had consumed all WAL files and was promoted as primary. To minimize application downtime, the trick here was to perform the migration in two steps:
- Bring the intermediate server as in sync with the source DBaaS as possible, stop all connections to the source DBaaS, allow the intermediate server to consume the latest WAL segments generated, promote it as primary, and finally direct database requests to it.
- Configure the target DBaaS as a replica of the intermediate-now-turned-primary server using logical replication, wait until this new replica has caught up, promote the new replica as primary and failover production traffic to it.
We discuss each of these steps in more detail in the next sections.
Creating the Intermediate PostgreSQL Server
We started the process by installing a Percona Distribution for PostgreSQL instance in a separate server that could access both the S3 bucket and the target DBaaS instance. For this, we configured the server with the Percona repository and followed the installation instructions. We made sure to use the same major PostgreSQL release running on the source DBaaS instance and to have PostgreSQL configured accordingly, with custom operational settings (such as work_mem) replicated from the source DBaaS instance’s configurations while adjusting others (such as shared_buffers) to our intermediate server specs. Even though this server would be theoretically holding production traffic for only a brief amount of time you should make it at least as capable as the current production instance. Three other settings you should make sure to adjust correctly are:
- hot_standby = 'on'
- wal_level = 'logical'
- listen_addresses set to either 0.0.0.0 or the IP address corresponding to the network interface that can reach the target DBaaS instance.
You should also make sure that any related firewalls allow communication between the parts, including pg_hba.conf files; the intermediate server must be able to access the target DBaaS instance and vice-versa.
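As an illustration, a pg_hba.conf entry on the intermediate server allowing the subscriber on the target DBaaS to connect to the replicated database might look like the one below. The database name, user, and CIDR range are placeholders; adjust them to your environment:

```
# Allow the target DBaaS instance to connect to the percona database
# as the postgres user; the CIDR below is a placeholder.
host    percona    postgres    203.0.113.0/24    md5
```

Note that logical replication subscribers connect to a regular database, so a plain `host` entry like this one is what is needed (the special `replication` keyword applies to physical replication connections).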
Populating the Intermediate Server
We took advantage of having daily binary backups available in the S3 bucket and restored the latest one on the intermediate instance. For this, we used wal-e (though the more modern wal-g should work as well). Their GitHub page contains simple installation instructions using pip. We ended up using snap instead, which installed the binary under /snap/bin/wal-e.
In order to safely store the S3 credentials, we put them under /etc/wal-e.d/env and used envdir (from the daemontools package) to pass them along to wal-e. To validate that we were able to access the storage we tried listing the available backups with the following command:
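For reference, an envdir directory holds one file per environment variable: the file name is the variable name and the file contents are its value. A minimal layout for wal-e with S3 might look like this (bucket name, region, and credentials are placeholders):

```
/etc/wal-e.d/env/
├── AWS_ACCESS_KEY_ID        # contains the access key
├── AWS_SECRET_ACCESS_KEY    # contains the secret key
├── AWS_REGION               # e.g. us-east-1
└── WALE_S3_PREFIX           # e.g. s3://my-bucket/pg-backups
```

Keep the directory readable only by the user running wal-e, since it holds credentials in plain text.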
$ envdir /etc/wal-e.d/env /snap/bin/wal-e backup-list
We then proceeded with the first step on populating the intermediate server, which was to restore the latest available backup. Make sure PostgreSQL is stopped, make a copy of the configuration files (if necessary) and clean up the data directory (in this example, /var/lib/postgresql/11/main) before you proceed with the fetching of the backup:
$ envdir /etc/wal-e.d/env /snap/bin/wal-e backup-fetch --blind-restore /var/lib/postgresql/11/main/ LATEST
If the backup includes configuration files inside the data directory, you may need to adjust those. Pay special attention to the recovery.conf file: it should indicate that the server will run in standby mode and how it can access WAL files to catch up with the primary, which in our case are also available in the S3 bucket. Here’s what our recovery.conf file looked like:
standby_mode = on
restore_command = 'envdir /etc/wal-e.d/env /snap/bin/wal-e wal-fetch %f %p'
With that file in place, you can now start PostgreSQL on the intermediate server again: assuming all goes well, it should run and replay all available WAL files until it has consumed them all, and then stop again. Check the PostgreSQL error log / syslog to verify that was the case.
Switching Over Traffic to the Intermediate Server
When the archive_timeout variable is set to a value greater than zero on the source DBaaS server, it may trigger the creation of a new WAL file every archive_timeout seconds when the server is under low (write) activity and not naturally generating WAL segments during this period. This is a good hint of the rate at which WAL segments are produced, and thus of how quickly they can be consumed by a standby server such as our intermediate PostgreSQL instance. When planning a maintenance window to switch traffic over to the intermediate server, try to have it as closely in sync with the primary as possible, which you can accomplish by keeping the recovery.conf file in place and starting PostgreSQL again.
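While the intermediate server is up in standby mode and replaying WAL, you can gauge how far behind it is from a psql session (these functions are available from PostgreSQL 10 onwards):

```sql
-- Last WAL position replayed and the commit timestamp of the last
-- replayed transaction; compare the timestamp with the current time
-- to estimate how far behind the standby is.
select pg_last_wal_replay_lsn(), pg_last_xact_replay_timestamp();
```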
Once the time to switch over arrives, terminate all existing connections to the primary, make sure no new ones are created, and observe the server activity for a few seconds. When in doubt, you may create some artificial load to force the server to produce new WAL segment(s), such as with:
create table xlog_switch as select '0123456789ABCDE' as col1 from generate_series(1,1000000);
drop table xlog_switch;
Start PostgreSQL one more time on the intermediate server with the recovery.conf file still in place to consume the latest WAL segments. When it stops, delete or rename the recovery.conf file and start it one last time. Make sure everything looks alright, then promote it as primary:
$ pg_ctl -D /var/lib/postgresql/11/main promote
You can validate that the promotion worked by checking that the server is no longer in recovery mode:
select pg_is_in_recovery();
The query should return f (false).
You can now redirect your database requests to the intermediate server.
Setting up the new primary as a publisher
Logical replication in PostgreSQL uses a publish and subscribe model, so you must create a publication covering all the tables in the database(s) you want replicated to the target DBaaS. This process may be as simple as connecting to the target database (percona in our example here) and creating a publication (which we will call percpub):
CREATE PUBLICATION percpub FOR ALL TABLES;
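You can double-check which tables the publication covers by querying the pg_publication_tables view:

```sql
-- Lists every table included in the percpub publication
select * from pg_publication_tables where pubname = 'percpub';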
An important point regarding logical replication performance: there must be a way to uniquely identify each tuple to which changes are being applied, ideally through a Primary Key or a Unique Key (NOT NULL). When a target table lacks both of these, REPLICA IDENTITY will have to be set to FULL and the entire tuple composition will be used to identify it, which makes the comparison process more expensive, incurring higher overhead.
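As a starting point for spotting problematic tables, the sketch below lists tables in the public schema that have neither a primary key nor a unique index (it does not verify that the unique index columns are NOT NULL, so review the results manually); some_table is a placeholder:

```sql
-- Tables without a primary key or any unique index
select c.relname
from pg_class c
join pg_namespace n on n.oid = c.relnamespace
where c.relkind = 'r'
  and n.nspname = 'public'
  and not exists (
    select 1 from pg_index i
    where i.indrelid = c.oid
      and (i.indisprimary or i.indisunique)
  );

-- For such a table, fall back to full-row identity (higher overhead):
alter table some_table replica identity full;
```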
Configuring the Target DBaaS as a Replica
First, make sure that this instance is also configured with wal_level = 'logical' or you won’t see any replication taking place; this requirement is not as evident as one would expect. How you change this setting differs according to the provider involved.
Besides tuning other PostgreSQL settings of the target DBaaS accordingly, create the target database and a user that can access it. Once that is done, you’ll be able to build the schema and schema objects of the target database by taking a pg_dump with the -s option (“dump only the schema, no data“) from the intermediate server and restoring it on the target DBaaS instance. This can be accomplished with a single command that encapsulates both actions, which should be run from the intermediate server:
$ pg_dump -h localhost -U postgres -d percona -p 5432 -Fc -s | pg_restore -d percona -h your.target.dbaas.com -p 5432 -U postgres
Make sure to adjust the values for database, hostname, and user both in the command above and in the one that follows below.
Once the schema is completed you can create the subscription:
CREATE SUBSCRIPTION percsub CONNECTION 'host=intermediate.server.com dbname=percona user=postgres password=secret port=5432' PUBLICATION percpub;
This should do the trick: the target DBaaS instance will start the initial data copy and then sync all changes from the intermediate server. You can follow its progress from the intermediate server by querying:
select * from pg_stat_replication;
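On the subscriber side, you can also check the subscription status from the target DBaaS itself:

```sql
-- Received vs. applied LSN positions for the subscription on the target
select subname, received_lsn, latest_end_lsn, latest_end_time
from pg_stat_subscription;
```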
A couple of important considerations:
- Beware that it may be difficult to create a logical replication slot on a busy, high-transaction server. You may want to do it immediately after promoting the intermediate server as master and before the application starts connecting to it. This cannot be done earlier because logical decoding is currently not possible on a standby server.
- Logical replication does not replicate DDLs; make sure you freeze any changes in the database schema until you’ve completed the migration.
Once the target DBaaS has caught up with the primary you can plan a final switchover by promoting this replica as a master and redirecting production traffic to it.
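Once production traffic has moved, remember to tear down the subscription on the target so it stops pulling from the intermediate server; for example:

```sql
-- Stops replication and removes the replication slot on the publisher;
-- this assumes the intermediate server is still reachable at this point.
drop subscription percsub;
```

If the intermediate server has already been decommissioned, you would instead need to detach the slot first with ALTER SUBSCRIPTION … SET (slot_name = NONE) before dropping the subscription.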
And that’s it: we were able to complete the migration following the above steps. We hope you will find them useful as well; you may let us know in the comments section below.