The Evolution of Object Storage
Let’s start by stating something obvious: object storage has become the preeminent storage system in the world today. Initially created to satisfy the need to store large amounts of infrequently accessed data, it has since grown into the dominant archival medium for unstructured content. It is so important that many traditional data systems are being eclipsed by API-driven object storage in cloud-native environments.
As a result, an unintended development is that people have come to appreciate the advantages of not only storing but, more importantly, processing the data kept in object storage containers. And while this can seem extremely advantageous, extracting value becomes more complex, laborious, and expensive as the dataset grows. This hodgepodge of data is known as a data swamp. The risk is that the swamp becomes so disorganized that the cost of processing it exceeds any possible economic benefit. In a word, it becomes unusable.
What to do? The solution? Data Lakes!
In contrast to the data swamp, which is a disorganized repository of isolated data sources, the Data Lake is a centralized storage repository capable of archiving extremely large datasets and handling both structured and unstructured data coming from various database systems.
While the Data Lake is a big step forward, it nevertheless requires extensive planning, not only for storing the data but, more importantly, for accessing and processing it.
Enter the Data Lakehouse: the data lakehouse leverages the low-cost, flexible storage of a data lake while striving to provide the data management capabilities and features of a traditional data warehouse. By adding a metadata/transactional layer on top of the object storage container, the Data Lakehouse not only approaches the ACID compliance of an RDBMS but also has the added bonus of permitting multiple types of database engines to interact with it.
Working With PostgreSQL And pg_lake
What do object storage containers have to do with PostgreSQL?
I’m glad you asked!
As you may know, the PostgreSQL extension framework permits one to add new features, data types, functions, operators, etc. to PostgreSQL as self-contained modules. And in the realm of analytics, there is one extension in particular that stands out for generating highly prized insights from very large datasets: pg_lake.
Leveraging the pg_lake extension makes it possible to manipulate raw data from any source, whether structured or unstructured, by transparently exposing it so that it can be read and processed inside a PostgreSQL database, as in the sketch below.
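To make that concrete, here is a minimal sketch of exposing a Parquet file sitting in an S3-compatible bucket to PostgreSQL. The database name, bucket, file, column, and foreign server/option names are assumptions for illustration; check the pg_lake README for the exact DDL supported by your build.
# ASSUMPTIONS: a database "demo" with the pg_lake extensions installed, an
# S3-compatible bucket "demo-bucket", and the foreign server created by
# pg_lake; verify the exact server and option names in the pg_lake docs.
psql -d demo <<'SQL'
-- expose a Parquet file as a foreign table (pg_lake can infer the schema)
CREATE FOREIGN TABLE taxi_trips ()
  SERVER pg_lake
  OPTIONS (path 's3://demo-bucket/raw/taxi_trips.parquet');

-- then run ordinary SQL against it
SELECT count(*) FROM taxi_trips;
SQL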
The challenge is what to do with the resulting insights, which in themselves become a very valuable commodity. While many cloud providers do state that they implement encryption at rest, that doesn’t change the fact that the data’s ultimate security is out of your hands, and that is a leap of faith you may not want to take. The risk of data corruption, let alone theft, can affect the company’s bottom line.
Working With pg_tde
Now let’s pivot and talk about security.
These insights, once generated, can be stored in PostgreSQL and secured in an encrypted-at-rest state by virtue of pg_tde, as in the sketch below.
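Here is a minimal sketch of storing those insights in an encrypted table. The provider, key, table, and path names are assumptions, and the key-management function names have changed between pg_tde releases, so verify them against the documentation for your version.
# Prerequisite in postgresql.conf: shared_preload_libraries = 'pg_tde'
# All names below (database, provider, key, paths) are illustrative only.
psql -d demo <<'SQL'
CREATE EXTENSION IF NOT EXISTS pg_tde;

-- register a file-based key provider and set the database encryption key
-- (function names vary between pg_tde releases; check your version's docs)
SELECT pg_tde_add_database_key_provider_file('local-keyring', '/var/lib/pg_tde/demo.keyring');
SELECT pg_tde_set_key_using_database_key_provider('demo-key', 'local-keyring');

-- store the generated insights in a table that is encrypted at rest
CREATE TABLE insights (id bigint, payload jsonb) USING tde_heap;
SQL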
Wouldn’t it be nice if we could store the data encrypted by PostgreSQL directly in object storage?
What to do?
Summarizing:
- pg_tde: an open-source PostgreSQL extension that provides Transparent Data Encryption (TDE), securing sensitive data at rest by automatically encrypting database files on disk and protecting against unauthorized access to the underlying storage.
- pg_lake: a set of extensions that allows PostgreSQL to integrate directly with data lakehouses. It enables users to access and query data stored in open formats such as Apache Iceberg, Parquet, CSV, and JSON in cloud-based storage.
An Overview Of Features And Limitations
What you can do:
- pg_tde:
- Transparent Encryption: Encrypts data on disk (at rest) at the storage level
- Granular Control: Encrypt specific tables
- Customizable Key Management
- WAL Encryption can be performed
- Replication Support: logical replication can be used
- Multi-Tenancy
- The following WAL-related CLI tools are currently supported:
- pg_tde_basebackup
- pg_tde_resetwal
- pg_tde_rewind
- pg_tde_waldump
- pgBackRest (partial support; refer to section “what you can’t do” below)
- pg_lake:
- Can perform DML operations against “lakehouse”-structured data (see the sketch after this list)
- Can read data from a data lake
- Can read multiple file formats from the data lake, e.g., Apache Iceberg, Delta Lake, and Parquet
- Enables performing analytics on extremely large datasets by combining external Iceberg and file-backed tables in a single query
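As referenced above, here is a hypothetical sketch of DML against an Iceberg-backed table combined with the file-backed foreign table from the earlier example. The table, column, and bucket names are assumptions, and the sketch assumes a default object-storage location has been configured for pg_lake; consult the pg_lake README for the exact syntax.
# All object names below are illustrative; adjust to your own setup.
psql -d demo <<'SQL'
-- an Iceberg table managed by pg_lake (its data lives in object storage)
CREATE TABLE trip_summary (region text, trips bigint) USING iceberg;

-- ordinary DML against the lakehouse table
INSERT INTO trip_summary
SELECT region, count(*) FROM taxi_trips GROUP BY region;

UPDATE trip_summary SET trips = trips + 1 WHERE region = 'north';

-- combine the Iceberg table and the file-backed table in a single query
SELECT s.region, s.trips, count(t.*) AS raw_rows
FROM trip_summary s
JOIN taxi_trips t ON t.region = s.region
GROUP BY s.region, s.trips;
SQL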
What you can’t do:
- pg_tde:
- System tables and catalogs are currently not encrypted
- Temporary files are not fully encrypted
- The following WAL-related CLI and backup tools are currently unsupported or only partially supported:
- pg_createsubscriber
- pg_receivewal
- Barman
- pg_verifybackup fails by default; use -s (--skip-checksums) and -n (--no-parse-wal)
- pgBackRest: the asynchronous archiving feature is not supported
- Cannot read/write data directly from cloud storage services
- Cannot encrypt data other than PostgreSQL relations stored in its own data directory
- pg_lake:
- Requires a separate CLI process that interfaces with DuckDB, which in turn interacts with the target object storage service (for example, S3)
- Fine-Grained Access Control is not currently available
- Iceberg v2 protocol support is partial:
- not all data types are fully supported
- not all administrative/housekeeping operations on the data lake are fully supported
- Transparent Data Encryption is not possible, except via the object storage service’s own encryption features, where available
Conclusion: We Need Your Input
While there’s a lot of interest in developing this dynamic duo, the challenge is figuring out where to start.
For example, an easy kludge is to archive encrypted WAL segments directly to S3-compatible services by leveraging the PostgreSQL runtime parameter “archive_command” together with an appropriate CLI, as sketched below.
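A hedged sketch of that kludge, using the MinIO client (mc) as the S3 CLI: the alias, bucket, and endpoint below are assumptions for a local test bed, and any S3-capable CLI would do. With pg_tde WAL encryption enabled, the segments shipped are the encrypted files exactly as they sit on disk.
# postgresql.conf (restart/reload as appropriate):
#   archive_mode    = on
#   archive_command = 'mc cp %p local/wal-archive/%f'
# One-time setup of the S3 target; alias and bucket names are arbitrary choices.
mc alias set local http://127.0.0.1:9000 minioadmin minioadmin
mc mb --ignore-existing local/wal-archive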
Alternatively, as a first step, enable data that is encrypted at rest to be both written and read by pg_lake.
But as a final solution, we could look at a foreign data wrapper implementation of pg_tde. This FDW could perform operations similar to pg_lake, but with the added capability of client-side encryption, writing and reading pg_tde-encrypted data on compatible cloud storage services.
Tell us what you think.
Building Your Own Test Bed Environment
For those wanting to try out these extensions, here’s how you can install everything into your own personal development environment on Linux. You shouldn’t need more than 50 GB of disk space to download, compile, and install everything.
Download and compile the following:
- pg_tde: The Percona PostgreSQL source code, which has pg_tde built in
- pg_lake: The source code build consumes a lot of CPU and disk space
- DuckDB: A fast, open-source, in-process analytical database management system (DBMS)
- MinIO: An open-source, S3-compatible object storage system designed for cloud-native environments
While the aforementioned links do provide sources and instructions, you’ll need to note a few “extra” details that are not immediately obvious.
Regarding downloading Percona pg_tde: use these commands to download the source:
git clone https://github.com/percona/postgres.git
cd postgres
git submodule update --init
git checkout release-17.5.3
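From there, the build follows the usual PostgreSQL source-build pattern. The install prefix below is an arbitrary assumption, and additional development packages (OpenSSL headers, and possibly libcurl for pg_tde’s key providers) may be required on your distribution.
# Standard PostgreSQL-style build of the Percona fork; the prefix is an assumption.
./configure --prefix=$HOME/pg17-tde --with-ssl=openssl
make -j"$(nproc)"
make install
# pg_tde ships inside the tree (hence the submodule step above); build it too.
# contrib/pg_tde is assumed here; its location may differ between releases.
make -C contrib/pg_tde install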
Regarding pg_lake: there’s a mistake in the source code; use the following to fix it:
# EDIT FILE (~ line 73): $HOME/pg_lake/pg_lake_table/src/planner/query_pushdown.c
#   edit from: #include "server/rewrite/rewriteManip.h"
#   edit to  : #include "rewrite/rewriteManip.h"
sed -i -e 's|server/rewrite/rewriteManip.h|rewrite/rewriteManip.h|g' $HOME/pg_lake/pg_lake_table/src/planner/query_pushdown.c
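Finally, once MinIO is installed, a throwaway S3-compatible endpoint for experimenting with pg_lake can be stood up in a few commands. This is a minimal sketch for a local sandbox only: the data directory and bucket name are arbitrary, and minioadmin/minioadmin are MinIO’s well-known default credentials, never to be used beyond a scratch environment.
# Local, throwaway S3-compatible endpoint for testing (NOT for production).
minio server $HOME/minio-data --console-address :9001 &
mc alias set local http://127.0.0.1:9000 minioadmin minioadmin
mc mb --ignore-existing local/demo-bucket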