This blog is the result of my investigation into column-wise tables. Hopefully, you will find it useful in your deliberations.
When I started writing, this was meant to be a “Good News” blurb. I was optimistic that I’d truly have impressive numbers to share. As it turns out, while there is potential for a significant advantage, you should carry out a detailed testing regime of the CITUS columnar store extension before implementing it in your own environment.
Sizzle: A phrase used to show affirmation or approval. It references the sound one hears when cooking certain delicious foods, e.g., bacon or fajitas, and transfers those feelings to non-edible objects or ideas.
There’s a lot to take in, and I’m impressed with every new version of PostgreSQL that is released. Nevertheless, there’s a long-standing feature that, to be honest, I’ve always felt was missing. This weakness is being addressed by the industry and, ironically, some of those implementations are in fact Postgres derivatives.
What is this sizzling feature that I speak of you ask?
Column-wise tables!
Consider using the CITUS Columnar Extension under one or more of the following conditions:
CAVEAT: You cannot perform UPDATE, DELETE operations on a columnar table.
Let’s get back to basics. In the database world there are essentially two types of database operations: OLTP (Online Transaction Processing) and OLAP (Online Analytical Processing).
As an RDBMS, PostgreSQL is geared more toward OLTP than OLAP operations. It does well at manipulating data: inserting, updating, deleting, and so on. But while eminently capable of performing OLAP, it’s not quite as efficient at it. The primary reason is a characteristic common to most RDBMSes: they store and process collections of data as individual records, otherwise known as row-wise processing.
Suppose you are performing a deep-dive analysis involving a table of records containing 50 columns. And further, suppose that your query only needs a single column. It still needs to read all of the data from those 50 columns per row in order to access that single column. And if the query processes 100 million rows, that’s definitely not trivial!
Now let’s consider a reorientation of the table, i.e., column-wise storage. In this case, the query reads only one column, not 50. The result is a much lighter query, requiring less I/O and processing power yet achieving the same result in significantly less time.
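To make the distinction concrete, here is a toy Python sketch (purely illustrative; this is not how Postgres actually lays out pages) contrasting the two orientations. Reading one field row-wise means visiting every 50-column record; column-wise, the values of interest sit together:

```python
# Row-wise: a list of records, each carrying all 50 columns.
rows = [{f"c{i}": n * i for i in range(50)} for n in range(1000)]

# Column-wise: one contiguous list per column.
columns = {f"c{i}": [n * i for n in range(1000)] for i in range(50)}

# Row-wise access: every full record must be touched to extract one field.
row_wise_sum = sum(r["c7"] for r in rows)

# Column-wise access: only the single column's values are read.
column_wise_sum = sum(columns["c7"])

assert row_wise_sum == column_wise_sum
```

The answers are identical; the difference is how much data had to be touched to produce them, which is exactly the I/O savings a columnar store exploits.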
As a General Purpose Database Management System, basic behaviors can be reconfigured in Postgres for different purposes. And as such, it is possible to enhance PostgreSQL with columnar tables by using the CITUS columnar table extension.
Columnar storage is just one part of the larger CITUS extension, which, when fully implemented, creates a scalable, distributed Postgres database system.
The CITUS columnar extension feature set includes:
The complete CITUS feature set, which (except for the columnar storage component) is not covered in this blog, includes:
This is a bit of a bad-news, good-news, excellent-news situation.
BAD-NEWS: At the time of writing this blog, the only publicly available packages on the Postgres repositories that I could find were the cstore foreign data wrappers which cover only version 12 and older versions of Postgres.
GOOD-NEWS: The CITUS extension is available on the Citus Data download page.
EXCELLENT-NEWS: As part of the process of writing this blog, custom DEB and RPM packages were created for PostgreSQL version 15 on CentOS 7 and 8 and Ubuntu 18.04 and 20.04, which are available for download from GitHub HERE.
Assuming you opted for the easy way, installing the packages made for this blog:
```shell
# Ubuntu 18.04
wget https://github.com/rbernierZulu/columnar-tables-extension/blob/main/postgresql-15-citus-columnar_11.1-1-UBUNTU1804_amd64.deb

# Ubuntu 20.04
wget https://github.com/rbernierZulu/columnar-tables-extension/blob/main/postgresql-15-citus-columnar_11.1-1-UBUNTU2004_amd64.deb

# Centos 7
wget https://github.com/rbernierZulu/columnar-tables-extension/blob/main/postgresql-15-citus_columnar-11.1-1-CENTOS7.x86_64.rpm

# Centos 8
wget https://github.com/rbernierZulu/columnar-tables-extension/blob/main/postgresql-15-citus_columnar-11.1-1-CENTOS8.x86_64.rpm
```
```shell
# Ubuntu 18.04
apt update
apt-get install ./postgresql-15-citus-columnar_11.1-1-UBUNTU1804_amd64.deb

# Ubuntu 20.04
apt update
apt-get install ./postgresql-15-citus-columnar_11.1-1-UBUNTU2004_amd64.deb

# Centos 7
# ATTENTION: the epel package must be installed beforehand!
yum install epel-release
yum install ./postgresql-15-citus_columnar-11.1-1-CENTOS7.x86_64.rpm

# Centos 8
dnf -qy module disable postgresql
dnf install ./postgresql-15-citus_columnar-11.1-1-CENTOS8.x86_64.rpm
dnf check-update
```
What follows here are the results of my analysis. Of course, there’s always more that can be said. Hopefully, this will give you enough of an overview of the possibilities of this extension.
Installing the extension into your database is pretty much standard fare:
```sql
-- create your extension in the database:
create database db01;
\c db01
create extension citus_columnar;
```
Here are two tables, of type HEAP and COLUMNAR, that will be used for the initial investigations. You’ll notice that it took less time to create the regular HEAP accessed table than the columnar table:
```sql
drop table if exists t1,t2;

\timing

-- Time: 7628.612 ms (00:07.629)
create table if not exists t1(id,qty)
using heap
as
select (random()*10)::int, (random()*1000)::int from generate_series(1,10e6);

-- Time: 15635.768 ms (00:15.636)
create table if not exists t2(id,qty)
using columnar
as
select (random()*10)::int, (random()*1000)::int from generate_series(1,10e6);
```
```
                Table "public.t1"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 id     | integer |           |          |
 qty    | integer |           |          |
Access method: heap

                Table "public.t2"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 id     | integer |           |          |
 qty    | integer |           |          |
Access method: columnar
```
The number of records is roughly 10 million:
```sql
with a(row_wise) as (select count(*) from t1),
     b(column_wise) as (select count(*) from t2)
select row_wise, column_wise from a,b;

 row_wise | column_wise
----------+-------------
  9900001 |     9900001
```
This is very cool, the columnar compression does a great job of shrinking the table:
```
\dt+ t[12]
                        List of relations
 Schema | Name | Type  |  Owner   | Persistence | Access method |  Size
--------+------+-------+----------+-------------+---------------+--------
 public | t1   | table | postgres | permanent   | heap          | 346 MB
 public | t2   | table | postgres | permanent   | columnar      | 27 MB
```
Let’s begin by comparing basic administration and SELECT statements of a HEAP vs COLUMNAR table.
Examining the tabulated results, you’ll see that columnar tables at best perform similarly to a HEAP table, but most of the time they take longer to execute the same operations.
Using a psql session, the following SQL statements are executed and examined for performance differences:
| SQL | HEAP table | COLUMNAR table |
|---|---|---|
| table creation (10 million records) | 7.6s | 15.6s |
| adding 5 million records | 13.7s | 18.5s |
| … | 4.9s | 7.8s |
Using the aforementioned table definitions, the following metrics were generated with the runtime parameter max_parallel_workers_per_gather = 4.
It’s quite evident that, at least for these two tables, there’s no performance benefit of a columnar accessed table over a regular heap accessed one:
| SQL | HEAP table | COLUMNAR table |
|---|---|---|
| … | 742.411 ms | 914.096 ms |
| … | 6441.425 ms | 5871.620 ms |
| … | 329.562 ms | 902.614 ms |
| … | 531.525 ms | 1602.756 ms |
In order to get a better idea of the performance differences, a second set of tables at a greater scale was created. This time, while the number of records was halved, the number of columns was increased from two to one hundred.
Even if most of the columns are simply copies of one another, the columnar table’s resultant compression is remarkable as the default size is reduced by a factor of 752X.
```
/* TABLES

                              Table "public.t[34]"
 Column |  Type  | Collation | Nullable |                         Default
--------+--------+-----------+----------+----------------------------------------------------------------------------
 c1     | bigint |           | not null | nextval('t1_c1_seq'::regclass)
 c2     | bigint |           |          |
 c3     | text   |           |          | 'aowjfa fawjfawofjawofjawoifawevvaerarpfjkaofvaweawe[OJARGOIAJOAFWF'::text
 c4     | text   |           |          | 'aowjfa fawjfawofjawofjawoifawevvaerarpfjkaofvaweawe[OJARGOIAJOAFWF'::text
.
.
.
 c98    | text   |           |          | 'aowjfa fawjfawofjawofjawoifawevvaerarpfjkaofvaweawe[OJARGOIAJOAFWF'::text
 c99    | text   |           |          | 'aowjfa fawjfawofjawofjawoifawevvaerarpfjkaofvaweawe[OJARGOIAJOAFWF'::text
 c100   | text   |           |          | 'aowjfa fawjfawofjawofjawoifawevvaerarpfjkaofvaweawe[OJARGOIAJOAFWF'::text
Indexes:
    "t1_pkey" PRIMARY KEY, btree (c1)
    "t1_c2_idx" btree (c2)

                          List of relations
 Schema | Name | Type  |  Owner   | Persistence | Access method | Size  | Description
--------+------+-------+----------+-------------+---------------+-------+-------------
 public | t3   | table | postgres | permanent   | heap          | 67 GB |
 public | t4   | table | postgres | permanent   | columnar      | 89 MB |
*/
```
Examining the indexes, one sees that they are essentially the same size on both tables.
```
                             List of relations
 Schema |     Name      | Type  |  Owner   | Table | Persistence | Access method |  Size
--------+---------------+-------+----------+-------+-------------+---------------+--------
 public | t3_c2_idx     | index | postgres | t3    | permanent   | btree         | 105 MB
 public | t3_c2_idx1    | index | postgres | t3    | permanent   | btree         | 105 MB
 public | t3_length_idx | index | postgres | t3    | permanent   | btree         | 33 MB
 public | t3_pkey       | index | postgres | t3    | permanent   | btree         | 107 MB
 public | t4_c2_idx     | index | postgres | t4    | permanent   | btree         | 105 MB
 public | t4_length_idx | index | postgres | t4    | permanent   | btree         | 33 MB
 public | t4_pkey       | index | postgres | t4    | permanent   | btree         | 107 MB
```
Unlike the first set of query plans, these ones clearly demonstrate a significant performance improvement.
Curious to see what would change in the way of performance, I also varied max_parallel_workers_per_gather between the two timing runs; it doesn’t appear to have changed much.
| SQL | Timings, run 1 (HEAP / COLUMNAR) | Timings, run 2 (HEAP / COLUMNAR) |
|---|---|---|
| without index | 9.6s / 590.176 ms | 8.7s / 596.459 ms |
| … | 10.4s / 509.209 ms | 8.8s / 541.452 ms |
| … | 1m34s / 1.1s | 1m17s / 1.2s |
| … | 1m33s / 1.2s | 1m18s / 1.2s |
General observations: Btree indexes are similar in size between HEAP and COLUMNAR tables. Overall their performance also appears similar although the columnar table’s index is somewhat slower, no doubt due to the extra processing required to uncompress the table’s values.
Regarding Expression Indexes: Creating an EXPRESSION index on a COLUMNAR table is significantly faster than on a HEAP table:
```sql
-- 1m17s
create index on t3(length(c90));
-- 14s
create index on t4(length(c90));
```
Regarding the runtime parameter max_parallel_workers_per_gather: index performance varies considerably on HEAP tables depending upon the value set. The following results highlight that, depending upon the type of table used, this parameter can become important when hardware resources and server costs are a consideration:
| SQL | workers = 4 (HEAP / COLUMNAR) | workers = 1 (HEAP / COLUMNAR) |
|---|---|---|
| using BTREE index | 467.789 ms / 561.522 ms | 748.939 ms / 599.629 ms |
| using EXPRESSION index | 1.614 ms / 31.980 ms | 2.346 ms / 38.766 ms |
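For readers who want to reproduce this kind of comparison, the pattern in a psql session looks something like the following sketch (the WHERE clause is illustrative, not the exact query used for the timings above; t3 is the heap table and t4 the columnar one):

```sql
-- Compare plans and timings at two parallelism settings.
SET max_parallel_workers_per_gather = 4;
EXPLAIN (ANALYZE) SELECT count(*) FROM t3 WHERE c2 = 42;
EXPLAIN (ANALYZE) SELECT count(*) FROM t4 WHERE c2 = 42;

-- Re-run with reduced parallelism to see the effect on the heap table.
SET max_parallel_workers_per_gather = 1;
EXPLAIN (ANALYZE) SELECT count(*) FROM t3 WHERE c2 = 42;
EXPLAIN (ANALYZE) SELECT count(*) FROM t4 WHERE c2 = 42;
```

Note that the setting is per-session here; changing it in postgresql.conf would apply it server-wide.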
Overall, indexes, constraints, and access methods are still evolving, with many features yet to be implemented.
Let’s start with a big issue, DELETE and UPDATE:
```sql
-- fails
delete from t2 where id=5;
ERROR:  UPDATE and CTID scans not supported for ColumnarScan

-- fails
update t2 set id=5;
ERROR:  UPDATE and CTID scans not supported for ColumnarScan
```
Creating indexes on a columnar table is restricted to btree indexes:
```sql
-- works
create index on t2 using btree (id);

-- fails
create index on t2 using columnar (id);
ERROR:  unsupported access method for the index on columnar table t2
```
Foreign key constraints aren’t implemented:
```sql
select generate_series as id into t3 from generate_series(0,15);
alter table t3 add primary key(id);

-- works for our standard table t1
alter table t1 add foreign key(id) references t3(id);

-- fails with the columnar table t2
alter table t2 add foreign key(id) references t3(id);
ERROR:  Foreign keys and AFTER ROW triggers are not supported for columnar tables
HINT:  Consider an AFTER STATEMENT trigger instead.

-- works after converting table t2 from COLUMNAR to HEAP
alter table t2 set access method heap;
alter table t2 add foreign key(id) references t3(id);
alter table t2 set access method columnar;
```
Columnar tables can be used as partitions; a partitioned table can be made up of any combination of row and columnar partitions.
An excellent use case is INSERT-once, READ-only table partitions, where one can leverage both the compression and the better-performing OLAP-type queries for very large tables.
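A minimal sketch of that pattern, with hypothetical table names: keep the current, still-mutable partition on heap storage and declare closed-out historical partitions columnar, so old data is compressed while recent data still accepts UPDATE and DELETE:

```sql
-- Hypothetical time-partitioned table mixing access methods per partition.
CREATE TABLE events (id bigint, created_at timestamptz, payload text)
    PARTITION BY RANGE (created_at);

-- Current month: row-wise (heap), still updatable and deletable.
CREATE TABLE events_2023_01 PARTITION OF events
    FOR VALUES FROM ('2023-01-01') TO ('2023-02-01');

-- Closed-out month: columnar, for compression and faster OLAP scans.
CREATE TABLE events_2022_12 PARTITION OF events
    FOR VALUES FROM ('2022-12-01') TO ('2023-01-01')
    USING columnar;
```

Queries against the parent table transparently span both storage formats; only writes targeting the columnar partition hit the UPDATE/DELETE limitation described above.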
Columnar Extension Limitations, as of version 11.1:
There’s actually more documented. Refer here for more information.
AUTHOR’S NOTE: Regarding the custom packages created for this blog: the entire citus suite is designed to enhance Postgres as a distributed database solution, and only the columnar table component was covered here. The complete feature set is part of this package install and should work, but has not been tested.
Despite its current limitations, there are use cases where this extension can definitely make a difference. And it speaks well of its future as the team continues development by constantly improving and adding capabilities. Watch for updates on its GitHub source repository.