Benchmark ClickHouse Database and clickhousedb_fdw

In this research,  I wanted to see what kind of performance improvements could be gained by using a ClickHouse data source rather than PostgreSQL. Assuming that I would see performance advantages with using ClickHouse, would those advantages be retained if I access ClickHouse from within postgres using a foreign data wrapper (FDW)? The FDW in question is clickhousedb_fdw – an open source project from Percona!

The database environments under scrutiny are PostgreSQL v11, clickhousedb_fdw and a ClickHouse database. Ultimately, from within PostgreSQL v11, we are going to issue various SQL queries routed through our clickhousedb_fdw to the ClickHouse database. Then we’ll see how the FDW performance compares with those same queries executed in native PostgreSQL and native ClickHouse.

Clickhouse Database

ClickHouse is an open source column based database management system which can achieve performance of between 100 and 1000 times faster than traditional database approaches, capable of processing more than a billion rows in less than a second.

Clickhousedb_fdw

clickhousedb_fdw, a ClickHouse database foreign data wrapper, or FDW, is an open source project from Percona. Here’s a link for the GitHub project repository:

https://github.com/Percona-Lab/clickhousedb_fdw

I wrote a blog in March which tells you more about our FDW: https://www.percona.com/blog/2019/03/29/postgresql-access-clickhouse-one-of-the-fastest-column-dbmss-with-clickhousedb_fdw/

As you’ll see, this provides for an FDW for ClickHouse that allows you to SELECT from, and INSERT INTO, a ClickHouse database from within a PostgreSQL v11 server.

The FDW supports advanced features such as aggregate pushdown and joins pushdown. These significantly improve performance by utilizing the remote server’s resources for these resource intensive operations.

Benchmark environment

  • Supermicro server:
    • Intel(R) Xeon(R) CPU E5-2683 v3 @ 2.00GHz
    • 2 sockets / 28 cores / 56 threads
    • Memory: 256GB of RAM
    • Storage: Samsung  SM863 1.9TB Enterprise SSD
    • Filesystem: ext4/xfs
  • OS: Linux smblade01 4.15.0-42-generic #45~16.04.1-Ubuntu
  • PostgreSQL: version 11

Benchmark tests

Rather than using some machine generated dataset for this benchmark, we used the “On Time Reporting Carrier On-Time Performance” data from 1987 to 2018. You can access the data using our shell script available here:

https://github.com/Percona-Lab/ontime-airline-performance/blob/master/download.sh

The size of the database is 85GB, providing a single table of 109 columns.

Benchmark Queries

Here are the queries I used to benchmark the ClickHouse, clickhousedb_fdw, and PostgreSQL.

Q#Query Contains Aggregates and Group By
Q1SELECT DayOfWeek, count(*) AS c FROM ontime WHERE Year >= 2000 AND Year <= 2008 GROUP BY DayOfWeek ORDER BY c DESC;
Q2SELECT DayOfWeek, count(*) AS c FROM ontime WHERE DepDelay>10 AND Year >= 2000 AND Year <= 2008 GROUP BY DayOfWeek ORDER BY c DESC;
Q3SELECT Origin, count(*) AS c FROM ontime WHERE DepDelay>10 AND Year >= 2000 AND Year <= 2008 GROUP BY Origin ORDER BY c DESC LIMIT 10;
Q4SELECT Carrier, count(*) FROM ontime WHERE DepDelay>10 AND Year = 2007 GROUP BY Carrier ORDER BY count(*) DESC;
Q5SELECT a.Carrier, c, c2, c*1000/c2 as c3 FROM ( SELECT Carrier, count(*) AS c FROM ontime WHERE DepDelay>10 AND Year=2007 GROUP BY Carrier ) a
INNER JOIN (  SELECT   Carrier,count(*) AS c2 FROM ontime WHERE Year=2007 GROUP BY Carrier)b on a.Carrier=b.Carrier ORDER BY c3 DESC;
Q6SELECT a.Carrier, c, c2, c*1000/c2 as c3 FROM ( SELECT Carrier,  count(*) AS c FROM ontime  WHERE DepDelay>10 AND Year >= 2000 AND
Year <= 2008 GROUP BY Carrier) a INNER JOIN ( SELECT  Carrier, count(*) AS c2 FROM ontime  WHERE Year >= 2000 AND Year <= 2008 GROUP BY Carrier ) b on a.Carrier=b.Carrier ORDER BY c3 DESC;
Q7SELECT Carrier, avg(DepDelay) * 1000 AS c3 FROM ontime WHERE Year >= 2000 AND Year <= 2008 GROUP BY Carrier;
Q8SELECT Year, avg(DepDelay) FROM ontime GROUP BY Year;
Q9select Year, count(*) as c1 from ontime group by Year;
Q10SELECT avg(cnt) FROM (SELECT Year,Month,count(*) AS cnt FROM ontime WHERE DepDel15=1 GROUP BY Year,Month) a;
Q11select avg(c1) from (select Year,Month,count(*) as c1 from ontime group by Year,Month) a;
Q12SELECT OriginCityName, DestCityName, count(*) AS c FROM ontime GROUP BY OriginCityName, DestCityName ORDER BY c DESC LIMIT 10;
Q13SELECT OriginCityName, count(*) AS c FROM ontime GROUP BY OriginCityName ORDER BY c DESC LIMIT 10;
Query Contains Joins
Q14SELECT a.Year, c1/c2 FROM ( select Year, count(*)*1000 as c1 from ontime WHERE DepDelay>10 GROUP BY Year) a
INNER JOIN (select Year, count(*) as c2 from ontime GROUP BY Year ) b on a.Year=b.Year ORDER BY a.Year;
Q15SELECT a.”Year”, c1/c2 FROM ( select “Year”, count(*)*1000 as c1 FROM fontime WHERE “DepDelay”>10 GROUP BY “Year”) a INNER JOIN (select “Year”, count(*) as c2
FROM fontime GROUP BY “Year” ) b on a.”Year”=b.”Year”;

Table-1: Queries used in benchmark

Query executions

Here are the results from the each of the queries when run in different database set ups: PostgreSQL with and without indexes, native ClickHouse and clickhousedb_fdw. The time is shown in milliseconds. 

Q#PostgreSQLPostgreSQL (Indexed)ClickHouseclickhousedb_fdw
Q127920196342357
Q235124173015080
Q3340461561867115
Q43163276672537
Q54722089762760
Q6582332436855153
Q730566132565291
Q83830960511112179
Q920674379793181
Q10349902010256148
Q11304895165837155
Q1239357337421861333
Q132991230709101384
Q1454126399131241364212
Q159725830211245259

Table-1: Time taken to execute the queries used in benchmark

Reviewing the results

The graph shows the query execution time in milliseconds, the X-axis shows the query number from the tables above, while the Y-axis shows the execution time in milliseconds. The results for ClickHouse and the data accessed from postgres using clickhousedb_fdw are shown. From the table, you can see there is a huge difference between PostgreSQL and ClickHouse, but there is minimal difference between ClickHouse and clickhousedb_fdw.

 

Clickhouse Vs Clickhousedb_fdw (Shows the overhead of clickhousedb_fdw)

This graph shows the difference between ClickhouseDB and clickhousedb_fdw. In most of the queries, the FDW overhead is not that great, and barely significant apart from in Q12. This query involves joins and ORDER BY clause. Because of the ORDER BY clause the GROUP/BY and ORDER BY does not push down to ClickHouse.

In Table-2 we can see the spike of time in query Q12 and Q13. To reiterate, this is caused by the ORDER BY clause. To confirm this, I ran a queries Q-14 and Q-15 with and without the ORDER BY clause.  Without the ORDER BY clause the completion time is  259ms and with ORDER BY clause, it is 1364212. To debug that the query I explain both the queries, and here is the results of explain.

 

 

 

Conclusion

The results from these experiments show that ClickHouse offers really good performance, and clickhousedb_fdw offers the benefits of ClickHouse performance from within PostgreSQL. While there is some overhead when using clickhousedb_fdw, it is negligible and is comparable to the performance achieved when running natively within the ClickHouse database. This also confirms that the PostgreSQL foreign data wrapper push-down feature provides wonderful results.

Share this post

Comments (3)

  • Mark Callaghan Reply

    Is the data cached by the DBMS or OS?
    Is parallel query in Postgres an option for these queries?

    May 1, 2019 at 7:26 pm
  • Harris Reply

    Somehow my setup on centos return constant zero cost on foreign scan that doesn’t push down the aggregation. Any hint ?

    June 4, 2019 at 4:45 am
    • Ibrar Ahmed Reply

      Can you please create an issue with complete detail, I will look at this.

      June 4, 2019 at 5:18 am

Leave a Reply