Column Store Database Benchmarks: MariaDB ColumnStore vs. Clickhouse vs. Apache Spark

This blog shares some column store database benchmark results, comparing the query performance of MariaDB ColumnStore v. 1.0.7 (based on InfiniDB), ClickHouse and Apache Spark.

I’ve already written about ClickHouse (a column store database).

The purpose of the benchmark is to see how these three solutions work on a single big server, with many CPU cores and large amounts of RAM. All three are massively parallel processing (MPP) database systems, so they should be able to use many cores for SELECT queries.

For the benchmarks, I chose three datasets:

  1. Wikipedia page counts, loaded with the full year 2008, ~26 billion rows
  2. Query analytics data from Percona Monitoring and Management
  3. Online shop orders

This blog post shares the results for the Wikipedia page counts (same queries as for the ClickHouse benchmark). In the following posts, I will use other datasets to compare the performance.

Databases, Versions and Storage Engines Tested

  • MariaDB ColumnStore v. 1.0.7, ColumnStore storage engine
  • Yandex ClickHouse v. 1.1.54164, MergeTree storage engine
  • Apache Spark v. 2.1.0, Parquet files and ORC files

Although all of the above solutions can run in a “cluster” mode (with multiple nodes), I’ve only used one server.


This time I’m using newer and faster hardware:

  • CPU: 2 physical processors, 32 cores, 64 virtual cores (hyperthreading enabled)
  • RAM: 256 GB
  • Disk: Samsung SSD 960 PRO 1 TB, NVMe card

Data Sizes

I’ve loaded the above data into ClickHouse, ColumnStore and MySQL, as well as into Apache Spark (as Parquet and ORC files). The MySQL tables are InnoDB with a primary key; Wikistat was not loaded into MySQL due to its size.

| Dataset | ColumnStore | ClickHouse | MySQL | Spark / Parquet | Spark / ORC |
|---|---|---|---|---|---|
| Wikistat | 374.24 GB | 211.3 GB | n/a (> 2 TB) | 395 GB | 273 GB |
| Query metrics | 61.23 GB | 28.35 GB | 520 GB | n/a | n/a |
| Store orders | 9.3 GB | 4.01 GB | 46.55 GB | n/a | n/a |


Query Performance

Wikipedia page counts queries (response times in seconds; lower is better):

| Test type (warm) | Spark | ClickHouse | ColumnStore |
|---|---|---|---|
| Query 1: count(*) | 5.37 | 2.14 | 30.77 |
| Query 2: group by month | 205.75 | 16.36 | 259.09 |
| Query 3: top 100 wiki pages by hits (group by path) | 750.35 | 171.22 | 1640.7 |

| Test type (cold) | Spark | ClickHouse | ColumnStore |
|---|---|---|---|
| Query 1: count(*) | 21.93 | 8.01 | 139.01 |
| Query 2: group by month | 217.88 | 16.65 | 420.77 |
| Query 3: top 100 wiki pages by hits (group by path) | 887.43 | 182.56 | 1703.19 |

Partitioning and Primary Keys

All three solutions can take advantage of data “partitioning” and scan only the needed rows.

ClickHouse has “primary keys” (for the MergeTree storage engine) and scans only the needed chunks of data (similar to partition “pruning” in MySQL). No changes to SQL or table definitions are needed when working with ClickHouse.

ClickHouse example:
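A minimal sketch of the one-month query, assuming a wikistat table with a date column (clickhouse-client reports the number of rows processed after each query):

```sql
SELECT count(*), toMonth(date) AS mon
FROM wikistat
WHERE date BETWEEN toDate('2008-01-01') AND toDate('2008-01-31')
GROUP BY mon;
```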

As we can see here, ClickHouse has processed ~two billion rows for one month of data, and ~23 billion rows for ten months of data. Queries that only select one month of data are much faster.

For ColumnStore we need to re-write the SQL query and use “between ‘2008-01-01’ and ‘2008-01-10’” so it can take advantage of partition elimination (as long as the data is loaded in approximate time order). When using functions (e.g., year(dt) or month(dt)), the current implementation does not use this optimization. (This is similar to MySQL, in that if the WHERE clause has month(dt) or any other function on the dt field, MySQL can’t use an index on that field.)

ColumnStore example:
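A sketch of the rewritten query with an explicit BETWEEN range, assuming the same wikistat table:

```sql
SELECT count(*), month(date) AS mon
FROM wikistat
WHERE date BETWEEN '2008-01-01' AND '2008-01-31'
GROUP BY mon;
```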

Apache Spark does support partitioning, but it must be declared in the table definition (e.g., with the Parquet format). Without declared partitions, even the modified query (“select count(*), month(date) as mon from wikistat where date between ‘2008-01-01’ and ‘2008-01-31’ group by mon order by mon”) will have to scan all the data.
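A sketch of how a partitioned Parquet table could be declared in Spark SQL (the source table name is an assumption; verify the CTAS syntax against your Spark version):

```sql
-- create a Parquet table partitioned by date, so queries with a
-- WHERE filter on date only read the matching partitions
CREATE TABLE wikistat_parquet
USING parquet
PARTITIONED BY (date)
AS SELECT * FROM wikistat;
```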

The following table shows the performance of the updated query:

| Test type / updated query | Spark | ClickHouse | ColumnStore |
|---|---|---|---|
| group by month, one month, updated syntax | 205.75 | 0.93 | 12.46 |
| group by month, ten months, updated syntax | 205.75 | 8.84 | 170.81 |


Working with Large Datasets

With 1 TB of uncompressed data, doing a “GROUP BY” requires lots of memory to store the intermediate results (unlike MySQL; ColumnStore, ClickHouse and Apache Spark use hash tables to store the groups in “buckets”). For example, this query requires a very large hash table:
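A sketch of such a query, assuming the wikistat table (every distinct path value becomes a hash table entry):

```sql
SELECT path, count(*) AS cnt
FROM wikistat
GROUP BY path
ORDER BY cnt DESC
LIMIT 100;
```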

As “path” is actually a URL (without the hostname), it takes a lot of memory to store the intermediate results (hash table) for GROUP BY.

MariaDB ColumnStore does not currently allow GROUP BY data to “spill” to disk (only disk-based joins are implemented). If you need to GROUP BY on a large text field, you can decrease the data block cache setting in Columnstore.xml (i.e., set the cache to 10% of RAM) to make room for the intermediate GROUP BY:
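A sketch of the relevant Columnstore.xml fragment; NumBlocksPct controls the data block cache as a percentage of RAM (verify the element name and default against your installation):

```xml
<DBBC>
    <!-- data block cache, as a percentage of total RAM; lowered from
         the default to leave memory for large GROUP BY hash tables -->
    <NumBlocksPct>10</NumBlocksPct>
</DBBC>
```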

In addition, as the query has an ORDER BY, we need to increase max_length_for_sort_data in MySQL:
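For example (the 8 MB value here is an illustration, not a tuned recommendation):

```sql
-- allow wide sort rows: path can be up to 1 KB, plus the aggregates
SET GLOBAL max_length_for_sort_data = 8 * 1024 * 1024;
```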

SQL Support

| Feature | Spark | ClickHouse | ColumnStore |
|---|---|---|---|
| INSERT … VALUES | ✅ yes | ✅ yes | ✅ yes |
| UPDATE | ❌ no* | ❌ no | ✅ yes |
| DELETE | ❌ no* | ❌ no | ✅ yes |
| ALTER … change partitions | ✅ yes | ✅ yes | ✅ yes |
| SELECT with WINDOW functions | ✅ yes | ❌ no | ✅ yes |


* Spark does not support UPDATE/DELETE directly. However, Hive supports ACID transactions with UPDATE and DELETE statements, but only with the ORC file format; BEGIN, COMMIT and ROLLBACK are not yet supported.

ColumnStore is the only database of the three that supports a full set of DML and DDL (almost all of MySQL’s SQL implementation is supported).

Comparing ColumnStore to Clickhouse and Apache Spark

| Solution | Advantages | Disadvantages |
|---|---|---|
| MariaDB ColumnStore | MySQL frontend (makes it easy to migrate from MySQL); UPDATE and DELETE supported; window functions support | SELECT queries are slower; no replication from a normal MySQL server (planned for future versions); no support for GROUP BY on disk |
| Yandex ClickHouse | Fastest performance; better compression; primary keys; disk-based GROUP BY, etc. | No MySQL protocol support |
| Apache Spark | Flexible storage options; machine learning integration (e.g., PySpark ML libraries run inside Spark nodes) | No MySQL protocol support; slower SELECT queries (compared to ClickHouse) |


Yandex ClickHouse is the clear winner in this benchmark: it shows both better performance (>10x) and better compression than MariaDB ColumnStore and Apache Spark. If you are looking for the best performance and compression, ClickHouse looks very good.

At the same time, ColumnStore provides a MySQL endpoint (MySQL protocol and syntax), so it is a good option if you are migrating from MySQL. Right now it can’t replicate directly from MySQL, but if this option becomes available in the future, we could attach a ColumnStore replication slave to any MySQL master and use the slave for reporting queries (i.e., BI or data science teams could use a ColumnStore database that is updated very close to real time).

Table Structure and List of Queries

Table structure (MySQL / Columnstore version):
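A sketch of the table definition, assuming the standard wikistat column layout (project, subproject, page path, hits and transferred bytes); column types are assumptions:

```sql
CREATE TABLE wikistat (
    date date,
    time time,
    project varchar(20),
    subproject varchar(2),
    path varchar(1024),
    hits bigint(20),
    size bigint(20)
) ENGINE=ColumnStore;
```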

Query 1:
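A sketch, consistent with the count(*) test in the benchmark tables above:

```sql
SELECT count(*) FROM wikistat;
```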

Query 2a (full scan):
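A sketch of the full-scan group-by-month query, assuming the wikistat table above:

```sql
SELECT count(*), month(date) AS mon
FROM wikistat
GROUP BY mon
ORDER BY mon;
```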

Query 2b (for partitioning test):
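This is the query quoted in the partitioning section, with an explicit date range so partition elimination can apply:

```sql
SELECT count(*), month(date) AS mon
FROM wikistat
WHERE date BETWEEN '2008-01-01' AND '2008-01-31'
GROUP BY mon
ORDER BY mon;
```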

Query 3:
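A sketch of the top-100-pages query (group by path), assuming the wikistat table above:

```sql
SELECT path, sum(hits) AS total
FROM wikistat
GROUP BY path
ORDER BY total DESC
LIMIT 100;
```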



Comments (20)

  • Luis Jurado

    I’ve been looking into different platforms to do analytics, and this blog post makes me want to reconsider ClickHouse. What I don’t like about it is that, apart from Yandex, almost no one else is using it yet, compared to Hadoop-based alternatives or MariaDB, where I could easily get support if I ran into issues.

    Also, it would be really cool to see a performance comparison over multiple nodes, to compare how well these different systems scale over a cluster.

    March 19, 2017 at 3:13 pm
    • Sergey

      Hello Luis,

      as far as we can see, more than a hundred companies use ClickHouse. To make sure of this, simply join the ClickHouse Telegram chat or Google group. There you can ask any questions; the community and the ClickHouse team respond promptly.

      If you still need a support service, please leave your contacts at

      March 20, 2017 at 7:58 am
    • Хехе

      It is gaining popularity quickly here in Russia. For instance, we were switching to Spark from our legacy statistical system, but immediately dropped everything we had done once ClickHouse was released:

      1) It turned out to be much quicker.
      2) The fact that it is a server greatly benefits us: we get input source splitting for free. With Spark you either create a table with many columns, which is bad for readability (and the insert statement can be really long, thus error-prone), or you parse these sources several times, which can be overly expensive at times. Not a problem with ClickHouse.
      3) With ClickHouse you don’t just get naturally distributed log parsing; you naturally have continuous data, second by second, minute by minute, day by day, available in a single source. With Spark you will struggle with this.
      4) ClickHouse gives you free real-time access to the collected data. This is really useful in many circumstances. It is a great time saver sometimes.
      5) It is fast, as I said. Hadoop is slow to the extent that you could need several hosts just to match the speed of relational operations over GNU utils (awk, grep, sort, join) on a single host, or rather not quite reach that speed. Hadoop is just too slow.

      March 20, 2017 at 2:13 pm
      • Luis Jurado

        Good to see that it is getting traction. I couldn’t find much information about people using it, but maybe if I searched on Yandex I would get better results.

        March 20, 2017 at 4:09 pm
      • borislavmavrin

          I think it’s unfair to compare a DB with Spark. Spark is a very general tool: you can do pretty much everything, from data ingestion, cleaning and structuring up to ML and GraphX modelling, and finally streaming, even natural language processing. Don’t forget about BigDL. I also work with highly instructed data. Spark is incredible; it is more like a functional programming language at scale. Yes, it is slower, but that is the tradeoff between functionality and speed. As a data scientist, I don’t see any competitors to Spark.

          Another side note: I don’t know how hard it is to scale ClickHouse. I know that Mongo requires a lot of engineering in order to scale. As for Spark, I can easily install it on a cluster myself.

        January 11, 2018 at 10:05 am
        • Alexander Rubin

          Yes, it is a good point: Spark is a more general tool and not *just* an MPP database. However, for the purposes of this blog post I wanted to see how fast Spark is able to just process data. If you are using other features of Apache Spark (e.g., ML), those are of course not available in ClickHouse and ColumnStore.

          January 11, 2018 at 10:46 am
  • Jacob

    Thank you for very informative article.

    It would be nice if the comparison also included the difficulty of installation, data loading and tuning.

    There is no mention of tuning. Does that mean the databases were used “out of the box”, with default settings?

    Also, how well are MariaDB ColumnStore, ClickHouse and Apache Spark supported online,
    I mean by Internet users? Could you find answers to your problems on the Internet?

    March 19, 2017 at 5:01 pm
  • james

    thanks a lot

    March 20, 2017 at 6:13 am
  • sanjeev

    Clickhouse has no Update or Delete functionality. It is still super fast, but lack of Update/Delete is a serious limitation for many users.

    May 16, 2017 at 6:33 am
  • Erin ONeill

    I sure hope that Percona can bring ClickHouse into the MySQL protocol so that Percona Toolkit will work with it, as well as PMM. Very interesting. (I sure wish there were window functions support, as I now have a Postgres instance for that, and I sorely miss Percona Toolkit.)

    July 10, 2017 at 1:25 pm
  • Tony

    comparing apples to oranges

    August 10, 2017 at 7:08 pm
  • Marcos Albe

    Alex, I would love to see same comparison with Druid and Pinot, which seem to be more in the same league than ClickHouse. Have you considered these two? Any comments on’em?

    April 25, 2018 at 10:06 am
  • orangleliu

    very cool, clickhouse is very fast

    August 29, 2018 at 12:48 am
  • Ivan Zolotuhin

    potentially ClickHouse can be accessible via MySQL protocol using proxysql-clickhouse

    September 11, 2018 at 1:22 pm
  • Alexander Rubin

    As of now Clickhouse also supports UPDATES / DELETES (as a form of “mutations”).

    January 9, 2019 at 8:38 am
  • Eyal Melamed

    For systems like those mentioned above, with a lot of data being added, we are using ColumnStore, as I can load a file with 50K lines into a large fact table in seconds. Is there any test/comparison for load times? Can ClickHouse load new data rapidly? For instance, if I wanted to add 20-50K lines per minute, is it capable of doing those data loads fast enough to avoid delays and locks?

    April 1, 2019 at 11:20 am
    • Alexandre HEIM

      By micro-batching your inserts, you can easily achieve more than 100 000 inserts/s. We did a test on 15 billion records, and we inserted at a constant rate of 250 000 records/s, CH is very fast.

      April 3, 2019 at 2:04 pm
  • patrellery

    Clickhouse supports UPDATE and DELETE, please update

    April 2, 2019 at 10:53 pm

Comments are closed.
