In this blog post, we’ll look at nested data structures in ClickHouse and how they can be used with PMM to store query data.
Nested structures are not common in relational database management systems: usually, it’s just flat tables. Sometimes, though, it would be convenient to store unstructured information in a structured database.
We are working to adapt ClickHouse as long-term storage for Percona Monitoring and Management (PMM), and in particular to store detailed information about queries. One of the problems we are trying to solve is counting the different errors that cause a particular query to fail.
For example, on 2017-08-17 the query:
SELECT foo FROM bar WHERE id=?
was executed 1,000 times. It failed 25 times with error code “1212” and eight times with error code “1250”. Of course, the traditional way to store this in a relational database would be to have a table (Date, QueryID, ErrorCode, ErrorCnt)
and then perform a JOIN against this table. Unfortunately, columnar databases don’t perform well with multiple joins, and often the recommendation is to use denormalized tables.
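For illustration, that normalized layout might look like the sketch below (generic SQL; the table name query_errors and the column sizes are hypothetical):

CREATE TABLE query_errors
(
    Period    DATE,
    QueryID   INT,
    ErrorCode VARCHAR(16),
    ErrorCnt  INT
);

-- Every report then needs a JOIN to pick up the error counts:
SELECT q.Period, q.QueryID, e.ErrorCode, e.ErrorCnt
FROM queries AS q
JOIN query_errors AS e
  ON e.Period = q.Period AND e.QueryID = q.QueryID;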
Alternatively, we could create a column for each possible ErrorCode, but this is not an optimal solution: there could be thousands of them, and most of the time they would be empty.
For cases like this, ClickHouse proposes Nested data structures. For our case, these can be defined as:
CREATE TABLE queries
(
    Period Date,
    QueryID UInt32,
    Fingerprint String,
    Errors Nested
    (
        ErrorCode String,
        ErrorCnt UInt32
    )
) ENGINE = MergeTree(Period, QueryID, 8192);
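Under the hood, ClickHouse stores a Nested structure as a set of parallel array columns, which DESCRIBE makes visible (output abridged; the real output also includes default-expression columns):

DESCRIBE TABLE queries

┌─name─────────────┬─type──────────┐
│ Period           │ Date          │
│ QueryID          │ UInt32        │
│ Fingerprint      │ String        │
│ Errors.ErrorCode │ Array(String) │
│ Errors.ErrorCnt  │ Array(UInt32) │
└──────────────────┴───────────────┘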
This solution raises two obvious questions: how do we insert data into this table, and how do we extract it?
Let’s start with INSERT. An insert can look like this:
INSERT INTO queries VALUES ('2017-08-17', 5, 'SELECT foo FROM bar WHERE id=?', ['1220','1230','1212'], [5,6,2])
which means that on 2017-08-17 this query failed with error 1220 five times, error 1230 six times, and error 1212 two times.
Now, on a different date, it might produce different errors:
INSERT INTO queries VALUES ('2017-08-18', 5, 'SELECT foo FROM bar WHERE id=?', ['1220','1240','1260'], [3,16,12])
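One caveat: within a row, all arrays of a Nested structure must have the same length, because element N of Errors.ErrorCode pairs with element N of Errors.ErrorCnt. A mismatched insert like the hypothetical one below is rejected with an exception:

-- Rejected: three error codes but only two counts
INSERT INTO queries VALUES ('2017-08-19', 5, 'SELECT foo FROM bar WHERE id=?', ['1220','1230','1212'], [5,6])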
Let’s take a look at ways to SELECT data. A very basic SELECT:
SELECT *
FROM queries

┌─────Period─┬─QueryID─┬─Fingerprint─┬─Errors.ErrorCode───────┬─Errors.ErrorCnt─┐
│ 2017-08-17 │       5 │ SELECT foo  │ ['1220','1230','1212'] │ [5,6,2]         │
│ 2017-08-18 │       5 │ SELECT foo  │ ['1220','1240','1260'] │ [3,16,12]       │
└────────────┴─────────┴─────────────┴────────────────────────┴─────────────────┘
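Each nested column can also be selected on its own, where it behaves like an ordinary Array column. A small sketch against the data above:

SELECT QueryID, Errors.ErrorCode FROM queries

┌─QueryID─┬─Errors.ErrorCode───────┐
│       5 │ ['1220','1230','1212'] │
│       5 │ ['1220','1240','1260'] │
└─────────┴────────────────────────┘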
For a more familiar tabular output, we can use the ARRAY JOIN extension:
SELECT *
FROM queries
ARRAY JOIN Errors

┌─────Period─┬─QueryID─┬─Fingerprint─┬─Errors.ErrorCode─┬─Errors.ErrorCnt─┐
│ 2017-08-17 │       5 │ SELECT foo  │ 1220             │               5 │
│ 2017-08-17 │       5 │ SELECT foo  │ 1230             │               6 │
│ 2017-08-17 │       5 │ SELECT foo  │ 1212             │               2 │
│ 2017-08-18 │       5 │ SELECT foo  │ 1220             │               3 │
│ 2017-08-18 │       5 │ SELECT foo  │ 1240             │              16 │
│ 2017-08-18 │       5 │ SELECT foo  │ 1260             │              12 │
└────────────┴─────────┴─────────────┴──────────────────┴─────────────────┘
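As an aside, if only one known error code is of interest, ARRAY JOIN can be avoided entirely by indexing into the arrays. This sketch uses indexOf, which returns the 1-based position of a value in an array (or 0 if absent); indexing an array at position 0 yields the type’s default value, i.e. a count of 0:

SELECT
    Period,
    QueryID,
    -- position of '1220' in the codes array selects the matching count,
    -- or 0 when the query produced no such error that day
    Errors.ErrorCnt[indexOf(Errors.ErrorCode, '1220')] AS Cnt1220
FROM queries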
However, usually we want to see the aggregation over multiple periods, which can be done with traditional aggregation functions:
SELECT
    QueryID,
    Errors.ErrorCode,
    SUM(Errors.ErrorCnt)
FROM queries
ARRAY JOIN Errors
GROUP BY
    QueryID,
    Errors.ErrorCode

┌─QueryID─┬─Errors.ErrorCode─┬─SUM(Errors.ErrorCnt)─┐
│       5 │ 1212             │                    2 │
│       5 │ 1230             │                    6 │
│       5 │ 1260             │                   12 │
│       5 │ 1240             │                   16 │
│       5 │ 1220             │                    8 │
└─────────┴──────────────────┴──────────────────────┘
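When a per-code breakdown isn’t needed, aggregate function combinators offer another ARRAY JOIN-free option: the -Array combinator applies the aggregate to every element of the array argument. A sketch for the total error count per query:

SELECT
    QueryID,
    -- the -Array combinator feeds every array element into sum()
    sumArray(Errors.ErrorCnt) AS TotalErrors
FROM queries
GROUP BY QueryID

With the data above, this would return 44 for QueryID 5 (5+6+2+3+16+12).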
If we want to get really creative and return only one row per QueryID, we can do that as well:
SELECT
    QueryID,
    groupArray((ecode, cnt))
FROM
(
    SELECT
        QueryID,
        ecode,
        sum(ecnt) AS cnt
    FROM queries
    ARRAY JOIN
        Errors.ErrorCode AS ecode,
        Errors.ErrorCnt AS ecnt
    GROUP BY
        QueryID,
        ecode
)
GROUP BY QueryID

┌─QueryID─┬─groupArray(tuple(ecode, cnt))──────────────────────────────┐
│       5 │ [('1230',6),('1212',2),('1260',12),('1220',8),('1240',16)] │
└─────────┴────────────────────────────────────────────────────────────┘
Conclusion
Despite being a columnar database, ClickHouse provides flexible ways to store data in a less structured manner and a variety of functions to extract and aggregate it.
Happy data warehousing!
Comments

I’ve been storing this type of data in Elasticsearch for years… well, 3-4 years. It’s fast, flexible, and does not require relational setups. Seriously worth a look, especially with the machine learning plugin, which can pick up anomalies for you and warn you when they happen. Add the graphing capability of Kibana, and you are set.
Just my viewpoint 🙂
For relational data you will find ClickHouse significantly faster. Here is a third-party benchmark that compares ClickHouse and Elasticsearch for some SQL queries on the same hardware: http://tech.marksblogg.com/benchmarks.html
> does not require relational setups
at the cost of performance and storage efficiency.
How do you increment ErrorCnt? Every time a query fails, you want to increment the appropriate ErrorCnt, right? With a nested structure, how do you do that?
Andy,
We do not increment. We aggregate reports and make a new entry, say, every 5 minutes. That’s why, to see the total, we need to use an aggregation function (SUM in this case) over multiple entries.
ClickHouse does not support update operations; the latest status is at https://github.com/yandex/ClickHouse/issues/923. UPDATE and DELETE support is on the roadmap for Q1 2018.
Is it possible to achieve the same result (the more familiar tabular output shown with the ARRAY JOIN extension) without ARRAY JOIN for a nested data column? Array joins are slow, so we cannot use them.