Buy Percona ServicesBuy Now!

Updating/Deleting Rows From Clickhouse (Part 2)

 | January 16, 2018 |  Posted In: Insight for DBAs, MySQL, Yandex ClickHouse

PREVIOUS POST
NEXT POST

ClickHouseIn this post, we’ll look at updating and deleting rows with ClickHouse. It’s the second of two parts.

In the first part of this post, we described the high-level overview of implementing incremental refresh on a ClickHouse table as an alternative support for UPDATE/DELETE. In this part, we will show you the actual steps and sample code.

Prepare Changelog Table

First, we create the changelog table below. This can be stored on any other MySQL instance separate from the source of our analytics table. When we run the change capture script, it will record the data on this table that we can consume later with the incremental refresh script:

Create ClickHouse Table

Next, let’s create the target ClickhHouse table. Remember, that the corresponding MySQL table is below:

Converting this table to ClickHouse looks like below, with the addition of a “created_day” column that serves as the partitioning key:

Run Changelog Capture

Once the tables are ready, running the change capture script. An example script can be found in this gist, which is written in Python and uses the python-mysql-replication library. This library acts as replication client, continuously downloads the binary logs from the source and sifts through it to find any UPDATE/DELETE executed against our source table.

There are a few configuration options that need to be customized in the script.

  • LOG_DB_HOST: The MySQL host where we created the clickhouse_changelog table.
  • LOG_DB_NAME: The database name where the clickhouse_changelog table is created.
  • SRC_DB_HOST: The MySQL host where we will be downloading binary logs from. This can either be a primary or secondary/replica as long as its the same server where our raw table is also located.
  • MYSQL_USER: MySQL username.
  • MYSQL_PASS: MySQL password.
  • TABLE: The table we want to watch for changes.

When the script is successfully configured and running, the clickhouse_changelog table should start populating with data like below.

Full Table Import

So we have our changelog capture in place, the next step is to initially populate the ClickHouse table from MySQL. Normally, we can easily do this with a  mysqldump into a tab-separated format, but remember we have to transform the created_at column from MySQL into ClickHouse’s Date format to be used as partitioning key.

A simple way to do this is by using a simple set of shell commands like below:

One thing to note about this process is that the MySQL client buffers the results for the whole query, and it could eat up all the memory on the server you run this from if the table is really large. To avoid this, chunk the table into several million rows at a time. Since we already have the changelog capture running and in place from the previous step, you do not need to worry about any changes between chunks. We will consolidate those changes during the incremental refreshes.

Incremental Refresh

After initially populating the ClickHouse table, we then set up our continuous incremental refresh using a separate script. A template script we use for the table on our example can be found in this gist.

What this script does is twofold:

  • Determines the list of weeks recently modified based on clickhouse_changelog, dump rows for those weeks and re-imports to ClickHouse.
  • If the current week is not on the list of those with modifications, it also checks for new rows based on the auto-incrementing primary key and appends them to the ClickHouse table.

An example output of this script would be:

Note that, on step 4, if you imported a really large table and the changelog had accumulated a large number of changes to refresh, the initial incremental execution might take some time. After that though, it should be faster. This script can be run every minute, longer or shorter, depending on how often you want the ClickHouse table to be refreshed.

To wrap up, here is a query from MySQL on the same table, versus ClickHouse.

Enjoy!

PREVIOUS POST
NEXT POST
Jervin Real

As Senior Consultant, Jervin partners with Percona's customers on building reliable and highly performant MySQL infrastructures while also doing other fun stuff like watching cat videos on the internet. Jervin joined Percona in Apr 2010.

One Comment

  • Would be interesting to compare the time taken by MySQL when running that same query with a covering index (user_id + created_at) in place….

Leave a Reply