In this blog post, I will talk about archiving MySQL tables in ClickHouse for storage and analytics.
Hard drives are cheap nowadays, but storing lots of data in MySQL is not practical and can cause all sorts of performance bottlenecks: backups and restores take longer, schema changes (ALTER TABLE) become painful, and queries slow down as tables and indexes outgrow memory.
The answer is archiving old data. Archiving does not necessarily mean that the data will be permanently removed. Instead, the archived data can be placed into long-term storage (e.g., AWS S3) or loaded into a special purpose database that is optimized for storage (with compression) and reporting. The data then remains available when needed.
There are multiple use cases: sometimes the archived data is rarely (if ever) touched again, and sometimes it must stay available for reporting. In all of those cases, we can move the older data away from MySQL and load it into a “big data” solution. Even if the data needs to remain available, we can still move it off the main MySQL server to another system. In this blog post, I will look at archiving MySQL tables in ClickHouse for long-term storage and real-time queries.
Let’s say we have a 650GB table that stores the history of all transactions, and we want to start archiving it. How can we approach this?
First, we will need to split this table into “old” and “new”. I assume that the table is not partitioned (partitioned tables are much easier to deal with). For example, if we have ten years of data starting in 2008, but only need to keep the last two months in the main MySQL environment, deleting the other 99% of rows from a huge table would be challenging. So instead of deleting 99% of the data from a huge table, we can create a new table, load only the newer data into it, and then rename (swap) the two tables.
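Sketched as SQL, the split-and-swap could look like the following (the `transactions` table and `dt` column names are assumptions for illustration):

```sql
-- Create an empty copy of the table for the "new" data
CREATE TABLE transactions_new LIKE transactions;

-- Copy only the rows we want to keep in MySQL (last two months)
INSERT INTO transactions_new
SELECT * FROM transactions
WHERE dt > NOW() - INTERVAL 2 MONTH;

-- Atomically swap the tables; applications now use the small table
RENAME TABLE transactions     TO transactions_old,
             transactions_new TO transactions;
```

On a live system, rows written between the copy and the rename would need a catch-up pass (or a short write freeze) before the swap.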
Second, we need to move the transactions_old into ClickHouse. This is straightforward — we can pipe data from MySQL to ClickHouse directly. To demonstrate I will use the Wikipedia:Statistics project (a real log of all requests to Wikipedia pages).
Create a table in ClickHouse:
```sql
CREATE TABLE wikistats
(
    id bigint,
    dt DateTime,
    project String,
    subproject String,
    path String,
    hits UInt64,
    size UInt64
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(dt)
ORDER BY dt

Ok.

0 rows in set. Elapsed: 0.010 sec.
```
Please note that I’m using ClickHouse’s new custom partitioning. With it, you do not have to create a separate Date column for partitioning, so the ClickHouse table can have the same structure as the table in MySQL.
Now I can “pipe” data directly from MySQL to ClickHouse:
```shell
mysql --quick -h localhost wikistats -NBe \
  "SELECT concat(id,',\"',dt,'\",\"',project,'\",\"',subproject,'\",\"',path,'\",',hits,',',size) FROM wikistats" | \
  clickhouse-client -d wikistats --query="INSERT INTO wikistats FORMAT CSV"
```
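Since `mysql -B` already emits tab-separated output, a simpler variant (a sketch under the same host/database assumptions) avoids the quoting gymnastics by using ClickHouse’s TSV input format:

```shell
mysql --quick -h localhost wikistats -NBe \
  "SELECT id, dt, project, subproject, path, hits, size FROM wikistats" | \
  clickhouse-client -d wikistats --query="INSERT INTO wikistats FORMAT TSV"
```

This works as long as the text columns (such as `path`) contain no tabs or newlines.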
Third, we need to set up an ongoing archiving process so that data is continuously removed from MySQL and transferred to ClickHouse. To do that we can use the pt-archiver tool (part of Percona Toolkit). In this case, we first archive to a file and then load that file into ClickHouse. Here is an example:
Remove data from MySQL and save it to a file (TSV):

```shell
pt-archiver --source h=localhost,D=wikistats,t=wikistats,i=dt \
  --where "dt <= '2018-01-01 0:00:00'" \
  --file load_to_clickhouse.txt --bulk-delete --limit 100000 --progress=100000
TIME                ELAPSED   COUNT
2018-01-25T18:19:59       0        0
2018-01-25T18:20:08       8   100000
2018-01-25T18:20:17      18   200000
2018-01-25T18:20:26      27   300000
2018-01-25T18:20:36      36   400000
2018-01-25T18:20:45      45   500000
2018-01-25T18:20:54      54   600000
2018-01-25T18:21:03      64   700000
2018-01-25T18:21:13      73   800000
2018-01-25T18:21:23      83   900000
2018-01-25T18:21:32      93  1000000
2018-01-25T18:21:42     102  1100000
...
```
Load the file to ClickHouse:
```shell
cat load_to_clickhouse.txt | clickhouse-client -d wikistats --query="INSERT INTO wikistats FORMAT TSV"
```
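Putting the two steps together, a periodic (e.g., cron-driven) job might look like the sketch below; the 60-day retention window, temp file path, and host names are assumptions, and `date -d` assumes GNU date:

```shell
#!/bin/sh
set -e

# Archive everything older than 60 days (assumed retention window)
CUTOFF=$(date -d '60 days ago' '+%Y-%m-%d %H:%M:%S')
FILE=/tmp/load_to_clickhouse.txt

# Step 1: move old rows out of MySQL into a TSV file
pt-archiver --source h=localhost,D=wikistats,t=wikistats,i=dt \
  --where "dt <= '$CUTOFF'" \
  --file "$FILE" --bulk-delete --limit 100000

# Step 2: load the file into ClickHouse, then clean up
clickhouse-client -d wikistats --query="INSERT INTO wikistats FORMAT TSV" < "$FILE"
rm -f "$FILE"
```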
Newer versions of pt-archiver can output CSV as well:
```shell
pt-archiver --source h=localhost,D=wikistats,t=wikistats,i=dt \
  --where "dt <= '2018-01-01 0:00:00'" \
  --file load_to_clickhouse.csv --output-format csv --bulk-delete --limit 10000 --progress=10000
```
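The resulting CSV file is then loaded the same way as before, only with the CSV input format:

```shell
cat load_to_clickhouse.csv | clickhouse-client -d wikistats --query="INSERT INTO wikistats FORMAT CSV"
```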
Querying the archived data is actually much faster in ClickHouse. Even queries that use index scans in MySQL can be much slower than the equivalent queries in ClickHouse.
For example, in MySQL just counting the number of rows for one year can take 34 seconds (index scan):
```sql
mysql> select count(*) from wikistats where dt between '2017-01-01 00:00:00' and '2017-12-31 00:00:00';
+-----------+
| count(*)  |
+-----------+
| 103161991 |
+-----------+
1 row in set (34.82 sec)

mysql> explain select count(*) from wikistats where dt between '2017-01-01 00:00:00' and '2017-12-31 00:00:00'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: wikistats
   partitions: NULL
         type: range
possible_keys: dt
          key: dt
      key_len: 6
          ref: NULL
         rows: 227206802
     filtered: 100.00
        Extra: Using where; Using index
1 row in set, 1 warning (0.00 sec)
```
In ClickHouse, it only takes 0.062 sec:
```sql
:) select count(*) from wikistats where dt between toDateTime('2017-01-01 00:00:00') and toDateTime('2017-12-31 00:00:00');

SELECT count(*)
FROM wikistats
WHERE (dt >= toDateTime('2017-01-01 00:00:00')) AND (dt <= toDateTime('2017-12-31 00:00:00'))

┌───count()─┐
│ 103161991 │
└───────────┘

1 rows in set. Elapsed: 0.062 sec. Processed 103.16 million rows, 412.65 MB (1.67 billion rows/s., 6.68 GB/s.)
```
In my previous blog on comparing ClickHouse to Apache Spark and MariaDB, I also compared disk size. Usually, we can expect a 5x to 10x decrease in disk size in ClickHouse due to compression. Wikipedia:Statistics, for example, contains actual URIs, which can be quite long because of article names and search phrases, and such repetitive text compresses very well. If we store only integers, or MD5/SHA1 hashes instead of the actual URIs, we can expect a much lower compression ratio (e.g., 3x). Even a 3x compression ratio is still pretty good for long-term storage.
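To put those ratios in perspective, here is a back-of-the-envelope calculation for the 650GB table from the earlier example (the ratios are the rough estimates quoted here, not measurements):

```python
table_size_gb = 650  # MySQL table size from the example above

# Rough ClickHouse footprint under the compression ratios discussed
for ratio in (3, 5, 10):
    archived_gb = table_size_gb / ratio
    print(f"{ratio}x compression: ~{archived_gb:.0f} GB")
```

So even the pessimistic 3x estimate frees up roughly two-thirds of the original disk space.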
As the data in MySQL keeps growing, performance for all queries keeps decreasing. Queries that originally took milliseconds can start taking seconds or more, and making them fast again requires a lot of changes (application code, MySQL configuration, etc.).
The main goal of archiving data is to increase performance (“make MySQL fast again”), decrease costs, and improve ease of maintenance (backup/restore, cloning replicas, etc.). Archiving to ClickHouse lets you preserve old data and keep it available for reports.