This blog discusses column compression with an optional predefined dictionary.
Compression, more compression with different algorithms, compress again, compress multiple times! 🙂 Compression is a hot topic in our lives.
In general, testing new things is great if the processes are well-described and easy to follow. Let’s try to think like a QA engineer: the first golden rule of QA is “everything is buggy, life is full of bugs: good night bugs, good morning bugs, hello my old bug friends.”
The second golden rule of QA is “OK, now let’s find a way to catch a bug — but remember that your methods can be buggy, too.”
Remember: always test! No bugs, no happiness!
When you start to test, the first goal is getting an idea of what is going on. This blog will demonstrate a test scenario for column compression with an optional predefined dictionary. For reference on column compression, read Compressed columns with dictionaries.”
To begin, let’s set up a basic environment:
The installation process requires installing Percona Server which is already documented here -> PS 5.6 installation
Secondly, find an already existing test: xtradb_compressed_columns_ibd_sizes.test.
Third, write a simple script to get started:
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 |
import mysql.connector cnx = mysql.connector.connect(user='msandbox', password='msandbox', host='127.0.0.1', database='dbtest', port=22896, autocommit=True) cursor = cnx.cursor() crt_comp_dic = "CREATE COMPRESSION_DICTIONARY names2 ('Bartholomew')" cursor.execute(crt_comp_dic) table_t1 = "CREATE TABLE t1(id INT,a BLOB) ENGINE=InnoDB" table_t2 = "CREATE TABLE t2(id INT,a BLOB COLUMN_FORMAT COMPRESSED) ENGINE=InnoDB" table_t3 = "CREATE TABLE t3(id INT,a BLOB COLUMN_FORMAT COMPRESSED WITH COMPRESSION_DICTIONARY names) ENGINE=InnoDB" cursor.execute(table_t1); cursor.execute(table_t2); cursor.execute(table_t3); insert_stmt = "insert into {} values({},repeat('Bartholomew', 128))" for i in range(0, 100000): cursor.execute(insert_stmt.format('t1', int(i))) print insert_stmt.format('t1', int(i)) cursor.execute(insert_stmt.format('t2', int(i))) print insert_stmt.format('t2', int(i)) cursor.execute(insert_stmt.format('t3', int(i))) print insert_stmt.format('t3', int(i)) cursor.close() cnx.close() |
As you might notice, column compression might be with or without a compression dictionary. The visible difference, of course, is in the size of the tables. If you want to compress columns based on a predefined dictionary, you should create it with frequently used data. It is possible to create an empty dictionary, but it will have no effect. (See here: #1628231.)
The result of running this script is:
100.000 rows tables
t1 -> uncompressedt2 -> compressed column, t3 -> compressed column with compression dictionary, ‘names2’ dictionary
t2 -> compressed column,
t3 -> compressed column with compression dictionary, ‘names2’ dictionary.
Table size difference:
|
1 2 3 4 |
sh@sh-ubuntu:~/sandboxes/rsandbox_percona-server-5_6_31/master/data/dbtest$ ls -lth | grep .ibd -rw-rw---- 1 sh sh 168M Sep 29 23:43 t1.ibd -rw-rw---- 1 sh sh 15M Sep 29 23:43 t2.ibd -rw-rw---- 1 sh sh 14M Sep 29 23:43 t3.ibd |
After running an optimize table:
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 |
master [localhost] {msandbox} (dbtest) > optimize table t1; +-----------+----------+----------+-------------------------------------------------------------------+ | Table | Op | Msg_type | Msg_text | +-----------+----------+----------+-------------------------------------------------------------------+ | dbtest.t1 | optimize | note | Table does not support optimize, doing recreate + analyze instead | | dbtest.t1 | optimize | status | OK | +-----------+----------+----------+-------------------------------------------------------------------+ 2 rows in set (1 min 35.88 sec) master [localhost] {msandbox} (dbtest) > optimize table t2; +-----------+----------+----------+-------------------------------------------------------------------+ | Table | Op | Msg_type | Msg_text | +-----------+----------+----------+-------------------------------------------------------------------+ | dbtest.t2 | optimize | note | Table does not support optimize, doing recreate + analyze instead | | dbtest.t2 | optimize | status | OK | +-----------+----------+----------+-------------------------------------------------------------------+ 2 rows in set (11.82 sec) master [localhost] {msandbox} (dbtest) > optimize table t3; +-----------+----------+----------+-------------------------------------------------------------------+ | Table | Op | Msg_type | Msg_text | +-----------+----------+----------+-------------------------------------------------------------------+ | dbtest.t3 | optimize | note | Table does not support optimize, doing recreate + analyze instead | | dbtest.t3 | optimize | status | OK | +-----------+----------+----------+-------------------------------------------------------------------+ 2 rows in set (7.89 sec) |
The resulted size:
|
1 2 3 4 |
sh@sh-ubuntu:~/sandboxes/rsandbox_percona-server-5_6_31/master/data/dbtest$ ls -lh | grep .ibd -rw-rw---- 1 sh sh 160M Sep 29 23:52 t1.ibd -rw-rw---- 1 sh sh 8.0M Sep 29 23:52 t2.ibd -rw-rw---- 1 sh sh 7.0M Sep 29 23:52 t3.ibd |
I want more:
|
1 2 3 4 5 6 7 8 9 10 11 |
master [localhost] {msandbox} (dbtest) > alter table t1 row_format=compressed; Query OK, 0 rows affected (2 min 38.85 sec) Records: 0 Duplicates: 0 Warnings: 0 master [localhost] {msandbox} (dbtest) > alter table t2 row_format=compressed; Query OK, 0 rows affected (14.41 sec) Records: 0 Duplicates: 0 Warnings: 0 master [localhost] {msandbox} (dbtest) > alter table t3 row_format=compressed; Query OK, 0 rows affected (10.74 sec) Records: 0 Duplicates: 0 Warnings: 0 |
Using ROW_FORMAT=COMPRESSED requires innodb_file_format to be > Antelope. But this is not true for COLUMN_FORMAT.
Again, check the size:
|
1 2 3 4 |
sh@sh-ubuntu:~/sandboxes/rsandbox_percona-server-5_6_31/master/data/dbtest$ ls -lh | grep .ibd -rw-rw---- 1 sh sh 76M Sep 29 23:57 t1.ibd -rw-rw---- 1 sh sh 4.0M Sep 29 23:58 t2.ibd -rw-rw---- 1 sh sh 4.0M Sep 29 23:58 t3.ibd |
Question: How do I get information about column compression dictionary and tables? Answer: tables from information_schema:
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
master [localhost] {msandbox} ((none)) > SELECT * FROM information_schema.xtradb_zip_dict; +----+--------+-------------+ | id | name | zip_dict | +----+--------+-------------+ | 1 | names | Bartholomew | | 2 | names2 | Bartholomew | +----+--------+-------------+ 2 rows in set (0.00 sec) master [localhost] {msandbox} ((none)) > SELECT * FROM information_schema.xtradb_zip_dict_cols; +----------+------------+---------+ | table_id | column_pos | dict_id | +----------+------------+---------+ | 67 | 1 | 1 | +----------+------------+---------+ 1 row in set (0.00 sec) |
Question: How do I drop the compression dictionary? Answer: if it is in use, you will get:
|
1 2 |
master [localhost] {msandbox} (dbtest) > drop COMPRESSION_DICTIONARY `names`; ERROR 1894 (HY000): Compression dictionary 'names' is in use |
Before dropping it, make sure there are no tables using the dictionary. There is an extreme condition where you are unable to drop the dictionary (see #1628824).
Question: Great! How about mysqldump? Answer: read here: mysqldump.
I might make this the topic of a dedicated post. Thanks for reading!