This blog discusses column compression with an optional predefined dictionary.
Compression, more compression with different algorithms, compress again, compress multiple times! 🙂 Compression is a hot topic in our lives.
In general, testing new things is great if the processes are well-described and easy to follow. Let’s try to think like a QA engineer: the first golden rule of QA is “everything is buggy, life is full of bugs: good night bugs, good morning bugs, hello my old bug friends.”
The second golden rule of QA is “OK, now let’s find a way to catch a bug — but remember that your methods can be buggy, too.”
Remember: always test! No bugs, no happiness!
When you start to test, the first goal is getting an idea of what is going on. This blog will demonstrate a test scenario for column compression with an optional predefined dictionary. For reference on column compression, read Compressed columns with dictionaries.”
To begin, let’s set up a basic environment:
The installation process requires installing Percona Server which is already documented here -> PS 5.6 installation
Secondly, find an already existing test: xtradb_compressed_columns_ibd_sizes.test.
Third, write a simple script to get started:
|
1 |
import mysql.connector<br><br>cnx = mysql.connector.connect(user='msandbox', password='msandbox',<br> host='127.0.0.1',<br> database='dbtest',<br> port=22896,<br> autocommit=True)<br>cursor = cnx.cursor()<br><br>crt_comp_dic = "CREATE COMPRESSION_DICTIONARY names2 ('Bartholomew')"<br>cursor.execute(crt_comp_dic)<br><br>table_t1 = "CREATE TABLE t1(id INT,a BLOB) ENGINE=InnoDB"<br>table_t2 = "CREATE TABLE t2(id INT,a BLOB COLUMN_FORMAT COMPRESSED) ENGINE=InnoDB"<br>table_t3 = "CREATE TABLE t3(id INT,a BLOB COLUMN_FORMAT COMPRESSED WITH COMPRESSION_DICTIONARY names) ENGINE=InnoDB"<br><br>cursor.execute(table_t1);<br>cursor.execute(table_t2);<br>cursor.execute(table_t3);<br><br>insert_stmt = "insert into {} values({},repeat('Bartholomew', 128))"<br><br>for i in range(0, 100000):<br> cursor.execute(insert_stmt.format('t1', int(i)))<br> print insert_stmt.format('t1', int(i))<br> cursor.execute(insert_stmt.format('t2', int(i)))<br> print insert_stmt.format('t2', int(i))<br> cursor.execute(insert_stmt.format('t3', int(i)))<br> print insert_stmt.format('t3', int(i))<br><br><br>cursor.close()<br>cnx.close() |
As you might notice, column compression might be with or without a compression dictionary. The visible difference, of course, is in the size of the tables. If you want to compress columns based on a predefined dictionary, you should create it with frequently used data. It is possible to create an empty dictionary, but it will have no effect. (See here: #1628231.)
The result of running this script is:
100.000 rows tables
t1 -> uncompressedt2 -> compressed column, t3 -> compressed column with compression dictionary, ‘names2’ dictionary
t2 -> compressed column,
t3 -> compressed column with compression dictionary, ‘names2’ dictionary.
Table size difference:
|
1 |
sh@sh-ubuntu:~/sandboxes/rsandbox_percona-server-5_6_31/master/data/dbtest$ ls -lth | grep .ibd<br>-rw-rw---- 1 sh sh 168M Sep 29 23:43 t1.ibd<br>-rw-rw---- 1 sh sh 15M Sep 29 23:43 t2.ibd<br>-rw-rw---- 1 sh sh 14M Sep 29 23:43 t3.ibd<br> |
After running an optimize table:
|
1 |
master [localhost] {msandbox} (dbtest) > optimize table t1;<br>+-----------+----------+----------+-------------------------------------------------------------------+<br>| Table | Op | Msg_type | Msg_text |<br>+-----------+----------+----------+-------------------------------------------------------------------+<br>| dbtest.t1 | optimize | note | Table does not support optimize, doing recreate + analyze instead |<br>| dbtest.t1 | optimize | status | OK |<br>+-----------+----------+----------+-------------------------------------------------------------------+<br>2 rows in set (1 min 35.88 sec)<br><br>master [localhost] {msandbox} (dbtest) > optimize table t2;<br>+-----------+----------+----------+-------------------------------------------------------------------+<br>| Table | Op | Msg_type | Msg_text |<br>+-----------+----------+----------+-------------------------------------------------------------------+<br>| dbtest.t2 | optimize | note | Table does not support optimize, doing recreate + analyze instead |<br>| dbtest.t2 | optimize | status | OK |<br>+-----------+----------+----------+-------------------------------------------------------------------+<br>2 rows in set (11.82 sec)<br><br>master [localhost] {msandbox} (dbtest) > optimize table t3;<br>+-----------+----------+----------+-------------------------------------------------------------------+<br>| Table | Op | Msg_type | Msg_text |<br>+-----------+----------+----------+-------------------------------------------------------------------+<br>| dbtest.t3 | optimize | note | Table does not support optimize, doing recreate + analyze instead |<br>| dbtest.t3 | optimize | status | OK |<br>+-----------+----------+----------+-------------------------------------------------------------------+<br>2 rows in set (7.89 sec)<br> |
The resulted size:
|
1 |
sh@sh-ubuntu:~/sandboxes/rsandbox_percona-server-5_6_31/master/data/dbtest$ ls -lh | grep .ibd<br>-rw-rw---- 1 sh sh 160M Sep 29 23:52 t1.ibd<br>-rw-rw---- 1 sh sh 8.0M Sep 29 23:52 t2.ibd<br>-rw-rw---- 1 sh sh 7.0M Sep 29 23:52 t3.ibd |
I want more:
|
1 |
master [localhost] {msandbox} (dbtest) > alter table t1 row_format=compressed;<br>Query OK, 0 rows affected (2 min 38.85 sec)<br>Records: 0 Duplicates: 0 Warnings: 0<br><br>master [localhost] {msandbox} (dbtest) > alter table t2 row_format=compressed;<br>Query OK, 0 rows affected (14.41 sec)<br>Records: 0 Duplicates: 0 Warnings: 0<br><br>master [localhost] {msandbox} (dbtest) > alter table t3 row_format=compressed;<br>Query OK, 0 rows affected (10.74 sec)<br>Records: 0 Duplicates: 0 Warnings: 0<br> |
Using ROW_FORMAT=COMPRESSED requires innodb_file_format to be > Antelope. But this is not true for COLUMN_FORMAT.
Again, check the size:
|
1 |
sh@sh-ubuntu:~/sandboxes/rsandbox_percona-server-5_6_31/master/data/dbtest$ ls -lh | grep .ibd<br>-rw-rw---- 1 sh sh 76M Sep 29 23:57 t1.ibd<br>-rw-rw---- 1 sh sh 4.0M Sep 29 23:58 t2.ibd<br>-rw-rw---- 1 sh sh 4.0M Sep 29 23:58 t3.ibd |
Question: How do I get information about column compression dictionary and tables? Answer: tables from information_schema:
|
1 |
master [localhost] {msandbox} ((none)) > SELECT * FROM information_schema.xtradb_zip_dict;<br>+----+--------+-------------+<br>| id | name | zip_dict |<br>+----+--------+-------------+<br>| 1 | names | Bartholomew |<br>| 2 | names2 | Bartholomew |<br>+----+--------+-------------+<br>2 rows in set (0.00 sec)<br><br>master [localhost] {msandbox} ((none)) > SELECT * FROM information_schema.xtradb_zip_dict_cols;<br>+----------+------------+---------+<br>| table_id | column_pos | dict_id |<br>+----------+------------+---------+<br>| 67 | 1 | 1 |<br>+----------+------------+---------+<br>1 row in set (0.00 sec)<br> |
Question: How do I drop the compression dictionary? Answer: if it is in use, you will get:
|
1 |
master [localhost] {msandbox} (dbtest) > drop COMPRESSION_DICTIONARY `names`;<br>ERROR 1894 (HY000): Compression dictionary 'names' is in use |
Before dropping it, make sure there are no tables using the dictionary. There is an extreme condition where you are unable to drop the dictionary (see #1628824).
Question: Great! How about mysqldump? Answer: read here: mysqldump.
I might make this the topic of a dedicated post. Thanks for reading!