Where the open source community meets: Secure your spot for Percona Live Amsterdam! - Register

Downloads

Blog

Thoughts About Column Compression, with Optional Predefined Dictionary

November 10, 2016

Author

Shahriyar Rzayev

Insight for DBAs

Share this Post:

This blog discusses column compression with an optional predefined dictionary.

Compression, more compression with different algorithms, compress again, compress multiple times! 🙂 Compression is a hot topic in our lives.

In general, testing new things is great if the processes are well-described and easy to follow. Let’s try to think like a QA engineer: the first golden rule of QA is “everything is buggy, life is full of bugs: good night bugs, good morning bugs, hello my old bug friends.”

The second golden rule of QA is “OK, now let’s find a way to catch a bug — but remember that your methods can be buggy, too.”

Remember: always test! No bugs, no happiness!

When you start to test, the first goal is getting an idea of what is going on. This blog will demonstrate a test scenario for column compression with an optional predefined dictionary. For reference on column compression, read Compressed columns with dictionaries.”

To begin, let’s set up a basic environment:
The installation process requires installing Percona Server which is already documented here -> PS 5.6 installation

Secondly, find an already existing test: xtradb_compressed_columns_ibd_sizes.test.

Third, write a simple script to get started:

import mysql.connector

cnx = mysql.connector.connect(user='msandbox', password='msandbox',
                              host='127.0.0.1',
                              database='dbtest',
                              port=22896,
                              autocommit=True)
cursor = cnx.cursor()

crt_comp_dic = "CREATE COMPRESSION_DICTIONARY names2 ('Bartholomew')"
cursor.execute(crt_comp_dic)

table_t1 = "CREATE TABLE t1(id INT,a BLOB) ENGINE=InnoDB"
table_t2 = "CREATE TABLE t2(id INT,a BLOB COLUMN_FORMAT COMPRESSED) ENGINE=InnoDB"
table_t3 = "CREATE TABLE t3(id INT,a BLOB COLUMN_FORMAT COMPRESSED WITH COMPRESSION_DICTIONARY names) ENGINE=InnoDB"

cursor.execute(table_t1);
cursor.execute(table_t2);
cursor.execute(table_t3);

insert_stmt = "insert into {} values({},repeat('Bartholomew', 128))"

for i in range(0, 100000):
	cursor.execute(insert_stmt.format('t1', int(i)))
	print insert_stmt.format('t1', int(i))
	cursor.execute(insert_stmt.format('t2', int(i)))
	print insert_stmt.format('t2', int(i))
	cursor.execute(insert_stmt.format('t3', int(i)))
	print insert_stmt.format('t3', int(i))


cursor.close()
cnx.close()

import mysql.connector

cnx = mysql.connector.connect(user='msandbox', password='msandbox',

host='127.0.0.1',

database='dbtest',

port=22896,

autocommit=True)

cursor = cnx.cursor()

crt_comp_dic = "CREATE COMPRESSION_DICTIONARY names2 ('Bartholomew')"

cursor.execute(crt_comp_dic)

table_t1 = "CREATE TABLE t1(id INT,a BLOB) ENGINE=InnoDB"

table_t2 = "CREATE TABLE t2(id INT,a BLOB COLUMN_FORMAT COMPRESSED) ENGINE=InnoDB"

table_t3 = "CREATE TABLE t3(id INT,a BLOB COLUMN_FORMAT COMPRESSED WITH COMPRESSION_DICTIONARY names) ENGINE=InnoDB"

cursor.execute(table_t1);

cursor.execute(table_t2);

cursor.execute(table_t3);

insert_stmt = "insert into {} values({},repeat('Bartholomew', 128))"

for i in range(0, 100000):

cursor.execute(insert_stmt.format('t1', int(i)))

print insert_stmt.format('t1', int(i))

cursor.execute(insert_stmt.format('t2', int(i)))

print insert_stmt.format('t2', int(i))

cursor.execute(insert_stmt.format('t3', int(i)))

print insert_stmt.format('t3', int(i))

cursor.close()

cnx.close()

As you might notice, column compression might be with or without a compression dictionary. The visible difference, of course, is in the size of the tables. If you want to compress columns based on a predefined dictionary, you should create it with frequently used data. It is possible to create an empty dictionary, but it will have no effect. (See here: #1628231.)

The result of running this script is:

100.000 rows tables

t1 -> uncompressedt2 -> compressed column, t3 -> compressed column with compression dictionary, ‘names2’ dictionary

t2 -> compressed column,

t3 -> compressed column with compression dictionary, ‘names2’ dictionary.

Table size difference:

sh@sh-ubuntu:~/sandboxes/rsandbox_percona-server-5_6_31/master/data/dbtest$ ls -lth | grep .ibd
-rw-rw---- 1 sh sh 168M Sep 29 23:43 t1.ibd
-rw-rw---- 1 sh sh  15M Sep 29 23:43 t2.ibd
-rw-rw---- 1 sh sh  14M Sep 29 23:43 t3.ibd

sh@sh-ubuntu:~/sandboxes/rsandbox_percona-server-5_6_31/master/data/dbtest$ ls -lth | grep .ibd

-rw-rw---- 1 sh sh 168M Sep 29 23:43 t1.ibd

-rw-rw---- 1 sh sh 15M Sep 29 23:43 t2.ibd

-rw-rw---- 1 sh sh 14M Sep 29 23:43 t3.ibd

After running an optimize table:

master [localhost] {msandbox} (dbtest) > optimize table t1;
+-----------+----------+----------+-------------------------------------------------------------------+
| Table     | Op       | Msg_type | Msg_text                                                          |
+-----------+----------+----------+-------------------------------------------------------------------+
| dbtest.t1 | optimize | note     | Table does not support optimize, doing recreate + analyze instead |
| dbtest.t1 | optimize | status   | OK                                                                |
+-----------+----------+----------+-------------------------------------------------------------------+
2 rows in set (1 min 35.88 sec)

master [localhost] {msandbox} (dbtest) > optimize table t2;
+-----------+----------+----------+-------------------------------------------------------------------+
| Table     | Op       | Msg_type | Msg_text                                                          |
+-----------+----------+----------+-------------------------------------------------------------------+
| dbtest.t2 | optimize | note     | Table does not support optimize, doing recreate + analyze instead |
| dbtest.t2 | optimize | status   | OK                                                                |
+-----------+----------+----------+-------------------------------------------------------------------+
2 rows in set (11.82 sec)

master [localhost] {msandbox} (dbtest) > optimize table t3;
+-----------+----------+----------+-------------------------------------------------------------------+
| Table     | Op       | Msg_type | Msg_text                                                          |
+-----------+----------+----------+-------------------------------------------------------------------+
| dbtest.t3 | optimize | note     | Table does not support optimize, doing recreate + analyze instead |
| dbtest.t3 | optimize | status   | OK                                                                |
+-----------+----------+----------+-------------------------------------------------------------------+
2 rows in set (7.89 sec)

master [localhost] {msandbox} (dbtest) > optimize table t1;

+-----------+----------+----------+-------------------------------------------------------------------+

+-----------+----------+----------+-------------------------------------------------------------------+

+-----------+----------+----------+-------------------------------------------------------------------+

2 rows in set (1 min 35.88 sec)

master [localhost] {msandbox} (dbtest) > optimize table t2;

+-----------+----------+----------+-------------------------------------------------------------------+

+-----------+----------+----------+-------------------------------------------------------------------+

+-----------+----------+----------+-------------------------------------------------------------------+

2 rows in set (11.82 sec)

master [localhost] {msandbox} (dbtest) > optimize table t3;

+-----------+----------+----------+-------------------------------------------------------------------+

+-----------+----------+----------+-------------------------------------------------------------------+

+-----------+----------+----------+-------------------------------------------------------------------+

2 rows in set (7.89 sec)

The resulted size:

sh@sh-ubuntu:~/sandboxes/rsandbox_percona-server-5_6_31/master/data/dbtest$ ls -lh | grep .ibd
-rw-rw---- 1 sh sh 160M Sep 29 23:52 t1.ibd
-rw-rw---- 1 sh sh 8.0M Sep 29 23:52 t2.ibd
-rw-rw---- 1 sh sh 7.0M Sep 29 23:52 t3.ibd

sh@sh-ubuntu:~/sandboxes/rsandbox_percona-server-5_6_31/master/data/dbtest$ ls -lh | grep .ibd

-rw-rw---- 1 sh sh 160M Sep 29 23:52 t1.ibd

-rw-rw---- 1 sh sh 8.0M Sep 29 23:52 t2.ibd

-rw-rw---- 1 sh sh 7.0M Sep 29 23:52 t3.ibd

I want more:

master [localhost] {msandbox} (dbtest) > alter table t1 row_format=compressed;
Query OK, 0 rows affected (2 min 38.85 sec)
Records: 0  Duplicates: 0  Warnings: 0

master [localhost] {msandbox} (dbtest) > alter table t2 row_format=compressed;
Query OK, 0 rows affected (14.41 sec)
Records: 0  Duplicates: 0  Warnings: 0

master [localhost] {msandbox} (dbtest) > alter table t3 row_format=compressed;
Query OK, 0 rows affected (10.74 sec)
Records: 0  Duplicates: 0  Warnings: 0

master [localhost] {msandbox} (dbtest) > alter table t1 row_format=compressed;

Query OK, 0 rows affected (2 min 38.85 sec)

Records: 0 Duplicates: 0 Warnings: 0

master [localhost] {msandbox} (dbtest) > alter table t2 row_format=compressed;

Query OK, 0 rows affected (14.41 sec)

Records: 0 Duplicates: 0 Warnings: 0

master [localhost] {msandbox} (dbtest) > alter table t3 row_format=compressed;

Query OK, 0 rows affected (10.74 sec)

Records: 0 Duplicates: 0 Warnings: 0

Using ROW_FORMAT=COMPRESSED requires innodb_file_format to be > Antelope. But this is not true for COLUMN_FORMAT.

Again, check the size:

sh@sh-ubuntu:~/sandboxes/rsandbox_percona-server-5_6_31/master/data/dbtest$ ls -lh | grep .ibd
-rw-rw---- 1 sh sh  76M Sep 29 23:57 t1.ibd
-rw-rw---- 1 sh sh 4.0M Sep 29 23:58 t2.ibd
-rw-rw---- 1 sh sh 4.0M Sep 29 23:58 t3.ibd

sh@sh-ubuntu:~/sandboxes/rsandbox_percona-server-5_6_31/master/data/dbtest$ ls -lh | grep .ibd

-rw-rw---- 1 sh sh 76M Sep 29 23:57 t1.ibd

-rw-rw---- 1 sh sh 4.0M Sep 29 23:58 t2.ibd

-rw-rw---- 1 sh sh 4.0M Sep 29 23:58 t3.ibd

Question: How do I get information about column compression dictionary and tables? Answer: tables from information_schema:

master [localhost] {msandbox} ((none)) > SELECT * FROM information_schema.xtradb_zip_dict;
+----+--------+-------------+
| id | name   | zip_dict    |
+----+--------+-------------+
|  1 | names  | Bartholomew |
|  2 | names2 | Bartholomew |
+----+--------+-------------+
2 rows in set (0.00 sec)

master [localhost] {msandbox} ((none)) > SELECT * FROM information_schema.xtradb_zip_dict_cols;
+----------+------------+---------+
| table_id | column_pos | dict_id |
+----------+------------+---------+
|       67 |          1 |       1 |
+----------+------------+---------+
1 row in set (0.00 sec)

master [localhost] {msandbox} ((none)) > SELECT * FROM information_schema.xtradb_zip_dict;

+----+--------+-------------+

| id | name | zip_dict |

+----+--------+-------------+

| 1 | names | Bartholomew |

| 2 | names2 | Bartholomew |

+----+--------+-------------+

2 rows in set (0.00 sec)

master [localhost] {msandbox} ((none)) > SELECT * FROM information_schema.xtradb_zip_dict_cols;

+----------+------------+---------+

| table_id | column_pos | dict_id |

+----------+------------+---------+

| 67 | 1 | 1 |

+----------+------------+---------+

1 row in set (0.00 sec)

Question: How do I drop the compression dictionary? Answer: if it is in use, you will get:

master [localhost] {msandbox} (dbtest) > drop COMPRESSION_DICTIONARY `names`;
ERROR 1894 (HY000): Compression dictionary 'names' is in use

1 2	master [localhost] {msandbox} (dbtest) > drop COMPRESSION_DICTIONARY `names`; ERROR 1894 (HY000): Compression dictionary 'names' is in use

Before dropping it, make sure there are no tables using the dictionary. There is an extreme condition where you are unable to drop the dictionary (see #1628824).

Question: Great! How about mysqldump? Answer: read here: mysqldump.

I might make this the topic of a dedicated post. Thanks for reading!

0 0 votes

Article Rating

1 Comment

Oldest

Newest Most Voted

Admin

Peter Zaitsev

9 years ago

Hi Shahriyar,

I would point out you use example of relatively long and highly compressible content – in this case the custom dictionary does not help too much. The real benefit of compression dictionary is when the document is relatively short so by itself it does not allow too much compression. For example It would be great to see the different shown for JSON documents of 500 bytes or so where dictionary contains typical field names and values.