Thoughts About Column Compression, with Optional Predefined Dictionary

column compressionThis blog discusses column compression with an optional predefined dictionary.

Compression, more compression with different algorithms, compress again, compress multiple times! ūüôā Compression is a hot topic in our lives.

In general, testing new things is great if the processes are well-described and easy to follow.¬†Let’s try to think like a QA engineer: the first golden rule of QA is “everything is buggy, life is full of bugs: good night bugs, good morning bugs, hello my old bug¬†friends.”

The second golden rule of QA is “OK, now let’s find a way to catch a bug — but remember that your methods can be buggy, too.”

Remember: always test! No bugs, no happiness!

When you start to test, the first goal is getting an¬†idea¬†of what is going on. This blog will demonstrate a test scenario for column compression with an optional predefined dictionary.¬†For reference on column compression, read¬†Compressed columns with dictionaries.”

To begin, let’s set up a basic environment:
The installation process requires installing Percona Server which is already documented here -> PS 5.6 installation

Secondly, find an already existing test: xtradb_compressed_columns_ibd_sizes.test.

Third, write a simple script to get started:

As you might notice, column compression might be with or without a compression dictionary. The visible difference, of course, is in the size of the tables. If you want to compress columns based on a predefined dictionary, you should create it with frequently used data. It is possible to create an empty dictionary, but it will have no effect. (See here: #1628231.)

The result of running this script is:

100.000 rows tables

t1 -> uncompressedt2 -> compressed column, t3 -> compressed column with compression dictionary, ‘names2’ dictionary

t2 -> compressed column,

t3 -> compressed column with compression dictionary, ‘names2’ dictionary.

Table size difference:

After running an optimize table:

The resulted size:

I want more:

Using  ROW_FORMAT=COMPRESSED requires innodb_file_format  to be >  Antelope. But this is not true for COLUMN_FORMAT.

Again, check the size:

Question: How do I get information about column compression dictionary and tables? Answer: tables from information_schema:

Question: How do I drop the compression dictionary? Answer: if it is in use, you will get:

Before dropping it, make sure there are no tables using the dictionary. There is an extreme condition where you are unable to drop the dictionary (see #1628824).

Question: Great! How about mysqldump? Answer: read here: mysqldump.

I might make this the topic of a dedicated post. Thanks for reading!

Share this post

Comment (1)

  • Peter Zaitsev Reply

    Hi Shahriyar,

    I would point out you use example of relatively long and highly compressible content – in this case the custom dictionary does not help too much. The real benefit of compression dictionary is when the document is relatively short so by itself it does not allow too much compression. For example It would be great to see the different shown for JSON documents of 500 bytes or so where dictionary contains typical field names and values.

    November 11, 2016 at 4:52 am

Leave a Reply