October 20, 2014

Data compression in InnoDB for text and blob fields

Have you wanted to compress only certain types of columns in a table while leaving other columns uncompressed? While working on a customer case this week I saw an interesting problem where a table had many heavily utilized TEXT fields with some read queries exceeding 500MB (!!), and stored in a 100GB table. In this case we were not allowed to make any query or application logic changes so we chose to implement the Barracuda file format and utilize compressed rows as this appealed to me for this mostly-read application. One quick way you can see if your rows will benefit from compression would be to read Peter Zaitsev’s blog post and execute:

compare this to:

In our case we saw about a 75% reduction when the TEXT field was compressed which we felt indicated there would be a benefit derived from table compression.

With the original InnoDB Antelope file format you have the choice of ROW_FORMAT=COMPACT and ROW_FORMAT=REDUNDANT where InnoDB stored the first 768 bytes of variable length columns (BLOB, VARCHAR, TEXT) in the index record, and the remainder stored in overflow pages.  COMPACT became the default after MySQL 5.0.3 and has a more compact representation for nulls and variable-length fields than REDUNDANT.

Using InnoDB’s new Barracuda file format (available since InnoDB plugin 1.1 or MySQL 5.5) you can now leverage table compression by specifying ROW_FORMAT=COMPRESSED.  In our case we only wanted MySQL to try to move the larger (greater than 16KB) TEXT fields  off-page so we utilized the KEY_BLOCK_SIZE=16 directive.  This means that each TEXT / BLOB field that exceeds 16KB it would be stored in it’s own page (less the 20 byte pointer stored in the index page).  Based on our analysis 75% of the blobs stored in the table were over 8KB, which were responsible for 90% of space usage hence compressing only externally stored blobs provided substaintial advantages.  Why did we choose a KEY_BLOCK_SIZE that is the same value of the InnoDB page size of 16KB?  As the fine MySQL manual states:

This setting may still be useful for tables with many long BLOBVARCHAR or TEXT columns, because such values often do compress well, and might therefore require fewer “overflow” pages.

I did not test with a smaller KEY_BLOCK_SIZE as we had minimal time to effect the table compression modification (given the long run-time of the ALTER TABLE), you may find your application benefits from a different KEY_BLOCK_SIZE value.  Also note that you need to enable the  dynamic variable innodb_file_format=BARRACUDA (don’t forget to set it in my.cnf!):

One caveat: you must be running with innodb_file_per_table=1 as the InnoDB system tablespace cannot be compressed, see this page for further details on how to enable compression for a table.

To utilize Barracuda format tables you will need to create them new and migrate data, or affect existing tables with an ALTER TABLE statement. As table compression is table specific, ROW_FORMAT and KEY_BLOCK_SIZE directives are passed via CREATE TABLE or ALTER TABLE statements. In our case, we chose to re-build the table using ALTER TABLE via a null-operation like this:

In our case even though the customer had a 3GHz 24-core machine the ALTER TABLE was progressing slowly as it was bound to a single CPU while compressing the data. Just have patience. :) Keep in mind too that if you started with a 100GB table and assuming you know your approximate compression rate, you will be left with a considerably smaller on-disk footprint so ideally you will be able to postpone that purchase of additional disk capacity.

So what was the real-world outcome of this exercise?  We were able to show a 70% improvement in queries against this table when the TEXT fields were not part of the query request due to Barracuda  not storing 768 bytes of the blob on field, and reduce the table down to 30GB.  Happy customer :)

One parting idea: you may be able to leverage pt-online-schema-change from Percona Toolkit 2.1 in order to modify the table if you cannot sustain the blocking effects of a traditional ALTER TABLE statement.

I hope this helps you understand a use case where table compression can be beneficial when your workload is mostly-read. Thanks for reading my first percona.com/blog blog post!

About Michael Coburn

Michael joined Percona as a Consultant in 2012 after having worked with high volume stock photography websites and email service provider platforms. WIth a foundation in Systems Administration, Michael enjoys working with SAN technologies and high availability solutions. A Canadian, Michael currently lives in the Raleigh, North Carolina area with his wife, two children, and two dogs.

Comments

  1. aftab says:

    Why do you suggest table compression is helpful when workload is mostly-read only? Is not helpful when the load is mostly write and why?

  2. Michael Coburn says:

    Hi aftab, sorry I didn’t reply to this sooner.

    You gain a significant performance boost by utilizing compression on mostly-read workloads because there is less disk IO required to retrieve InnoDB pages, which translates (usually) into faster query response time as the database server is blocked less on disk IO.

    Your mileage may vary on write-heavy workloads if you do not have many cores or fast enough cores as you then may move the bottleneck to the CPU while waiting for compression to run on the relevant pages before committing to disk. For this reason it is less clear-cut to recommend compression — we recommend that you test your specific queries on your own dataset before implementing compression.

    Thanks for the question aftab !

Speak Your Mind

*