September 22, 2014

MySQL Blob Compression performance benefits

When you’re storing text of significant size in the table it often makes sense to keep it compressed. Unfortunately MySQL does not provide compressed BLOB/TEXT columns (I would really love to have COMPRESSED attribute for the BLOB/TEXT columns which would make them transparently compressed) but you well can do it yourself by using COMPRESS/UNCOMPRESS functions or compressing/decompressing things on the client.

This choice of server side vs client size compression depends on the queries you have. If you just retrieve these BLOB/TEXT columns for say displaying it in application compressing on application is typically better idea because you will save CPU on database box and move it to the Web box which is typically easier to scale, plus you save on network traffic a bit, though this is not generally the problem.

Compression on the server gives you ability to do some server side filtering (which however would not be overly fast as each row will need to be decompressed) it also makes data easier to work with with console MySQL tools.

Some people get best of both worlds by taking COMPRESS/UNCOMPRESS functions from the MySQL code and having them work on the application side as well which gives you best of both approaches.

One thing you should aware with compression – result of COMPRESS function is always binary, which means even if you’re compressing data from TEXT column you should pick BLOB type for compressed data storage.

Before going ahead with compression I usually run some checks to see how much compression will benefit you.
SELECT AVG(LENGTH(body)) FROM data vs SELECT AVG(LENGTH(COMPRESS(body))) FROM data gives good data point. However it may not present you with true data size and performance gains, especially for Innodb tables.

With Innodb tables BLOB gets its own page (allocated outside of clustered index) if the whole row does not fit in the page (longer than about 8000 bytes in size). The whole page is allocated even if you’re just a bit short plus BLOB retrieval unless it is in cache will require an extra disk seek. Also remember each BLOB which does not fit in the main row page is given its own page or set of pages potentially requiring a lot of disk seeks for row retrieval and a lot of space wasted.

So compressing BLOB/TEXT you should also think about how many rows you will be able to shrink down so they fit back to the page or at least reduce number of BLOBs on the row which need to be stored externally.

As an example in one case I’ve seen the compression showing row sizes will reduce about 3 times after compression (some XML data) while in practice table size reduced over 5 times and performance gains were even better because much fewer rows required external storage now.

The BLOB/TEXT column compression is not for all workloads. When your database is small enough to mostly fit in the cache it can cause CPU waste if compressing on database size with almost no gains. Small BLOBs is another case when compression may not give any gains and sometimes even increase the data size rather than reducing it. However it works great for many standard Web application needs – storing blog posts, product descriptions, forum posts (if they are generally long enough) and other data sets.

About Peter Zaitsev

Peter managed the High Performance Group within MySQL until 2006, when he founded Percona. Peter has a Master's Degree in Computer Science and is an expert in database kernels, computer hardware, and application scaling.

Comments

  1. We’re using compression in our DB and mostly happy with it.

    The biggest problem we’ve seen is that the mysql compress/uncompress functions use zlib format but don’t STORE the data in zlib format.

    This is just insane.

    This means you can’t take zlib data compressed in your client by the reference C zlib impl and uncompress it on MySQL.

    What’s worse is that there appears to be no reason why the data needs to be mangled. They strip out the header information to save a few bytes per compressed blob.

    I’d rather have compatibility thank you.

    I’ve been meaning on putting together a patch but just haven’t had time.

    I think we ended up going with gzip instead of zlib. Gzip is a bit more portable and zlib supports it as well. The only problem is that mysql ships without gzip support due to patent issues.

  2. peter says:

    Kevin,

    Patching MySQL is one option I mentioned another one which is just getting Application to use MySQL compression format, which is often more preferable. A lot of people are more open to patching their web apps than MySQL because in case of issues risks of crashing database are much more serious.

    Though I agree it would be good to get standard compression functions at least as addition.

  3. Hi Peter, Kevin

    Would it work to implement this using UDFs? With “would it work” I mean, would this be considered a serious solution for production purposes?

  4. Is there any documentation on using the MySQL compression format? I haven’t seen any. I was going to grok the source to see what was going on but couldn’t find the time to finish documenting anything.

    It looks like it’s just zlib with a mangled header. If someone has documentation on how to mangle the header from zlib I’ll OSS our java/zlib code to update the header information.

    Kevin

  5. peter says:

    Kevin,

    I do not know if you’re using perl/php/python it is easy enough to wrap around stuff over MySQL functions. For Java or C# you would like docs of course.

    Roland,

    I think having UDFs which do compression to match compression functions in application programming language would be rather usable.

  6. Java…. it’s easy to DO from Java but I need the zlib output to be compatible with MySQL. This same issue exists with perl/php/python.

  7. peter says:

    Well if it is not the problem go ahead and “copy paste” COMPRESS implementation from MySQL source :)

  8. Kevin,

    FYI, after reading this blog entry today, I was actually just compiling MySQL 5.0.51 on a new machine and look what message popped up.

    checking for zlib compression library… system-wide zlib not found, using one bundled with MySQL

    Interesting isn’t it?

  9. jim says:

    the COMPRESS() function uses the underlying zlib compress() function, but prepends four bytes with the original length of the data & 0x3FFFFFFF using the int4store() macro. for UNCOMPRESS(), it passes the compressed data after the first four bytes to the zlib uncompress() function.

  10. Hey Jim.

    Thanks for the clarification. How can compress() use a prefix of only 4 bytes since the compressed data could in theory be up to 2^32?

    The uncompress makes sense.

    What’s the reason for the change in format? So that LENGTH() works?

  11. Sergei Golubchik says:

    To be able to allocate buffers before uncompressing. And for UNCOMPRESSED_LENGTH() function.

    Besides, the manual

    http://dev.mysql.com/doc/refman/5.0/en/encryption-functions.html#function-compress

    says “The compressed string contents are stored the following way:
    * Empty strings are stored as empty strings.
    * Non-empty strings are stored as a four-byte length of the uncompressed string (low byte first), followed by the compressed string. If the string ends with space, an extra ‘.’ character is added to avoid problems with endspace trimming should the result be stored in a CHAR or VARCHAR column. (Use of CHAR or VARCHAR to store compressed strings is not recommended. It is better to use a BLOB column instead.)”

    Looks pretty clear to me, no need to guess or read the code.

  12. Sergei Golubchik says:

    Dug up old emails…

    ============
    Vva (the implementor):

    3. As I understand, Zlib doesn’t provide function for calculation of
    size of packet after uncomppress..
    ============
    Peter:

    I’ve used this library (also LZO/BZIP2) for my own application and I
    used a bit different approach – storing original size in the header.
    This allows to avoid uncompress in the loop which is really bad
    performance wise.
    Thus I did not needed compatibility with existing data, which we likely
    need in our case.
    ============
    Monty:

    No, we don’t need to be compatible in this case. We can define that
    our compressed string starts with a 4 byte length header + the
    compressed data. Anyone that needs to access the compress data can
    simply use RIGHT().

    The 2 highest bits should be masked when reading the data as we
    want to use this to store the compress method used.
    (This will limit compressed data to 1 G, but this is ok for now)
    ============

    At the moment, there’s only one “compress method used”, so two highest bits are always 0.

  13. peter says:

    Thanks Sergei for looking into it. So there is some extra data conversion with “.” added for data ending with space besides the size value. Anyway this should not make it hard to implement compatible compression for other languages so you can uncompress on the application when desired.

  14. Agreed. This is valuable information. I’m probably going to write an implementation for Java.

    Better yet…. it would be nice if the JDBC driver had a (de)compress BLOBs on the client option that was compatible with the internal MySQL compress/uncompress functions.

    I think I might just submit a patch for that when I get around to it.

    Kevin

  15. zlib says:

    conversely, il you happen to store zlib-compressed data in a blob and want to be able to decompress for a quick check, you can

    select uncompress(concat(x’00000100′,myblob)) from mytable where id=2121;

    (the value must be chosen so that it’s not too big but still can contain the output buffer)

  16. Deepesh Pathak says:

    I have test the compress function with my exiting table. i found that the rows column that have size 127B becomes 167B. Why this so. I am also confuse when we have to use compress/uncompress function of mysql and in which kind of application as space never matter but it really increases the performance??

Speak Your Mind

*