Real World CompressionTim.Callaghan
Benchmarking is a tricky thing, especially when it comes to compression. Some data compresses quite well while other data does not compress at all. Storing jpeg images in a BLOB column produces 0% compression, but storing the string “AAAAAAAAAAAAAAAAAAAA” in a VARCHAR(20) column produces extremely high (and unrealistic) compression numbers.
This week I was assisting a TokuDB customer understand the insertion performance of TokuDB versus InnoDB and MyISAM for their actual data. The table contained a single VARCHAR(50), multiple INTEGER, one SET, one DECIMAL, and a surrogate primary key. To support a varied query workload they needed 6 indexes.
Here is an obfuscated schema of the table:
col1 varchar(50) NOT NULL,
col2 int(40) NOT NULL DEFAULT '0',
col3 int(10) NOT NULL DEFAULT '0',
col4 int(10) NOT NULL DEFAULT '0',
col5 int(10) NOT NULL DEFAULT '0',
col6 set('val1', 'val2', ..., ‘val19’, 'val20',) NOT NULL DEFAULT '',
col7 int(10) DEFAULT NULL,
col8 decimal(32,0) DEFAULT NULL,
CLUSTERING KEY (col1,col2),
KEY lib_id (col2),
KEY rna_type (col6),
KEY hits (col5),
KEY max_norm (col7),
KEY norm_sum (col8)
And the on-disk file sizes after loading the data:
|Storage Engine||Size (GB)|
It’s important to note that one of the TokuDB indexes was defined as clustering, meaning that a second full copy of the table is stored. Clustering indexes are helpful in that they always satisfy a query when used; there is no need to retrieve non-indexed column values from the primary key index. Clustering indexes can make queries significantly faster. MyISAM and InnoDB do not support clustering secondary indexes.
TokuDB achieved 10.8x compression versus InnoDB. This is in line with other tests on compression and performance that we have demonstrated vs. InnoDB (see http://www.tokutek.com/2012/04/tokudb-v6-0-even-better-compression).
The benefits of high compression are much more than buying smaller disks (or SSDs). Disk IO (reads and writes) is 10.8x more efficient, plus backups that employ file system snapshots are 10.8x smaller.