
One of our Remote DBA service clients recently had an issue with size on disk for a particular table; in short this table was some 25 million rows of application audit data with an on disk size of 345GB recorded solely for the purposes of debugging which may or may not occur.
Faced with the task of both reducing the size on disk and ensuring that the data is always available at all times (firmly ruling out off device archive), we began to explore the available options for the client.
We settled on the following options for further testing.
Tooling
The tooling phase consisted of producing scripts to generate representative payloads for comparison, in this case this would be some IMDB XML data sliced to sit within the VARCHAR(1000) fields per client requirements.
First off the schema tables.
|
1 |
CREATE TABLE `archive_table` (<br> `id` int(11) NOT NULL AUTO_INCREMENT,<br> `dat` varchar(1000) DEFAULT NULL,<br> PRIMARY KEY (`id`)<br>) ENGINE=ARCHIVE DEFAULT CHARSET=latin1;<br>CREATE TABLE `compressed_data` (<br> `id` int(11) NOT NULL AUTO_INCREMENT,<br> `dat` blob,<br> PRIMARY KEY (`id`)<br>) ENGINE=InnoDB DEFAULT CHARSET=latin1;<br>CREATE TABLE `compressed_row` (<br> `id` int(11) NOT NULL AUTO_INCREMENT,<br> `dat` varchar(1000) DEFAULT NULL,<br> PRIMARY KEY (`id`)<br>) ENGINE=InnoDB DEFAULT CHARSET=latin1 ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=4;<br>CREATE TABLE `non_compressed_data` (<br> `id` int(11) NOT NULL AUTO_INCREMENT,<br> `dat` varchar(1000) DEFAULT NULL,<br> PRIMARY KEY (`id`)<br>) ENGINE=InnoDB DEFAULT CHARSET=latin1; |
As can be seen we have the following:
Now we needed to generate the ‘payload’ data for this I quickly authored a bash script for the task to provide an approximate 1M rows per table.
Scripts are available at GitHub
The script adventures_in_archiving_load_schema.sh will first drop and create the schema adventures_in_archiving.
It will then attempt to regenerate the payload file adventures_in_archiving_data.sql (4GB) if it does not exist in the current folder.
Subsequently loading the payload into the adventures_in_archiving schema, this is by no means a fast process however, taking some 4 minutes for data generation and a subsequent 63 minutes for import.
Results
| Size on disk | Engine | Compression Ratio |
| 7.1M | Archive | 173.07:1 |
| 304M | InnoDB row_format=compressed key_block_size=4 | 4.04:1 |
| 648M | compress() | 1.897:1 |
| 1.2G | InnoDB | 1:1 |
Immediately Archive seemed like the obvious choice there are however bugs to be taken into consideration..
Due diligence
What follows is a list reduced to the most likely to affect production deployments.
Summary
Archive in this instance is the solution to the issue of disk storage however you must always take into account the caveats and bugs related to a storage engine before going ahead and using it. Also you need to be aware that Archive does not have good performance.
e.g.
|
1 |
mysql> select id from archive_table order by id desc limit 1;<br>+---------+<br>| id |<br>+---------+<br>| 1143883 |<br>+---------+<br>1 row in set (2.04 sec)<br>mysql> select id from compressed_row order by id desc limit 1;<br>+---------+<br>| id |<br>+---------+<br>| 1143883 |<br>+---------+<br>1 row in set (0.00 sec) |
Also you must flush the buffer for example by running a select on the table before inserts are written to disk.
Followup
A few alternative suggestions have been made, as such there will be a follow up post taking a quick look at the size on disk for TokuDB and Infobright, soon.