November 22, 2014

Blob Storage in Innodb

I’m running in this misconception second time in a week or so, so it is time to blog about it.
How blobs are stored in Innodb ? This depends on 3 factors. Blob size; Full row size and Innodb row format.

But before we look into how BLOBs are really stored lets see what misconception is about. A lot of people seems to think for standard (“Antelope”) format first 768 bytes are stored in the row itself while rest is stored in external pages, which would make such blobs really bad. I even seen a solution to store several smaller blobs or varchar fields which are when concatenated to get the real data. This is not exactly what happens :)

With COMPACT and REDUNDANT row formats (used in before Innodb plugin and named “Antelope” in Innodb Plugin and XtraDB) Innodb would try to fit the whole row onto Innodb page. At least 2 rows have to fit to each page plus some page data, which makes the limit about 8000 bytes. If row fits completely Innodb will store it on the page and not use external blob storage pages. For example 7KB blob can be stored on the page. However if row does not fit on the page, for example containing two 7KB blobs Innodb will have to pick some of them and store them in external blob pages. It however will keep at least 768 bytes from each of the BLOBs on the row page itself. With two of 7KB blobs we will have one blob stored on the page completely while another will have 768 bytes stored on the row page and the remainder at external page.

Such decision to store first 768 bytes of the BLOB may look strange, especially as MySQL internally has no optimizations to read portions of the blob – it is either read completely or not at all, so the 768 bytes on the row page is a little use – if BLOB is accessed external page will always have to be read. This decision seems to be rooted in desire to keep code simple while implementing initial BLOB support for Innodb – BLOB can have prefix index and it was easier to implement index BLOBs if their prefix is always stored on the row page.

This decision also causes strange data storage “bugs” – you can store 200K BLOB easily, however you can’t store 20 of 10K blobs. Why ? Because each of them will try to store 768 bytes on the row page itself and it will not fit.

Another thing to beware with Innodb BLOB storage is the fact external blob pages are not shared among the blobs. Each blob, even if it has 1 byte which does not fit on the page will have its own 16K allocated. This can be pretty inefficient so I’d recommend avoiding multiple large blobs per row when possible. Much better decision in many cases could be combine data in the single large Blob (and potentially compress it)

If all columns do not fit to the page completely Innodb will automatically chose some of them to be on the page and some stored externally. This is not clearly documented neither can be hinted or seen. Furthermore depending on column sizes it may vary for different rows. I wish Innodb would have some way to tune it allowing me to force actively read columns for inline store while push some others to external storage. May be one day we’ll come to implementing this in XtraDB :)

So BLOB storage was not very efficient in REDUNDANT (MySQL 4.1 and below) and COMPACT (MySQL 5.0 and above) format and the fix comes with Innodb Plugin in “Barracuda” format and ROW_FORMAT=DYNAMIC. In this format Innodb stores either whole blob on the row page or only 20 bytes BLOB pointer giving preference to smaller columns to be stored on the page, which is reasonable as you can store more of them. BLOBs can have prefix index but this no more requires column prefix to be stored on the page – you can build prefix indexes on blobs which are often stored outside the page.

COMPRESSED row format is similar to DYNAMIC when it comes to handling blobs and will use the same strategy storing BLOBs completely off page. It however will always compress blobs which do not fit to the row page, even if KEY_BLOCK_SIZE is not specified and compression for normal data and index pages is not enabled.

If you’re interested to learn more about Innodb row format check out this page in Innodb docs:

It is worth to note I use BLOB here in a very general term. From storage prospective BLOB, TEXT as well as long VARCHAR are handled same way by Innodb. This is why Innodb manual calls it “long columns” rather than BLOBs.

About Peter Zaitsev

Peter managed the High Performance Group within MySQL until 2006, when he founded Percona. Peter has a Master's Degree in Computer Science and is an expert in database kernels, computer hardware, and application scaling.

Comments

  1. Mohamad says:

    @Peter, at which point does VARCHAR become LONG VARCHAR, and this gets treated the same as TEXT and BLOB? Is there a character threshold?

  2. Hi Peter,

    Thank you! I fell into the same misconceptions you described.
    With regard to COMPRESSED format: I’ve noticed no (or unnoticed) reduction in optimized table size for very large tables with long text columns when using KEY_BLOCK_SIZE=16.
    You say only the text that won’t fit into the row page is compressed?
    How is the text compressed? (any known algorithm?) Is the compression done per-text value, or does it build a dictionary by utilizing more than one value?

  3. Vince Gatto says:

    Peter,

    For applications that are very blob heavy, do most people seem to stick with MySQL? What are some of the popular alternatives you see when you’re out in the field?

  4. Also note for very large blogs, as far as I recall, innodb allocates memory for the entire blob to be returned to the user. So imagine a blob that is 1 GB in size and you have 25 conncurrent requests for such a sized blob, innodb will need to allocate 25 GB of memory to satisfy this query. See: http://mysqlinsights.blogspot.com/2009/01/mysql-blobs-and-memory-allocation.html

    The people at pbxt have created a solution for their storage engine using a streaming blob protocol, but this requires a change to mysql client as far as i know: http://www.blobstreaming.org/

  5. peter says:

    Shlomi,

    As I understand it is only TAIL which gets compressed. So if you have 10K blob (does not fit into the page) which can be compressed to 4K (will fit in the page) it will still use external page – 768 out of 10K will be stored on the page and when remaining 9K will be compressed to say 3K and stored on the 16K external page. Whenever 3K or 9K is used on 16K page does not really make a lot of difference.

    When it makes the difference is even larger blobs which can use less external pages. When compressed. Check this out:

    mysql> create table comptest(b mediumblob);
    Query OK, 0 rows affected (0.05 sec)
    mysql> insert into comptest values (repeat(‘a’,1000000));
    Query OK, 1 row affected (0.02 sec)

    mysql> show table status like “comptest” \G
    *************************** 1. row ***************************
    Name: comptest
    Engine: InnoDB
    Version: 10
    Row_format: Compact
    Rows: 1
    Avg_row_length: 1589248
    Data_length: 1589248
    Max_data_length: 0
    Index_length: 0
    Data_free: 0
    Auto_increment: NULL
    Create_time: 2010-02-10 09:20:46
    Update_time: NULL
    Check_time: NULL
    Collation: latin1_swedish_ci
    Checksum: NULL
    Create_options:
    Comment:
    1 row in set (0.00 sec)

    mysql> alter table comptest row_format=compressed;
    Query OK, 1 row affected (0.62 sec)
    Records: 1 Duplicates: 0 Warnings: 0

    mysql> show table status like “comptest” \G
    *************************** 1. row ***************************
    Name: comptest
    Engine: InnoDB
    Version: 10
    Row_format: Compressed
    Rows: 1
    Avg_row_length: 32768
    Data_length: 32768
    Max_data_length: 0
    Index_length: 0
    Data_free: 0
    Auto_increment: NULL
    Create_time: 2010-02-10 09:21:08
    Update_time: NULL
    Check_time: NULL
    Collation: latin1_swedish_ci
    Checksum: NULL
    Create_options: row_format=COMPRESSED
    Comment:
    1 row in set (0.00 sec)

    In this this is highly compressible long blob gets some 50 times smaller :)

    Also note this less than 1MB blob takes about 1.5M this is because space allocation for BLOB happens as well as for other innodb segments – allocation happens in 16K pages for a first few pages and when space is allocated in 1MB blobs. Which means “waste” can be a lot more than 16K.

    This is actually a strange decision for blobs as Blobs never grow and could be allocated to exact size. If Blob changes it is stored in a new space.

  6. peter says:

    Vince,

    Many people use MySQL for BLOB/TEXT fields to keep them accessible together with other data as well as to get MySQL backup and recovery, replication, ACID guarantees etc.

    If this is not needed using file storage can be better choice (especially when blobs are really files such as images, pdf documents etc). You can also use various noSQL storage solutions. BlobStreaming with PBXT is another interesting solution.

  7. peter says:

    Ivan,

    Right MySQL Allocates memory for a full blobs furthermore there may be several copies of blobs allocation and conversion. I would not store 1GB blobs in MySQL. I think some 64M is the practical top size limit while if you’re using more than 500K blobs I would really think whenever you’re making a right choice.

  8. Weird. This post does not show up on planetmysql.

  9. Tony says:

    I seem to be missing the point?? WHY, WHY, WHY would you save images in your DB when saving the link/URL to a simple varchar field and letting the OS do the work on the image directory is not MUCH more efficient?? BLOB, really?

  10. peter says:

    Tony,

    You’re right. Storing images in MySQL is in most cases not a good idea being it MyISAM or Innodb storage engine. I’m not sure how this post made you feel it is being recommended.

  11. paul says:

    Hi Peter,

    re: “If all columns do not fit to the page completely Innodb will automatically chose some of them to be on the page and some stored externally. This is not clearly documented neither can be hinted or seen.”

    Last year as a result of docs Bug#29042 (http://bugs.mysql.com/bug.php?id=29042) we did identify that the manual indeed implied some of the misconceptions to which you allude. We tried to make some improvements in the discussion. See:

    http://dev.mysql.com/doc/refman/5.1/en/innodb-file-space.html:

    “The maximum row length, except for variable-length columns (VARBINARY, VARCHAR, BLOB and TEXT), is slightly less than half of a database page. That is, the maximum row length is about 8000 bytes. LONGBLOB and LONGTEXT columns must be less than 4GB, and the total row length, including BLOB and TEXT columns, must be less than 4GB.

    If a row is less than half a page long, all of it is stored locally within the page. If it exceeds half a page, variable-length columns are chosen for external off-page storage until the row fits within half a page. For a column chosen for off-page storage, InnoDB stores the first 768 bytes locally in the row, and the rest externally into overflow pages. Each such column has its own list of overflow pages. The 768-byte prefix is accompanied by a 20-byte value that stores the true length of the column and points into the overflow list where the rest of the value is stored.”

    That doesn’t address newer storage formats provided by InnoDB Plugin, of source. Integrating Plugin information into the manual is a longer-term project still in progress.

    Very interesting article, thanks for writing it.

  12. Dani says:

    @Tony — Why store blobs in DB rather as opposed to plain files in a filesystem?

    I agree with you in general, but there are applications where storing tons of files in a filesystem is bad.

    I ran into problems having many thousands (millions?) of files, where I ran out of inodes in the filesystem I was using. Besides disk size, I did not want to have to worry about keeping track of inodes nor extra options to mkfs to increase inode counts whenever deploying a new server or increasing disk or logical volume size.
    If you split things into multiple subdirectories then you also have to ensure your application(s) maintain those directories — create new, delete old when empty, etc.
    Also doing an ‘ls’ on a huge directory will tie things up on your server.

    The more efficient thing to do is:
    A) create 1 table that has ONLY an id sequence column set as primary index, and 2nd column for the blob. This table could conceivably be in a different database if you so choose.

    B) All other metadata about this blob must be held in a separate table, and the id column used to associate rows (1-to-1 relationship).

    C) Do your hefty queries (with joins, complex where clauses, etc) ONLY against the metadata table, and when you actually need the data from the blob query just that single row from your blob table.

    What you achieve:
    1) A single large file in the filesystem for that 1 InnoDB table that holds the blob. Queries against this table will use the primary index, and will only pull from the actual table what you need.
    If disk space usage is important, use row compression and pay the CPU price on your DB server, or have your application compress/decompress upon INSERT/SELECT to pay that price on your app. server.

    2) A well-designed meta-data table will contain everything you need to know about that blob, and you could design things to even contain a field that is, say, the first 50 bytes of your blob — if you need that for indexing or searching.

    If you think you need multiple blob columns, then as Peter suggested, consider combining them into 1 blob and on the application side splitting up the blobs. That might not so easy either, so maybe use multiple “blob-only” tables that have just a few blob columns each — less than 10 he said was problematic.

  13. Tobias says:

    Thanks Peter,

    finally a clear statement on storing images as blobs. :-)
    I have been looking for this single-lined answer for quite a long time now…

    Tobias :-)

  14. Rolfen says:

    Another reason for using BLOBs vs files is, what I believe to be called “referencial integrity”, ie, when you delete a parent row, the data gets delete too (through foreign key cascading, or directly if the BLOB is stored in the same row). You don’t need to worry about keeping track of and deleting files in the filesystem.

Speak Your Mind

*