
Effect from innodb log block size 4096 bytes

January 3, 2011 | Posted In: Benchmarks, MySQL


In my post MySQL 5.5.8 and Percona Server: being adaptive I mentioned that I used innodb-log-block-size=4096 in Percona Server to get better throughput, but later Dimitri, in his article MySQL Performance: Analyzing Percona’s TPCC-like Workload on MySQL 5.5, expressed doubt that it really makes sense. Here is a quote from his article:

“Question: what is a potential impact on buffered 7MB/sec writes if we’ll use 4K or 512 bytes block size to write to the buffer?.. ;-))
There will be near no or no impact at all as all writes are managed by the filesystem, and filesystem will use its own block size.. – Of course the things may change if “innodb_flush_log_at_trx_commit=1” will be used, but it was not a case for the presented tests..”

Of course, you do not need to believe me; you should demand real numbers. So I have numbers to show you.

I took a Dell PowerEdge R900 server with 32GB of RAM and a FusionIO 320GB MLC card, and ran the tpcc-mysql benchmark with 500 warehouses using Percona Server 5.5.8.

Here is the relevant part of the config I used:
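(The config listing itself did not survive in this copy of the post. The sketch below reconstructs only what the post and comments confirm, namely the log block size and innodb_flush_log_at_trx_commit=2; every other value is an illustrative assumption, not the actual benchmark config.)

```ini
# my.cnf fragment -- a sketch, not the exact config from the benchmark.
[mysqld]
innodb_buffer_pool_size        = 26G    # assumption: most of the 32GB of RAM
innodb_log_file_size           = 2G     # assumption
innodb_flush_log_at_trx_commit = 2      # confirmed in the comments below
innodb_flush_method            = O_DIRECT   # data files; redo log writes stay buffered
innodb_log_block_size          = 4096   # 512 (the default) in the first run
```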

I made two runs: one with the default innodb-log-block-size (512 bytes), and another with --innodb-log-block-size=4096. The full benchmark command is:
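(The exact command line did not survive in this copy; only the 500-warehouse scale is confirmed by the post. Below is a sketch of a typical tpcc-mysql invocation, where the host, database name, connection count, and durations are illustrative assumptions.)

```shell
# A sketch of the tpcc-mysql run -- values other than -w 500 are assumptions.
./tpcc_start -h localhost -d tpcc500 -u root -p "" \
             -w 500 -c 32 -r 300 -l 3600
```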

From the graph you can see that there is a quite significant impact when we use --innodb-log-block-size=4096.

The average throughput for the last 15 minutes is 38090.66 NOTPM in the first run
and 49130.13 NOTPM in the second run, which is a 1.29x increase, and I can’t say this is “near no or no impact”.

What is the cause of such a difference? I am not really sure. Apparently the FusionIO driver is sensitive to IO block size. I know that other SSD/Flash drives like IO sizes that are a multiple of their internal block size (which is often 4096 bytes), but I do not know if the effect is the same as on FusionIO.
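The block-size sensitivity comes down to a simple alignment rule: an IO whose offset and size are not whole multiples of the device's logical block forces the device (or driver) to read-modify-write the surrounding block. A minimal sketch of that check (my illustration, not InnoDB or driver code):

```python
def aligned(offset: int, size: int, block: int = 4096) -> bool:
    """True if an IO of `size` bytes at `offset` falls on whole
    multiples of the device's logical block size (the usual rule
    for O_DIRECT and for avoiding read-modify-write on flash)."""
    return offset % block == 0 and size % block == 0

# A 512-byte redo log write is not aligned to a 4096-byte block, so the
# device must read-modify-write the surrounding block; a 4096-byte
# write maps cleanly onto exactly one block.
print(aligned(0, 512))      # False
print(aligned(0, 4096))     # True
print(aligned(8192, 4096))  # True
```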

Here is the CPU usage graph (user and system) for both cases:

You may see that with the 4096-byte block size, USER and SYS CPU are utilized much better, meaning that IDLE is much lower.
Is this a contention issue in the FusionIO driver when we issue 512-byte IO? It may be.

I am also not sure what causes the strange hill on the throughput line with 512 bytes, but it is quite repeatable.
My blind guess (but do not believe me, I have no proof) is that, again, something is going on inside the FusionIO driver,
but this is a topic for further research.

For the record, the FusionIO card information is:

Vadim Tkachenko

Vadim Tkachenko co-founded Percona in 2006 and serves as its Chief Technology Officer. Vadim leads Percona Labs, which focuses on technology research and performance evaluations of Percona’s and third-party products. Percona Labs designs no-gimmick tests of hardware, filesystems, storage engines, and databases that surpass the standard performance and functionality scenario benchmarks. Vadim’s expertise in LAMP performance and multi-threaded programming helps optimize MySQL and InnoDB internals to take full advantage of modern hardware. Oracle Corporation and its predecessors have incorporated Vadim’s source code patches into the mainstream MySQL and InnoDB products. He also co-authored the book High Performance MySQL: Optimization, Backups, and Replication, 3rd Edition.


  • Why doesn’t InnoDB support raw disk volumes for redo logs? It should be the most efficient way to do writes when innodb_flush_log_at_trx_commit=1.

  • Vadim,

    I’m curious what you would get with innodb_flush_log_at_trx_commit=1. I think this is where you have the question of many small aligned writes or not. There is also the issue of potential space wasted if blocks are large.

    Also, in this case the blocks are synced; no O_DIRECT here for logs, right?

  • Vadim –

    Fusion-io can internally be provisioned with a 512-byte sector size. The card has no “internal sector size,” unlike most SSDs. When you run fio-format, you can say whether you want 512 bytes or 4K. Generally, customers format the ioDrive with a 512-byte sector size for database logs, but with a 4K sector size for data files. That is the optimal trade-off to avoid wasting log page space.

    If you are testing Fusion-io with 4K IOs, it’s best to fio-format to a 4K sector size, and the driver will adjust its data structures accordingly. Otherwise, format to a 512-byte sector size for logs.


  • Then you’d want to format to a 512-byte sector size. The larger IOs will be a little less efficient, but it will work fine. I also recommend using O_DIRECT file IO (which requires sector-size alignment).

  • Hi Vadim,

    while the result is here, I’m still not convinced :-))

    If I remember well, redo logs within InnoDB are not opened with the O_DIRECT option, so all redo log writes remain buffered by the filesystem and are then flushed by fsync() every second by InnoDB. That means whatever block size you’re using, the writes are aligned to the filesystem blocks, so it looks to me more likely filesystem related rather than SSD related.

    But many thanks for the presented results! There is good material for discussion ;-))



  • Dimitri,

    It is interesting what the difference is between innodb_flush_log_at_trx_commit=0 and =2 (which you’re using) in this case. With value 2 you indeed do not fsync() to disk, but you still have to write to the OS cache. If you’re writing in 512-byte blocks the code path could be somewhat different, since you do not have a full page write; for example, if the page to be written is not present in the OS cache, you would have to read it before it is modified. Before we can conclude that FusionIO does not like 512-byte blocks in this case, we would need lower-level profiling to see what size of physical write is being issued, which is hard as logs and data are on the same partition in this case.

    I think it is clear the block size makes a difference, but it is not clear why.

  • Is this an issue for SAS/SATA with a HW RAID card? Or this only an issue for Fusion IO and perhaps other flash cards?

  • I’m a little fuzzy (seriously) on the internals of the flash devices these days, but aren’t they “whole cell” writes?

    So any write of, say, 512 bytes actually ends up reading and rewriting an entire cell of whatever the internal cell size is?

    Seems like this *might* be an artifact of testing on flash storage (but worth knowing in that case as well).

  • Vadim/Dimitri, if InnoDB doesn’t synchronously write log blocks upon a transaction COMMIT (or hold a number of transactions to commit together to fill a log block), isn’t there a window for losing already-committed transactions, i.e. it’s not ACID-safe? Perhaps you can explain more about how this works.

  • nac,

    I believe this is controlled by innodb_flush_log_at_trx_commit:

    0 = write to the file once per second and fsync (can lose up to a second’s worth of data on an OS or mysqld crash)
    1 = write to the file on each commit and fsync (no data-loss window)
    2 = write to the file on each commit but do not fsync (can lose up to a second of data on an OS crash, but no data loss on a mysqld crash)

    Depending on your requirements, the loss of up to 1 second’s worth of data is often quite tolerable … after all, the crash could have occurred 1 second earlier and you’d be in the exact same situation. For things like financial data, though, you’ll want to run with innodb_flush_log_at_trx_commit=1 for full ACID compliance.
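(Editor's note: the three cases above can be encoded as a tiny model, my sketch rather than server code, of the worst-case data-loss window per setting and crash type.)

```python
def loss_window(setting: int, crash: str) -> float:
    """Worst-case seconds of committed data lost for a given
    innodb_flush_log_at_trx_commit setting and crash type
    ('mysqld' or 'os'), per the description above."""
    if setting == 1:            # write + fsync on every commit
        return 0.0
    if setting == 2:            # write on commit, fsync once per second
        return 0.0 if crash == "mysqld" else 1.0
    if setting == 0:            # write + fsync once per second
        return 1.0
    raise ValueError("setting must be 0, 1 or 2")

print(loss_window(1, "os"))      # 0.0 -- fully durable on commit
print(loss_window(2, "mysqld"))  # 0.0 -- the OS cache survives mysqld
print(loss_window(2, "os"))      # 1.0 -- up to one second lost
print(loss_window(0, "mysqld"))  # 1.0 -- up to one second lost
```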

  • Thanks Patrick. So a couple more questions:

    If you run with =1, and the transaction is very small (say it updates only one or two columns), will the remainder of each log sector be wasted for each transaction? Or will InnoDB be smart and still gather multiple transactions together to fill the log page, then write it, sync, and then process the commits? If the latter, and it is being smart, how long will it wait to fill the log block in the event of lower concurrency?



  • nac,

    I’m sorry to say I don’t actually know what the internals of the InnoDB log file look like, so I’m not sure if there’s a lot of padding in there to pad every write up to a full log page. Note that depending on how it’s written, I don’t think an “immediate” flush requires a lot of dead space in the file.

    You could, for example, have a 512-byte log page.
    The first transaction writes bytes 0..99.
    The second transaction writes a new page, rewriting 0..99 and writing 100..199.
    The first page goes back on the free list.
    The third transaction rewrites the first page, writing 0..99, 100..199, 200..299.
    The second page goes on the free list.

    Note that I don’t actually know what they’re doing internally here, and I frankly doubt it’s as complex as the page swapping I described above; I’m just pointing out that partial writes don’t necessarily lead to lots of dead space in your log file, i.e. there are ways to deal with it.

    Perhaps somebody who knows more about the log file internals can chime in on how innodb actually does it.

  • @Peter: the dependence on the “filesystem page” is more realistic, but if it’s 8K you’ll still need to read the whole 8K if the page is not yet cached, and then write 4K, so there is a difference, but it’s unclear why..

    @nac: log writes are still safe; the only question is whether you can afford to lose some transactions in case of a crash of your server. However, even if you’re using innodb_flush_log_at_trx_commit=1, there will be:
    1. lock
    2. write()
    3. unlock
    4. fsync()

    so the write() is not direct, and the transaction is committed only after fsync(); but this single fsync() may flush several transactions whose write() happened between steps 3 and 4, and in that case a real write operation may be quite small. However, with innodb_flush_log_at_trx_commit=2 (which was used during the test), fsync() is called only once per second(!), so with 7-8MB/s of write traffic into the filesystem it’s hard to claim that, for the SSD, 512-byte buffered writes are worse than 4K buffered writes: before there is any access to the SSD, those megabytes of data first land in the filesystem buffer, and only then are they flushed to the SSD. So it seems to me more FS related rather than storage related; well, it needs to be investigated ;-))
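(Editor's note: the lock/write/unlock/fsync sequence above is why one fsync() can cover several commits. A sketch of that group-commit effect, my illustration rather than InnoDB source:)

```python
# Commits write() their redo records into the OS cache under a lock;
# a single later fsync() makes everything written so far durable.
class RedoLog:
    def __init__(self) -> None:
        self.written = 0  # bytes handed to the OS cache via write()
        self.durable = 0  # bytes guaranteed on disk after fsync()

    def commit_write(self, nbytes: int) -> None:
        # steps 1-3 above: lock, write(), unlock
        self.written += nbytes

    def fsync(self) -> None:
        # step 4: flushes *all* writes issued so far, not just the last
        self.durable = self.written

log = RedoLog()
for _ in range(3):        # three transactions commit between two fsyncs
    log.commit_write(200)
log.fsync()               # one fsync covers all three commits
print(log.durable)        # 600
```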


  • I’d attribute this effect to the VM page size (which is usually 4K): you don’t have to read-modify-write pages, just do pure writes (see )

  • You can use fio-status -a to get the block size used by the fio device itself. By default, the block size used on the ioDrive2 Duos I have is 512 bytes.

  • It looks like someone removed the innodb log block size option in 5.7. Can someone please re-add it? It is crazy to switch back to supporting only a 512-byte block size when newer devices are dropping 512-byte support altogether.

  • Mark, Percona Server 5.7 has removed the innodb_log_block_size option because it has been replaced by the MySQL 5.7 innodb_log_write_ahead_size option.

  • Laurynas, unfortunately this doesn’t seem to prevent MySQL from writing 512 byte blocks, so it fails on my NVMe drive which has a 4K minimum block size.

    I’ll give it another try to be 100% sure.

    • Please let us know the results of the second try. If it is indeed writing 512-byte blocks with a 4K innodb_log_write_ahead_size, then it’s a regression in Percona Server 5.7 against 5.6 that we should fix.

      • Sorry for the slow response here! It is still failing.

        Device is an HGST SN260, which is 4K-only:

        # blockdev --report /dev/nvme1n1
        RO RA SSZ BSZ StartSec Size Device
        rw 256 4096 4096 0 3200631791616 /dev/nvme1n1

        When I add ALL_O_DIRECT and innodb_log_write_ahead_size=4096 and let the init script create the default data set, it bombs out.

        The first step appears to work: “Setting log file ./ib_logfile101 size to 8192 MB” – it creates an 8GB-sized file.

        However, it gives this error immediately after:

        Aug 15 11:06:40 dplsm2u-4 mysqld: 2017-08-15T17:06:40.078706Z 0 [ERROR] InnoDB: Write to file ./ib_logfile101 failed at offset 0, 512 bytes should have been written, only 0 were written. Operating system error number 22. Check that your OS and file system support files of this size. Check also that the disk is not full or a disk quota exceeded.

        Thank you

        • Please file this on – if possible, please provide an strace log of the failing server instance
