October 31, 2014

Effect of a 4096-byte InnoDB log block size

In my post "MySQL 5.5.8 and Percona Server: being adaptive" I mentioned that I used innodb-log-block-size=4096 in Percona Server to get better throughput, but later Dimitri, in his article "MySQL Performance: Analyzing Percona's TPCC-like Workload on MySQL 5.5", expressed doubt that it really makes sense. Here is a quote from his article:

“Question: what is a potential impact on buffered 7MB/sec writes if we’ll use 4K or 512 bytes block size to write to the buffer?.. ;-))
There will be near no or no impact at all as all writes are managed by the filesystem, and filesystem will use its own block size.. – Of course the things may change if "innodb_flush_log_at_trx_commit=1" will be used, but it was not a case for the presented tests.."

Well, sure, you do not need to believe me; you should demand real numbers. So I have numbers to show you.

I took a Dell PowerEdge R900 server with 32GB of RAM and a FusionIO 320GB MLC card, and ran the tpcc-mysql benchmark with 500 warehouses using Percona Server 5.5.8.

Here is the relevant part of the config I used.
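(The original config snippet is not preserved in this copy. Below is a minimal sketch of the settings the post and comments refer to; only innodb_flush_log_at_trx_commit=2 and the log block size are confirmed by the text, and the other values are illustrative assumptions.)

    [mysqld]
    # Confirmed by the post and comments:
    innodb_flush_log_at_trx_commit = 2
    innodb_log_block_size          = 4096   # 512 (the default) in the first run
    # Illustrative assumptions only, not from the original post:
    innodb_buffer_pool_size        = 26G
    innodb_log_file_size           = 2G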

I made two runs, one with the default innodb-log-block-size (512 bytes), and another with --innodb-log-block-size=4096. The full benchmark command is shown below.
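(The exact command line is likewise not preserved; the following is a hypothetical tpcc-mysql invocation for illustration. The 500-warehouse count comes from the post; the host, credentials, connection count, warm-up and run time are assumptions.)

    # Hypothetical illustration only; the actual command is not preserved
    ./tpcc_start -h localhost -P 3306 -d tpcc500 -u root -p "" \
                 -w 500 -c 32 -r 300 -l 3600
    # For the second run, mysqld was restarted with --innodb-log-block-size=4096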

From the graph you can see that there is quite a significant impact when we use --innodb-log-block-size=4096.

The average throughput for the last 15 minutes of the first run is 38090.66 NOTPM;
in the second run it is 49130.13 NOTPM. That is an increase of about 1.29x (49130.13 / 38090.66 ≈ 1.29), and I can't say this is "near no or no impact".

What is the cause of such a difference? I am not really sure. Apparently the FusionIO driver is sensitive to the I/O block size. I know that other SSD/flash drives prefer I/O sized as a multiple of their internal block size (which is often 4096 bytes), but I do not know whether the effect is the same on FusionIO.

Here are the CPU usage graphs (user and system) for both cases:

You can see that with the 4096-byte block size, USER and SYS CPU are utilized much better, meaning that IDLE is much lower.
Is this a contention issue in the FusionIO driver when we have 512-byte I/O? It may be.

I am also not sure what causes the strange hill on the throughput line with 512 bytes, but it is quite repeatable.
My blind guess (do not take my word for it, I have no proof) is that, again, something is going on inside the FusionIO driver,
but that is a topic for further research.

For the record, the FusionIO card information is:
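The original output is not reproduced in this copy; as Alan notes in the comments below, these details, including the formatted block size, come from the FusionIO utility:

    fio-status -a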

About Vadim Tkachenko

Vadim leads Percona's development group, which produces Percona Cloud Tools, Percona Server, Percona XtraDB Cluster and Percona XtraBackup. He is an expert in solid-state storage, and has helped many hardware and software providers succeed in the MySQL market.

Comments

  1. Zhuchao says:

    Why doesn't InnoDB support raw disk volumes for redo logs? Wouldn't it be the most efficient way to write when innodb_flush_log_at_trx_commit=1?

  2. Vadim,

    I'm curious what you would get with innodb_flush_log_at_trx_commit=1; I think that is where the question of many small aligned writes (or not) comes in. There is also the issue of potential space wasted if blocks are large.

    Also, in this case the blocks are synced, with no O_DIRECT for the logs, right?

  3. nac says:

    Vadim –

    Fusion-io can be provisioned internally with a 512-byte sector size. The card has no fixed "internal sector size", unlike most SSDs. When you run fio-format, you can say whether you want 512 bytes or 4K. Generally, customers use a 512-byte sector size on the ioDrive for database logs, but 4K block sizes for data files. That is the optimal trade-off, since it avoids wasting log page space.

    If you are testing Fusion-io with 4K I/Os, it's best to fio-format to a 4K sector size, and the driver will adjust its data structures accordingly. Otherwise, format to a 512-byte sector size for logs.

    N

  4. Vadim says:

    nac,

    Thank you for explanation.
    What if we put both logs and data on the same card?
    Logs are 512-byte records and data is 16K pages.

  5. nac says:

    Then you'd want to format to a 512-byte sector size. The larger I/Os will be a little less efficient, but it will work fine. I also recommend using O_DIRECT file I/O (which requires sector-size alignment).

  6. Dimitri says:

    Hi Vadim,

    while the result is here, I’m still not convinced :-))
    (sorry..)

    if I remember well, redo logs within InnoDB are not opened with the O_DIRECT option.. – so all redo log writes remain buffered by the filesystem and are then flushed by fsync() every second by InnoDB.. – meaning that whatever block size you're using, the writes are aligned to the filesystem blocks.. – so it looks to me more likely filesystem-related than SSD-related..

    but many thanks for the presented results! – there is good material for discussion here ;-))

    Rgds,
    -Dimitri

  7. Dimitri,

    It is interesting what the difference would be between innodb_flush_log_at_trx_commit=0 and =2 (which you're using) in this case. With value 2 you indeed do not sync() to disk, but you still have to write to the OS cache. If you're writing in 512-byte blocks, the code path could be somewhat different, since you do not have a full-page write; for example, if the page to be written is not present in the OS cache, you would have to read it before it is modified. Before we can conclude that FusionIO does not like 512-byte blocks in this case, we would need lower-level profiling to see what size of physical write is being issued, which is hard because the logs and data are on the same partition in this case.

    I think it is clear that the block size makes a difference, but it is not clear why.

  8. Mark Callaghan says:

    Is this an issue for SAS/SATA with a HW RAID card? Or is this only an issue for FusionIO and perhaps other flash cards?

  9. Vadim says:

    Mark,

    I have to test more; so far I see the biggest effect on FusionIO.

  10. Patrick Casey says:

    I’m a little fuzzy (seriously) on the internals of the flash devices these days, but aren’t they “whole cell” writes?

    So any write of, say, 512 bytes actually ends up reading and rewriting an entire cell of whatever the internal cell size is?

    Seems like this *might* be an artifact of testing on flash storage (but worth knowing in that case as well).

  11. nac says:

    Vadim/Dimitri, if InnoDB doesn't synchronously write log blocks upon a transaction COMMIT (or hold a number of transactions to commit together so as to fill a log block), isn't there a window for losing already-committed transactions, i.e. it's not ACID-safe? Perhaps you can explain more about how this works.

  12. Patrick Casey says:

    nac,

    I believe this is controlled by:

    innodb_flush_log_at_trx_commit

    0 = write to the log file once per second and fsync (can lose up to a second's worth of data on an OS or mysqld crash)
    1 = write to the file and fsync on each commit (no data-loss window)
    2 = write to the file on each commit but do not fsync (can lose up to a second of data on an OS crash, but no data loss on a mysqld crash)

    Depending on your requirements, the loss of up to one second's worth of data is often quite tolerable; after all, the crash could have occurred one second earlier and you'd be in exactly the same situation. For things like financial data, though, you'll want to run with innodb_flush_log_at_trx_commit=1 for full ACID compliance.

  13. nac says:

    Thanks Patrick. So a couple more questions:

    If you run #1, and the transaction is very small (say it updates only one or two columns), will the remainder of each log sector be wasted for each transaction? Or will InnoDB be smart and still gather multiple transactions together to fill the log page, then write it, sync, and then process the commits? If the latter (and it's being smart), how long will it wait to fill the log block in the event of lower concurrency?

    Thanks,

    N

  14. Patrick Casey says:

    nac,

    I'm sorry to say I don't actually know what the internals of the InnoDB log file look like, so I'm not sure if there's a lot of padding in there to pad every write up to a full log page. Note that, depending on how it's written, I don't think an "immediate" flush requires a lot of dead space in the file.

    You could, for example, have a 512 byte log page.
    First transaction writes bytes 0..99.
    Second transaction writes new page and rewrites 0..99, 100..199.
    First page goes back on the free list.
    Third transaction rewrites first page and writes 0..99, 100..199, 200..299
    Second page goes on the free list

    Note that I don't actually know what they're doing internally here, and I frankly doubt it's as complex as the page swapping I described above; I'm just pointing out that partial writes don't necessarily lead to lots of dead space in your log file, i.e. there are ways to deal with it.

    Perhaps somebody who knows more about the log file internals can chime in on how InnoDB actually does it.

  15. Dimitri says:

    @Peter: the dependence on the "filesystem page" is more realistic, but if it's 8K you'll still need to read the whole 8K if the page is not yet cached and then write 4K.. – so there is a difference, but it's unclear why..

    @nac: log writes are still safe; the only question is whether you can permit losing some transactions if your server crashes.. – however, even if you're using innodb_flush_log_at_trx_commit=1 there will be:
    1. lock
    2. write()
    3. unlock
    4. fsync()

    so the write() is not direct, and the transaction is committed after fsync(), but this single fsync() may flush several transactions whose write() happened between steps 3 and 4 – and in this case a real write operation may be quite small.. – however, with innodb_flush_log_at_trx_commit=2 (which was used during the test) fsync() is called only once per second(!), so with a 7-8MB/s write rate into the filesystem it's hard to claim that, for an SSD, 512-byte buffered writes are worse than 4K buffered writes: before there is any access to the SSD, those 7-8MB of data first land in the filesystem buffer and only then are flushed to the SSD.. – so it seems to me more FS-related than storage-related.. – well, it needs to be investigated ;-))

    Rgds,
    -Dimitri

  16. domas says:

    I’d attribute this effect to VM page size (which is usually 4k) – you don’t have to do read-modify-write pages, just pure writing (see http://dom.as/2010/11/18/logs-memory-pressure/ )

  17. Alan says:

    You can use fio-status -a to get the block size used by the fio device itself. By default, the block size used on the ioDrive2 Duos I have is 512 bytes.
