Where the open source community meets: Secure your spot for Percona Live Amsterdam! - Register

Downloads

Blog

Effect from innodb log block size 4096 bytes

January 3, 2011

Author

Vadim Tkachenko

Benchmarks

MySQL

Share this Post:

In my post MySQL 5.5.8 and Percona Server: being adaptive I mentioned that I used innodb-log-block-size=4096 in Percona Server to get better throughput, but later Dimitri in his article MySQL Performance: Analyzing Percona’s TPCC-like Workload on MySQL 5.5 sounded doubt that it really makes sense. Here us quote from his article:

“Question: what is a potential impact on buffered 7MB/sec writes if we’ll use 4K or 512 bytes block size to write to the buffer?.. ;-))
There will be near no or no impact at all as all writes are managed by the filesystem, and filesystem will use its own block size.. – Of course the things may change if “innodb_flush_log_at_trx_commit=1″ will be used, but it was not a case for the presented tests..”

Well, sure you do not need to believe me, you should demand for real numbers. So I have number to show you.

I took Dell PowerEdge R900 server with 32GB of RAM and with FusionIO 320GB MLC card, and run tpcc-mysql benchmark with 500W using Percona Server 5.5.8.

Here is relevant part of config what I used

innodb_buffer_pool_size=26G
innodb_data_file_path=ibdata1:10M:autoextend
innodb_file_per_table=1
innodb_flush_log_at_trx_commit=2
innodb_log_buffer_size=8M
innodb_log_files_in_group=2

innodb_log_file_size=4G

innodb_adaptive_checkpoint=keep_average

innodb_thread_concurrency=0
innodb_flush_method             = O_DIRECT

innodb_read_ahead = none
innodb_flush_neighbor_pages = 0

innodb_write_io_threads=16
innodb_read_io_threads=16
innodb_io_capacity=2000

innodb_buffer_pool_size=26G

innodb_data_file_path=ibdata1:10M:autoextend

innodb_file_per_table=1

innodb_flush_log_at_trx_commit=2

innodb_log_buffer_size=8M

innodb_log_files_in_group=2

innodb_log_file_size=4G

innodb_adaptive_checkpoint=keep_average

innodb_thread_concurrency=0

innodb_flush_method = O_DIRECT

innodb_read_ahead = none

innodb_flush_neighbor_pages = 0

innodb_write_io_threads=16

innodb_read_io_threads=16

innodb_io_capacity=2000

I made two runs, one with default innodb-log-block-size ( 512 bytes), and another with –innodb-log-block-size=4096. Full benchmark command is

tpcc_start localhost tpcc500 root "" 500 24 10 3600

1	tpcc_start localhost tpcc500 root "" 500 24 10 3600

From graph you can actually see, that there is quite significant impact when we use –innodb-log-block-size=4096.

The average throughput for last 15 mins in first run is 38090.66 NOTPM,
in second run it is 49130.13 NOTPM, that is increase is 1.28x, and I can’t say this is “near no or no impact”.

What is the cause of such difference ? I am not really sure. Apparently FusionIO driver is sensitive to IO block size. And I know that other SSD/Flash drives like to have IO multiplied to their internal block size (which is often 4096 bytes), but I do not know if the effect is the same as on FusionIO.

I put CPU usage graph ( user and system) for both cases:

You may see with 4096 block size USER and SYS CPU is utilized much better, meaning that IDLE is much lower.
Is this contention issue in FusionIO driver when we have 512 bytes IO ? It may be.

Also I am not sure what is strange hill on throughput line with 512 bytes, but it is quite repeatable.
My blind guess (but do not believe me, I have no proof) is that again something is going on inside FusionIO driver,
but this is topic for another research.

For history, FusionIO card information is

Found 1 ioDrive in this system
Fusion-io driver version: 2.2.0 build 82

fct0    Attached as 'fioa' (block device)
        Fusion-io ioDrive 320GB, Product Number:FS1-002-321-CS SN:10973
        ioDIMM3, PN:00119401203, Mfr:004, Date:20091118
        Firmware v5.0.5, rev 43674

Found 1 ioDrive in this system

Fusion-io driver version: 2.2.0 build 82

fct0 Attached as 'fioa' (block device)

Fusion-io ioDrive 320GB, Product Number:FS1-002-321-CS SN:10973

ioDIMM3, PN:00119401203, Mfr:004, Date:20091118

Firmware v5.0.5, rev 43674

0 0 votes

Article Rating

25 Comments

Oldest

Newest Most Voted

Admin

Peter Zaitsev

15 years ago

Vadim,

I’m curious what you would get with innodb_flush_log_at_trx_commit=1 I think this is where you have a question of many small aligned writes or not. This is also the issue of potential space wasted if blocks are large.

Also in this case blocks are synced, no O_DIRECT here for logs, right ?

nac

15 years ago

Vadim –

Fusion-io internally can be provisioned with 512 byte sector size. Card has no “internal sector size” unlike most SSDs. When you run fio-format. you can say whether you want 512 bytes or 4K. Generally customers for database logs ioDrive use 512 byte sector size but for data files ioDrive use 4K block sizes. That is the optimal trade off to avoid having to waste log page space.

If you are testing Fusion-io with 4K IOs, it’s best to fio-format to 4K sector size and driver will adjust data structures accordingly. Otherwise format to 512 byte sector size for logs.

Author

Vadim Tkachenko

15 years ago

nac,

Thank you for explanation.
What if we put both logs and data on the same card ?
Logs are 512b records and data is 16K pages.

nac

15 years ago

Then you’d want to format to 512 byte sector size. The larger IOs will be a little less efficient, but it will work fine. Also recommend using O_DIRECT file IO (which requires sector size alignment).

Dimitri

15 years ago

Hi Vadim,

while the result is here, I’m still not convinced :-))
(sorry..)

if I remember well, redo logs within InnoDB are not opened with O_DIRECT option.. – so all redo log writes are remaining buffered by the filesystem and then flushed by fsync() every second by InnoDB.. – means whatever block size you’re using the writes are aligned to the filesystem blocks.. – so it looks for me more likely filesystem related rather SSD..

but many thanks for the presented results! – there is a good material for discussion ;-))

Rgds,
-Dimitri

Dimitri

15 years ago

Hi Vadim,

while the result is here, I’m still not convinced :-))
(sorry..)

but many thanks for the presented results! – there is a really good material for discussion ;-))

Rgds,
-Dimitri

Admin

Peter Zaitsev

15 years ago

Dimitry,

It is interesting what is the difference between innodb_flush_log_at_trx_commit=0 and =2 (you’re using) in this case. With value 2 you indeed do not sync() to the disk but you have to do writes to OS cache. If you’re writing in 512 byte blocks the code path could be somewhat different as you do not have the full page write, for example in case the page to be written to is not present in the OS cache you would have to read it before it is modified. Before we can conclude fusionIO does not like 512 byte blocks in this case we would need to take at lower level profiling and see what size of physical write is being issued, which is hard as logs and data are on the same partition in this case.

I think it is clear the block level size makes a difference but it is not clear why.

Mark Callaghan

15 years ago

Is this an issue for SAS/SATA with a HW RAID card? Or this only an issue for Fusion IO and perhaps other flash cards?

Author

Vadim Tkachenko

15 years ago

Mark,

I have to test more, so far I see biggest effect on FusionIO.

Patrick Casey

15 years ago

I’m a little fuzzy (seriously) on the internals of the flash devices these days, but aren’t they “whole cell” writes?

So any write of, say, 512 bytes actually ends up reading and rewriting an entire cell of whatever the internal cell size is?

Seems like this *might* be an artifact of testing on flash storage (but worth knowing in that case as well).

nac

15 years ago

Vadim/Dimtry, If InnoDB doesn’t synchronously write log blocks upon a transaction COMMIT (or hold a number of transactions to commit together to fill a low block), isn’t there a window for losing already-committed transactions i.e. it’s not ACID-safe? Perhaps you can explain more about how this works.

Patrick Casey

15 years ago

nac,

I believe this is controlled by:

innodb_flush_log_at_trx_commit

0 = write to the file once per second and fsync (can lose up to a seconds worth of data on os or mysqld crash)
1 = write to the file system on each commit and fsync (no data loss window)
2 = write to teh file on each commit but do not fsync (can lose up to a second of data on an os crash but no data loss on mysqld crash)

Depending on your requirements, the loss of up to 1 seconds worth of data is often quite tolerable … after all the crash could have occurred 1 second earlier and you’d be in the exact same situation. For things like financial data though, you’ll want to run with innodb_flush_log_at_trx_commit=1 for full ACID compliance.

nac

15 years ago

Thanks Patrick. So a couple more questions:

If you run #1, and the transaction is very small (say updates only one or two columns(, will the remainder of each log sector be wasted for each transaction? Or will InnoDB be smart and still gather multiple transactions together to fill the log page, then write it, sync and then process the commits? If the latter – and it’s being smart – how long will it wait to fill the log block in the event of lower concurrency?

Thanks,

Patrick Casey

15 years ago

nac,

I’m sorry to say but I don’t actually know what the internals of the innodb log file look like, so I’m not sure if there’s a lot of padding in there to pad every write up to a full log page.. Note that depending on how its written I don’t think “immediate” flush requires a lot of dead space in the file.

You could, for example, have a 512 byte log page.
First transaction writes bytes 0..99.
Second transaction writes new page and rewrites 0..99, 100..199.
First page goes back on the free list.
Third transaction rewrites first page and writes 0..99, 100..199, 200..299
Second page goes on the free list

Note that I don’t actually know what they’re doing internally here, and I frankly doubt its as complex as page swapping like I described above, just pointing out that partial writes don’t necessarily lead to lots of dead space in your log file e.g. there are ways to deal with it.

Perhaps somebody who knows more about the log file internals can chime in on how innodb actually does it.

Dimitri

15 years ago

‘@Peter: the dependence on the “filesystem page” is more realistic, but if it’s 8K you’ll still need to read whole 8K if the page is not yet cached and then write 4K.. – so there is a difference, but it’s unclear why..

@nac: log writes are still safe, the only question if you may permit to loose some transactions in case of a crash of your server.. – however, even if you’re using innodb_flush_log_at_trx_commit=1 there will be:
1. lock
2. write()
3. unlock
…
4. fsync()

so the write() is not direct, and transaction is committed after fsync(), but this single fsync() may flush several transactions which involved write() between steps 3. and 4. – and in this case a real write operation may be quite small.. – however with innodb_flush_log_at_trx_commit=2 (which was used during the test) fsync() is called only once per second(!), so having 7-8MB/s write speed from the filesystem it’s hard to claim that for SSD using 512 bytes buffered writes is worse than 4K buffered writes as before there will be any access to SSD there are first 7-MB of data came into the filesystem buffer and only then they are flushed to SSD.. – so it seems to me more FS related rather storage.. – well, need to be investigated ;-))

Rgds,
-Dimitri

domas

15 years ago

I’d attribute this effect to VM page size (which is usually 4k) – you don’t have to do read-modify-write pages, just pure writing (see http://dom.as/2010/11/18/logs-memory-pressure/ )

Zhuchao

15 years ago

Why doesn’t inborn support raw disk volume for redo logs? It should be the most efficient way to do write when flush-log-commit=1?

Zhuchao

15 years ago

Why doesn’t inborn support raw disk volume for redo logs? It should be the most efficient way to do write when flush-log-commit=1?

Alan

13 years ago

You can use fio-status -a to get the block size used by the fio device itself. By default the blocksize used on the ioDrive2 Duo’s I have is 512 bytes

Mark

9 years ago

It looks like someone removed the innodb log block size option in 5.7. Can someone please re-add it? It is crazy to switch back to only supporting a 512 byte block size when newer devices are dropping 512 byte support altogether.

Laurynas Biveinis

9 years ago

Mark, Percona Server 5.7 has removed innodb_log_block_size option because it has been replaced by MySQL 5.7 innodb_log_write_ahead_size option

Mark

9 years ago

Laurynas, unfortunately this doesn’t seem to prevent MySQL from writing 512 byte blocks, so it fails on my NVMe drive which has a 4K minimum block size.

I’ll give it another try to be 100% sure.

kastauyra

9 years ago

Reply to Mark

Please let us know the results of the second try – if it’s indeed writing 512 blocks with 4K innodb_log_write_ahead_size – then it’s a regression for Percona Server 5.7 against 5.6 that we should fix

Mark

8 years ago

Reply to kastauyra

Sorry for the slow response here! It is still failing.

Device is an HGST SN260, which is 4K-only:

# blockdev –report /dev/nvme1n1
RO RA SSZ BSZ StartSec Size Device
rw 256 4096 4096 0 3200631791616 /dev/nvme1n1

When I add ALL_O_DIRECT and innodb_log_write_ahead_size=4096 and let the init script create the default data set, it bombs out.

The first step appears to work: “Setting log file ./ib_logfile101 size to 8192 MB” – it creates an 8GB-sized file.

However, it gives this error immediately after:

Aug 15 11:06:40 dplsm2u-4 mysqld: 2017-08-15T17:06:40.078706Z 0 [ERROR] InnoDB: Write to file ./ib_logfile101failed at offset 0, 512 bytes should have been written, only 0 were written. Operating system error number 22. Check that your OS and file system support files of this size. Check also that the disk is not full or a disk quota exceeded.

Thank you