Where the open source community meets: Secure your spot for Percona Live Amsterdam! - Register

Downloads

Blog

Analyzing the distribution of InnoDB log file writes

July 17, 2010

Author

Baron Schwartz

MySQL

Percona Software

Share this Post:

I recently did a quick analysis of the distribution of writes to InnoDB’s log files. On a high-traffic commodity MySQL server running Percona XtraDB for a gaming workload (mostly inserts to the “moves” table), I used strace to gather statistics about how the log file writes are distributed in terms of write size. InnoDB writes to the log files in multiples of 512 bytes. Mark Callaghan explained this and some of its performance implications here. How many big writes does InnoDB do, and how many small writes?

First, I found out the file descriptor numbers of the log files:

# lsof -p $(pidof mysqld) | grep ib_log
mysqld  29772 mysql    8uW  REG                8,2   268435456   7143989 /var/lib/mysql/ib_logfile0
mysqld  29772 mysql    9uW  REG                8,2   268435456   7143993 /var/lib/mysql/ib_logfile1

# lsof -p $(pidof mysqld) | grep ib_log

mysqld 29772 mysql 8uW REG 8,2 268435456 7143989 /var/lib/mysql/ib_logfile0

mysqld 29772 mysql 9uW REG 8,2 268435456 7143993 /var/lib/mysql/ib_logfile1

The file descriptors are 8 and 9. We’ll need to capture writes to both of those; InnoDB round-robins through them. The following grabs the write sizes out of 100k calls to pwrite() and aggregates them:

# strace -f -p $(pidof mysqld) -e pwrite -s1 -xx 2>&1 
   | grep 'pwrite([89],' |head -n 100000 
   | awk '{writes[$5]++}END{for(w in writes){print w, " ", writes[w]}}'

# strace -f -p $(pidof mysqld) -e pwrite -s1 -xx 2>&1

| grep 'pwrite([89],' |head -n 100000

| awk '{writes[$5]++}END{for(w in writes){print w, " ", writes[w]}}'

I could have done it better with a little more shell scripting, but the output from this was enough for me to massage into a decent format with another step or two! Here is the final result:

bytes	count
512	44067
1024	30740
1536	15221
2048	7094
2560	1810
3072	570
3584	219
4096	112
4608	39
5120	23
5632	16
6144	15
6656	5
7168	3
7680	8
8192	2
8704	2
9216	1
9728	2
10240	1
10752	2
11264	1
11776	1
14848	1
15360	1
15872	2
16384	4
16896	4
17408	2
17920	2
18432	2
18944	8
19456	7
19968	4
20480	4
21504	1
22016	2
24064	1
40960	1

bytes count

512 44067

1024 30740

1536 15221

2048 7094

2560 1810

3072 570

3584 219

4096 112

4608 39

5120 23

5632 16

6144 15

6656 5

7168 3

7680 8

8192 2

8704 2

9216 1

9728 2

10240 1

10752 2

11264 1

11776 1

14848 1

15360 1

15872 2

16384 4

16896 4

17408 2

17920 2

18432 2

18944 8

19456 7

19968 4

20480 4

21504 1

22016 2

24064 1

40960 1

So, in sum, we see that about 3/4ths of InnoDB log file writes on this workload are 512 or 1024 bytes. (It might vary for other workloads.) Now, what does this actually mean? There are a lot of interesting and complex things to think about here and research further:

- Most writes are less than 4k, but the operating system page size is 4k. If the page isn’t in the OS buffer cache, then the write will cause a read-around write — the page will have to be read, modified, and then written again, instead of just written. Vadim did some benchmarks earlier showing that the log files need to be in the OS cache for best performance (sorry, can’t find the link right now). This makes intuitive sense given that the writes are smaller than 4k.

- This server has innodb_flush_log_at_trx_commit set to 2. This means each transaction does a log file write, the same as for a setting of 1. If it’s set to 0, the distribution may be quite different — the writes might accumulate and become much larger.

- What happens if the log buffer is actually smaller than a transaction? That’s another topic for research. I don’t know off the top of my head.

- Is the distribution meaningful for determining the necessary log buffer size? The largest write above is just under 40k, which might seem to indicate that only 64k or so of log buffer would be large enough — but is it really? We’d need to benchmark and see. Peter and I were just talking, trying to remember some complex behavior of how space in the buffer is reserved for writes. We could not quite recall what that is, but there is some subtlety. This needs more research. It’s possible that even if the largest write is quite small, a small log buffer size wouldn’t be good and could cause additional lock contention. (There is a single mutex around the log buffer.) We probably need to benchmark to learn more about this.

In the end, the distribution is a simple observation to make, but the InnoDB redo log system is intricate. It’s not something to guess about, but rather something to measure and study more deeply. Perhaps we can follow this up with some more benchmarks or observations under different InnoDB settings and different workloads. Or then again, maybe Yasufumi will read this post when he returns from vacation and already know all the answers by heart!

0 0 votes

Article Rating

5 Comments

Oldest

Newest Most Voted

Rob Wultsch

16 years ago

IIRC the current facebook patch set includes the ability to set innodb log writes to O_DIRECT. It seem like the flushing behavior of log writes should follow the behavior set by innodb_flush_method.

Pierre Beck

16 years ago

Given the sequential nature of logs, most of the writes should be merged by the kernel. Possibly many merges in both innodb_flush_log_at_trx_commit mode 0 and 2. In mode 1, merges are very unlikely. Could you add merge counts to the benchmark? e.g. iostat -x -d 30, let it run for 30 secs, watch column wrqm/s.

Author

Baron Schwartz

16 years ago

Pierre, there was no benchmark, this was on a live system. The results from iostat would be meaningless, as the IO to the log files is mixed with the IO to the data files. It is possible that writes in mode 2 would be merged, but the small writes will still result in read-around writes, in theory. In mode 1, merges should not only be unlikely, but guaranteed not to happen, unless there are multiple calls to pwrite() before an fsync(). I have not investigated whether this is the case. All of these points are rather more difficult to measure than the data I did measure, and would require more research. Aggregated statistics such as those provided by the kernel are hard to use.

Author

Baron Schwartz

16 years ago

Rob, I think you are right. I have never liked that it’s not really possible to control the logs and data files independently. The legacy configuration options for innodb_flush_method are a hodgepodge of things that don’t give enough choices; some of the combinations make me scratch my head and wonder why anyone would ever want that.