Analyzing the distribution of InnoDB log file writes

I recently did a quick analysis of the distribution of writes to InnoDB’s log files. On a high-traffic commodity MySQL server running Percona XtraDB for a gaming workload (mostly inserts to the “moves” table), I used strace to gather statistics about how the log file writes are distributed in terms of write size. InnoDB writes to the log files in multiples of 512 bytes. Mark Callaghan explained this and some of its performance implications here. How many big writes does InnoDB do, and how many small writes?

First, I found out the file descriptor numbers of the log files:

The file descriptors are 8 and 9. We’ll need to capture writes to both of those; InnoDB round-robins through them. The following grabs the write sizes out of 100k calls to pwrite() and aggregates them:

I could have done it better with a little more shell scripting, but the output from this was enough for me to massage into a decent format with another step or two! Here is the final result:

So, in sum, we see that about 3/4ths of InnoDB log file writes on this workload are 512 or 1024 bytes. (It might vary for other workloads.) Now, what does this actually mean? There are a lot of interesting and complex things to think about here and research further:

  • Most writes are less than 4k, but the operating system page size is 4k. If the page isn’t in the OS buffer cache, then the write will cause a read-around write — the page will have to be read, modified, and then written again, instead of just written. Vadim did some benchmarks earlier showing that the log files need to be in the OS cache for best performance (sorry, can’t find the link right now). This makes intuitive sense given that the writes are smaller than 4k.
  • This server has innodb_flush_log_at_trx_commit set to 2. This means each transaction does a log file write, the same as for a setting of 1. If it’s set to 0, the distribution may be quite different — the writes might accumulate and become much larger.
  • What happens if the log buffer is actually smaller than a transaction? That’s another topic for research. I don’t know off the top of my head.
  • Is the distribution meaningful for determining the necessary log buffer size? The largest write above is just under 40k, which might seem to indicate that only 64k or so of log buffer would be large enough — but is it really? We’d need to benchmark and see. Peter and I were just talking, trying to remember some complex behavior of how space in the buffer is reserved for writes. We could not quite recall what that is, but there is some subtlety. This needs more research. It’s possible that even if the largest write is quite small, a small log buffer size wouldn’t be good and could cause additional lock contention. (There is a single mutex around the log buffer.) We probably need to benchmark to learn more about this.

In the end, the distribution is a simple observation to make, but the InnoDB redo log system is intricate. It’s not something to guess about, but rather something to measure and study more deeply. Perhaps we can follow this up with some more benchmarks or observations under different InnoDB settings and different workloads. Or then again, maybe Yasufumi will read this post when he returns from vacation and already know all the answers by heart!

Share this post

Comments (5)

  • Rob Wultsch Reply

    IIRC the current facebook patch set includes the ability to set innodb log writes to O_DIRECT. It seem like the flushing behavior of log writes should follow the behavior set by innodb_flush_method.

    July 17, 2010 at 12:26 am
  • Pierre Beck Reply

    Given the sequential nature of logs, most of the writes should be merged by the kernel. Possibly many merges in both innodb_flush_log_at_trx_commit mode 0 and 2. In mode 1, merges are very unlikely. Could you add merge counts to the benchmark? e.g. iostat -x -d 30, let it run for 30 secs, watch column wrqm/s.

    July 17, 2010 at 3:54 am
  • Baron Schwartz Reply

    Pierre, there was no benchmark, this was on a live system. The results from iostat would be meaningless, as the IO to the log files is mixed with the IO to the data files. It is possible that writes in mode 2 would be merged, but the small writes will still result in read-around writes, in theory. In mode 1, merges should not only be unlikely, but guaranteed not to happen, unless there are multiple calls to pwrite() before an fsync(). I have not investigated whether this is the case. All of these points are rather more difficult to measure than the data I did measure, and would require more research. Aggregated statistics such as those provided by the kernel are hard to use.

    July 19, 2010 at 5:59 am
  • Baron Schwartz Reply

    Rob, I think you are right. I have never liked that it’s not really possible to control the logs and data files independently. The legacy configuration options for innodb_flush_method are a hodgepodge of things that don’t give enough choices; some of the combinations make me scratch my head and wonder why anyone would ever want that.

    July 19, 2010 at 6:01 am
  • tom_fans Reply

    truth be told, i intend to translate your article at first, but it’s difficult for me to read your english article. so give up now !

    August 18, 2010 at 7:31 pm

Leave a Reply