September 1, 2014

Innodb Double Write

One of the more interesting techniques InnoDB uses is called “doublewrite”. It means InnoDB writes data twice when it performs tablespace writes; writes to the log files are done only once.

So why is doublewrite needed? It is needed to achieve data safety in case of partial page writes. InnoDB does not log full pages to the log files; instead it uses what is called “physiological” logging, which means a log record contains the page number for the operation, the operation data (i.e. update the row), and log sequence information. Such a logging structure is great because it requires less data to be written to the log, but it requires pages to be internally consistent. It does not matter which page version it is: it could be the “current” version, in which case InnoDB will skip the page update operation, or the “former” one, in which case InnoDB will perform the update. If the page is inconsistent, recovery can’t proceed.
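
To illustrate why internal consistency matters: physiological redo is only safe when the whole page is at a single version, so recovery can compare the page’s log sequence number (LSN) with the record’s LSN and either skip or apply the change. Here is a toy model of that decision; the names and fields are illustrative, not InnoDB’s actual structures:

```python
from dataclasses import dataclass, field

@dataclass
class Page:
    lsn: int = 0                        # LSN of the last change applied to this page
    rows: dict = field(default_factory=dict)

@dataclass
class LogRecord:
    page_no: int                        # which page the operation applies to
    lsn: int                            # log sequence number of this change
    row_id: int
    value: str                          # the logical operation: "set row to value"

def redo(rec: LogRecord, page: Page) -> None:
    """Physiological redo: replay the logical change only if the page predates it."""
    if page.lsn >= rec.lsn:
        return                          # "current" version: change already there, skip
    page.rows[rec.row_id] = rec.value   # "former" version: apply the update
    page.lsn = rec.lsn

page = Page()
redo(LogRecord(page_no=1, lsn=10, row_id=5, value="new"), page)
redo(LogRecord(page_no=1, lsn=10, row_id=5, value="new"), page)  # idempotent: skipped
```

The skip-or-apply choice relies entirely on the page’s LSN being trustworthy, which is exactly what a half-written page breaks.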

Now let’s talk a bit about partial page writes: what they are and why they happen. A partial page write is when a page write request submitted to the OS completes only partially. For example, out of a 16K InnoDB page only the first 4KB is updated and the other parts remain in their former state. Most typically, partial page writes happen on power failure. They can also happen on an OS crash: there is a chance the operating system will split your 16K write into several writes and the failure happens between their execution. One reason for splitting is file fragmentation; most file systems use 4K block sizes by default, so a 16K page could span more than one fragment. Also, if software RAID is used, a page may fall on a stripe border, requiring multiple IO requests. The same happens with hardware RAID on power failure if it does not have a battery-backed-up cache. If a single write is issued to the disk itself, it should in theory complete even if power goes down, as there should be enough power accumulated inside the drive to finish it. I honestly do not know if this is always the case; it is hard to check, as it is not the only reason for partial page writes. I just know they tend to happen, and before the InnoDB doublewrite was implemented I had a couple of data corruptions due to them.
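
A torn write like this is easy to model: build a page that carries a checksum of its body, then splice the first file-system block of a new version onto the rest of an old version. This is an illustrative sketch, not InnoDB’s actual page format or checksum algorithm:

```python
import zlib

PAGE_SIZE = 16 * 1024   # InnoDB default page size
FS_BLOCK = 4 * 1024     # typical file system block size

def make_page(payload: bytes) -> bytes:
    """Build a page that stores a CRC32 of its body in the last 4 bytes."""
    body = payload.ljust(PAGE_SIZE - 4, b"\x00")
    return body + zlib.crc32(body).to_bytes(4, "big")

def is_consistent(page: bytes) -> bool:
    body, stored = page[:-4], int.from_bytes(page[-4:], "big")
    return zlib.crc32(body) == stored

old_page = make_page(b"A" * 100)
new_page = make_page(b"B" * 100)

# Simulate a torn write: only the first 4K block of the new page reached the
# disk before the crash; the rest of the page still holds the old contents.
torn_page = new_page[:FS_BLOCK] + old_page[FS_BLOCK:]
```

The torn page fails its checksum, so the damage is detectable, but nothing in the page itself lets you reconstruct either the old or the new version. That is the gap doublewrite fills.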

So how does doublewrite work? You can think of it as one more short-term log file allocated inside the InnoDB tablespace; it contains space for 100 pages. When InnoDB flushes pages from the buffer pool, it does so multiple pages at a time. Several pages are written to the doublewrite buffer (sequentially), fsync() is called to ensure they make it to disk, then the pages are written to their real locations and fsync() is called a second time. On recovery, InnoDB checks the doublewrite buffer contents and the pages in their original locations. If a page is inconsistent in the doublewrite buffer, it is simply discarded; if it is inconsistent in the tablespace, it is recovered from the doublewrite buffer.

How much does the doublewrite buffer affect MySQL performance? Even though doublewrite requires each page to be written twice, its overhead is far less than double. The write to the doublewrite buffer is sequential, so it is pretty cheap. It also allows InnoDB to save on fsync()s: instead of calling fsync() for each page write, InnoDB submits multiple page writes and then calls fsync(), which allows the operating system to optimize the order in which writes are executed and to use multiple devices in parallel. This optimization could be used without doublewrite, though; it was just implemented at the same time. So in general I would expect no more than a 5-10% performance loss due to the use of doublewrite.

Can you disable doublewrite? If you do not care about your data (i.e. slaves on RAID0), or if your file system guarantees that no partial page writes can occur, you can disable doublewrite by setting innodb_doublewrite=0. It is, however, not worth the trouble in most cases.
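
If you do decide to disable it, a minimal my.cnf fragment would look like the following sketch (skip-innodb_doublewrite is the equivalent option-file spelling of the same boolean setting):

```ini
[mysqld]
# Disable the doublewrite buffer. Only safe if the file system
# (e.g. ZFS) guarantees atomic 16K page writes.
innodb_doublewrite = 0
```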

About Peter Zaitsev

Peter managed the High Performance Group within MySQL until 2006, when he founded Percona. Peter has a Master's Degree in Computer Science and is an expert in database kernels, computer hardware, and application scaling.

Comments

  1. Vadim says:

    markgoat.

    Let me try to answer your question.

    The problem comes from the fact that when we issue pwrite() and there is a crash during this operation, there is no way to check the state of the operation. It may happen that for a 16K operation we wrote only 4K or 8K.

    So we may end up in a situation where half of the page contains new information and the other half old information. InnoDB will of course detect the corruption using the checksum, but that won’t help much, as the page is broken.

    So the solution is to have 2 copies of the page. If we crash while writing one of the copies, we always have another consistent copy which we can work with.

  2. Ratheesh Kaniyala says:

    Hi Baron,

    Could you please help me understand this a bit better. Is this doublewrite buffer a special file on disk, and can we see it on the file system using an ls command?
    Or is it just a space within ibdata1 and/or ib_logfile? Because when you say it is useful during recovery, it has to be a file within the MySQL space.


    Ratheesh

  3. Baron Schwartz says:

    The redo logs in InnoDB don’t have complete page images. This is different from some other database servers.

  4. markgoat says:

    Sorry, I still don’t understand why we need this.
    Shouldn’t each entry in the log be redone when doing recovery? I think even if the page is “partially written” it doesn’t matter; the redo log knows the after image. So regardless of the page itself, redo will do every change again. If part of the page was not written due to a crash, redo will do the write? In other words, the “consistent page” is saved in the redo log. Unless you mean a “partial write” will modify something not saved in the redo log??
    I mean, EVERY change is in the redo log entries. A partial write, in my mind, means it failed to make some changes to the disk. But when we apply the redo log, we will redo the planned actions.
    For example, say we have 4 blocks per page, we update 2 rows in two of the blocks, and then we have two entries in the redo log, followed by a commit record. As a result of a partial write, only block 1 was updated; blocks 2, 3 and 4 were not. But I don’t see any issue in this case, because redo will reimage block 1 and block 3 correctly? We don’t care whether the disk is consistent or not; this is what redo is in theory supposed to handle, isn’t it?
    So I think I still don’t understand why InnoDB needs “doublewrite”. I am new to InnoDB and don’t know any internal details; I’m just thinking about this from a textbook point of view ^_^

    Thank you. It has been a long time since you posted this great article, so I don’t know if you can still answer me. I have been thinking about this for days but can’t understand it yet.

  5. Baron Schwartz says:

    Zonker, the doublewrite buffer works as kind of a two-phase commit. It allows recovering a failed write. The write either succeeded or not, and if it didn’t, on recovery it will be replayed from either the doublewrite buffer or the redo logs. It doesn’t work the way you think, and the scenario you listed can’t happen.

  6. Nick Peirson says:

    Zonker,

    As I understand it there are two failure modes:

    1. A write to the doublewrite buffer fails. In this case the data in the table hasn’t changed, so the redo log can be applied to update the data.

    2. If the write to the doublewrite buffer succeeds and the write to the table fails, the data can be read from the doublewrite buffer to update the table.

    If we were writing directly to the table, the failed write would’ve left the table in an inconsistent state where the redo log couldn’t be applied and there’s been no successful write to a buffer that we can read the data back from. The doublewrite buffer means that we always have a way of bringing the data up to date after a failed write, regardless of failure mode.

    I haven’t looked at the internals, so this is an educated guess based on the post and comments. I’d be grateful if someone more knowledgeable could confirm.

  7. Zonker Harris says:

    I guess I’m not buying it.

    I write to the doublewrite buffer: AAAAAAAAAA

    then flush to the table: AAAAAAAAAA

    fine, all’s fine. But now I write again to the doublewrite buffer and the write fails halfway:

    BBBBB…..

    We “recover” and find that the doublewrite buffer is BBBBB….., with a bad checksum, or however we determine that it isn’t complete, so we “just discard it” per the mysql docs.

    Really? This is good?

    Now I’ve got AAAAAAAAAA in the table where I expect to have BBBBBBBBBB. AAAAAAAAAA may be “consistent”, but it isn’t “correct”, and I still have a problem. If that data is correlated to other data, I still have to recover from a backup to get back to true consistency, i.e. my data all being consistent with each other.

    So I don’t really see how this benefits me.

    To take the example further, if doublewrite is good, why wouldn’t triplewrite or fourplewrite be even better? Answer: It isn’t, for exactly the reasons I’ve described above. You still have bad data. “Consistent”, perhaps, but not correct.

    And now the obligatory “Or am I simply not understanding this?”

  8. Toby says:

    Could a checksum achieve the same end?

  9. peter says:

    Toby,

    Checksums are also used. Checksums allow finding which pages are corrupted, i.e. partially written, but checksums can’t help to recover a page once it is corrupted.

  10. k2s says:

    It is a great article, but it is difficult to find because you do not mention the keywords skip-innodb_doublewrite and innodb_doublewrite. Maybe this comment will make it available in search.

  11. BTW: if MySQL is running on ZFS, then you can safely disable InnoDB doublewrite, as ZFS always guarantees that either the entire write completes or nothing is updated.

  12. qihua says:

    Nice article. But why doesn’t Oracle need it?

  13. peter says:

    There are different techniques which can be used instead of double write buffer.

  14. Ian says:

    If this buffer has a capacity of ~2MB, how is a TEXT/BLOB that is larger than that size handled?
