Pretending to fix broken group commit

The problem of broken group commit has been discussed many times. The bug was reported 3.5 years ago and is still not fixed in MySQL 5.0/5.1 (and most likely will not be fixed in MySQL 5.1). The rough truth, though, is that this bug is very hard (if at all possible) to fix properly. In short, if you enable replication (log-bin) on a server without a BBU (battery backup unit), your InnoDB write performance under concurrent load drops significantly.
We have also written about it before; see “Group commit and real fsync” and “Group commit and XA”.

The problem is that InnoDB tries to keep the same order of transactions in the binary log and in its transaction log, and acquires a mutex to serialize writes to both logs. We basically propose to break this serialization: in XtraDB release3 (it will be announced soon; you can take the current version for testing from Launchpad) we introduce the --innodb-unsafe-group-commit mode. Here are the results with this option vs. without it (in transactions per second, more is better; this is a sysbench OLTP load).

I tested it on a Dell PowerEdge R900 with RAID 10 in WriteThrough mode to emulate the absence of a BBU. With a BBU you will not see this problem (all results will scale well), as the RAID controller's cache accumulates changes and the fsync() call returns immediately, without actually syncing data to disk.
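To make the fsync() argument concrete, here is a rough back-of-the-envelope sketch in Python. It is a toy model, not a measurement: the 10 ms sync time, the thread counts, and the function name are all assumptions made up for illustration, and it pretends that the log flush is the only cost of a commit.

```python
def commits_per_second(concurrent_commits: int, fsync_seconds: float, grouped: bool) -> float:
    """Toy model: commit throughput when the log flush dominates commit time."""
    # Serialized commits: every transaction waits for its own fsync().
    # Grouped commits: all transactions waiting at the same moment share one fsync().
    fsyncs = 1 if grouped else concurrent_commits
    return concurrent_commits / (fsyncs * fsync_seconds)


# Hypothetical cost of a real disk sync (no BBU); not taken from the benchmark.
ASSUMED_FSYNC_SECONDS = 0.01

for threads in (1, 4, 16):
    serialized = commits_per_second(threads, ASSUMED_FSYNC_SECONDS, grouped=False)
    grouped = commits_per_second(threads, ASSUMED_FSYNC_SECONDS, grouped=True)
    print(f"{threads:>2} threads: serialized {serialized:.0f}/s, grouped {grouped:.0f}/s")
```

In this model the serialized case stays flat no matter how many threads you add, while the grouped case grows with concurrency. With a BBU the sync time is close to zero, so the difference between the two mostly disappears.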

So what can go wrong if you run with --innodb-unsafe-group-commit? As I said, there is a possibility that transactions in the binary log will be in a different order than in the InnoDB transaction log. Why is this bad? For example, if the box crashes and InnoDB does recovery, transactions on slaves may have been executed in a different order; that is, you MAY get slaves out of sync with the master. Is the performance benefit worth it? It’s up to you, but I think it is better to have this choice than not to have it.
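Why a different order matters can be shown with a tiny Python sketch. It only illustrates the consequence of two logs disagreeing on order; it does not show that the option actually produces such a case. The two "transactions" and the starting value are hypothetical.

```python
def apply_in_order(value, transactions):
    """Apply each transaction to the value, one after another."""
    for transaction in transactions:
        value = transaction(value)
    return value


def add_ten(value):
    return value + 10


def double(value):
    return value * 2


# Assumed scenario: the two logs recorded the same transactions in different order.
innodb_log_order = [add_ten, double]  # what the master ends up with after recovery
binlog_order = [double, add_ten]      # what the slaves replay

master = apply_in_order(1, innodb_log_order)  # (1 + 10) * 2
slave = apply_in_order(1, binlog_order)       # 1 * 2 + 10
print("master:", master, "slave:", slave)
```

If the two transactions commute (for example, they touch unrelated rows), the order does not matter and both sides end up identical, which may be part of why the problem is hard to hit in practice.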

I do not urge you to use --innodb-unsafe-group-commit; I recommend having a BBU on your RAID. But if you don’t have one and the write load on the server is significant, it may be worth trying --innodb-unsafe-group-commit.

Comments

  1. says

    Really, why have two logs? This whole problem seems like an argument for replicating directly from the InnoDB journal, though I imagine it would be a substantial amount of work to add keys & column metadata. Requiring BBU or any other hardware for that matter is a non-starter in virtual environments like Amazon where you don’t control the actual machines.

  2. Sergei Golubchik says

    I don’t see how this could cause desynchronized slaves.
    Could you show it step-by-step ?

  3. Sergei Golubchik says

    Ah, okay. If you also break XA – then it’s possible.
    But without XA you can get desynchronized slaves even if you maintain strict commit order :)

  4. Mark Callaghan says

    InnoDB stays in sync with the binlog by getting a list of committed transactions from the binlog during crash recovery and doing a commit or rollback for in-doubt transactions (in-doubt == transaction was prepared but not committed). How does this option affect that?

  5. says

    Sergei,

    The possibility here is theoretical. We have done some stress tests running many transactions that should break replication if they are replayed in the wrong order… and things just work fine.

    It is just that we’re not quite sure it will work in 100% of cases.

  6. Sergei Golubchik says

    If you’re not sure, you may suspect that a specific sequence of actions will make slaves desynchronized. Such as “first transaction does this. second does that. first starts committing. syncs, and this very moment the second does this-and-that. We pull the plug…” and so on. I’m asking you to show this sequence.

    Because I don’t see how this could ever cause slave desynchronization, even theoretically. As far as I understand, it *only* affects innodb hot backup.

  7. Nikola says

    Hello,

    thanks for the nice article. I have a question for you that’s not related to this post. I am making an album script, and I want it to have a feature where users can set a different order for their photos and subalbums per album. Should I do it with two tables:
    album(id,name,path,col2,col3,col4,col5) and albumorder(id,photoorder,albumorder) or one table album(id,name,photoorder,albumorder,col2,col3,col4,col5)?

  8. Vadim says

    David,

    Yes, I saw your patch. We reviewed it, but we actually can’t say if it is better or worse than our solution. As I understand it, your patch has the same danger for innobackup as ours.

  9. says

    Hi Vadim,

    I don’t believe that my patch does have a problem with innobackup. It preserves the order of binlog and innodb commits, although it doesn’t preserve the order of innodb prepare and binlog. This isn’t a problem for crash recovery, since mysql scans the binlog and commits any pending prepares in the same order as commits in the binlog file. Any prepare that doesn’t have a corresponding commit in the binlog file is rolled back.

    Innobase has been annoyingly silent on whether innobackup cares about prepare/binlog ordering, but it seems unlikely. It would seem like a bug to me if innobackup treated prepares differently than mysql does itself during crash recovery.

  10. Vadim says

    Hi David,

    That’s actually the problem: I can’t be sure about either patch. Both require intensive production testing. That’s why we made our own; we know how it is supposed to work (more or less), and it is actually shorter :) But I believe we will consider it again and again.
