This is the second in a series on what’s seriously limiting MySQL in certain circumstances (links: part 1). In the first part, I wrote about single-threaded replication. Upstream from the replicas is the primary, which enables replication by writing a so-called “binary log” of events that modify data in the server. The binary log is a real limitation in MySQL.
The binary log is necessary not only for replication, but for point-in-time recovery, too. Given a backup and the corresponding binary log position, you can replay the binary log and roll forward the state of your server to a desired point in time.
But enabling the binary log reduces MySQL’s performance dramatically. It is not the logging itself that’s the problem — writing the log is usually not much additional work. It’s ensuring consistency and durability that is expensive. Flushing it to disk adds an fsync call for every transaction. And the server performs an XA transaction between InnoDB and the binary log. This adds more fsync calls, and causes mutex contention, and prevents group commit, and probably other things that aren’t coming to mind now.
The performance reduction can be an order of magnitude or more.
What’s the solution? I’m not sure I can summarize it concisely. There is a lot of complexity, and honestly I don’t understand some of the server internals fully enough to have a 50-thousand-foot view of it all. The binary logging and replication code, and its interaction with InnoDB, is difficult to understand. Kristian Nielsen has an extensive series of posts on group commit alone.
I think that a full fix might require significant architectural changes to MySQL. This will be hard. Maybe Drizzle is going in a good direction — time will tell. All of the solutions that I can think of are too simplistic. For example, doing replication through the InnoDB transaction log would work fine if a) all the data were in InnoDB, and b) InnoDB’s data didn’t have to be synchronized with the .frm files (and Drizzle has gotten rid of the .frm files, hooray), and c) privileges and other changes to the non-InnoDB data in MySQL were handled manually.
It could work if you just made sure that you didn’t change privileges or schema, but that’s a description of a pretty limited, clunky replication system from the user’s point of view. Still, I have considered it. There would need to be a mechanism of transporting the log files, and InnoDB would have to be put into a state of constant “recovery,” and it would have to be modified to be available read-only in this state so that it could be used for read queries. This can be done, of course. It’s just a matter of how hard it is.
It’s worth noting that PBXT does replication through its transaction logs, so there’s even precedent for this among MySQL storage engines. And there is Galera’s multi-master synchronization technology to look at, too.