How Does Semisynchronous MySQL Replication Work?Baron Schwartz
With the recent release of Percona XtraDB Cluster, I am increasingly being asked about MySQL’s semi-synchronous replication. I find that there are often a number of misconceptions about how semi-synchronous replication really works. I think it is very important to understand what guarantees you actually get with semi-synchronous replication, and what you don’t get.
The first thing to understand is that despite the name, semi-synchronous replication is still asynchronous. Semi-synchronous is actually a pretty bad name, because there is no strong coupling between a commit on the master and a commit on the replicas. To understand why, let’s look at what truly synchronous replication means. In truly synchronous replication, when you commit a transaction, the commit does not complete until all replicas have also committed successfully. In MySQL’s semi-synchronous replication, the commit completes before the transaction is even sent to any of the replicas. Therefore, by definition the transaction cannot have committed on any of the replicas. If there’s any problem after the commit happens on the master, it’s possible that the replicas won’t get the transaction, and even after they do, there’s no guarantee they can apply and commit it successfully themselves (duplicate key error, anyone?). If any of these problems happens, it’s too late–the commit is already permanent on the master, and can’t be rolled back.
What should semi-synchronous replication be called instead? I believe that it should be called delayed-acknowledgment commits, because this is what actually happens. When a transaction commits on the master, the commit proceeds as normal, and the transaction is sent to the replicas as normal, but the client connection to the master is not told that the commit has completed until after at least one replica has acknowledged receiving the transaction.
Another way to look at the same thing is that semi-synchronous replication actually forces the client to be synchronized, not the replicas. The client is forced to wait until the transaction has been sent to one of the replicas, but the commit on the master is not forced to wait at all, nor are replicas forced to do anything. The commit has already happened on the master, so the cat’s out of the bag and there’s no way to force replicas to do anything. As a result, the effect is that the client’s activity is throttled so that it cannot outpace the replica’s ability to fetch updates from the master. Have you seen the bumper sticker that says “don’t drive faster than your Guardian Angel can fly?” That is the effect of this throttling.
Semi-synchronous replication also does not guarantee that your replicas will not become delayed. The client connection is forced to wait until at least one of the replicas has retrieved the transaction, but not until the transaction has actually been applied to the replica. As you probably know, it is perfectly possible to send a very long transaction to the replica in a matter of milliseconds. The replica will take a long time to apply this transaction to its own data, and during that time, it will be delayed relative to the master. However, other transactions can continue committing and sending their changes to the replica, because the process of retrieving changes from the master and applying them run in separate threads on the replica.
Finally, semi-synchronous replication does not provide strong guarantees against data loss. What do I mean by a strong guarantee against data loss? I consider the safety of my data strongly guaranteed when at least one other server must have a copy of the data before it can be committed on the master. However, that is not what happens in semi-synchronous replication. And if there is an error in semi-synchronous replication, such as a crash at the wrong moment, or a timeout, then even the throttling is abandoned, and everything defaults back to the traditional mode of replication.
What does semi-synchronous replication guarantee me then? If there are no errors or timeouts, then the guarantee is essentially that only one transaction per client is likely to be lost if the master crashes.
I do not mean to sound negative, or to send the message that semi-synchronous replication is not useful. It is useful, but if you misunderstand it, you could be relying on a strong guarantee that is not actually provided.
If you want to learn more about this, then I encourage you to read the relevant section of the MySQL manual. But read carefully, for example, the following sentences:
When a commit returns successfully, it is known that the data exists in at least two places (on the master and at least one slave). If the master commits but a crash occurs while the master is waiting for acknowledgment from a slave, it is possible that the transaction may not have reached any slave.
Finally, I would be interested to hear how many people are actually running semi-synchronous replication in production. I have a feeling that very few people are, even though a lot of people seem to have heard about it. What are your experiences?