MySQL stops handling requests when restarting mysql on other nodes --- donor/desync

  • #1

    In our cluster, a node will experience an issue from time to time. When this happens, nodes 2 and 3 crash, resulting in:
    Code:
     ERROR! MySQL (Percona XtraDB Cluster) is not running, but PID file exists
    If I restart mysql on the failed nodes, our Node 1 will no longer service mysql requests. Node 1 will show
    Code:
    | wsrep_local_state_comment  | Donor/Desynced |
    until Node 2 and Node 3 receive updates. After this, MySQL is OK.

    This is an issue because I must wait until late at night to restart nodes 2 and 3, so that the desync does not take our website down during the day.
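
    For reference, a rough sketch of how the state can be watched on node 1 while this happens (these are standard Galera status variables; which ones are worth watching here is just my assumption):
    Code:
     mysql> SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';  -- Synced vs. Donor/Desynced
     mysql> SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';         -- how many nodes node 1 still sees
     mysql> SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue';     -- replication backlog on this node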

  • #2
    I think you are having a split-brain situation.
    The rule is that after any kind of failure, a Galera node will
    consider itself part of the primary partition if it can still see a
    majority of the nodes that were in the cluster before the failure.
    Majority > 50%.

    So if you have 3 nodes and one goes away, the 2 remaining are fine.
    If you have 3 nodes and 2 go away simultaneously, the 1 remaining must
    assume it is the one having problems and will go offline.

    If you have 3 nodes and 1 goes away, then you have a 2-node cluster. This
    is not good. Now if any 1 node goes away, the other one alone is not
    in the majority, so it will have to go offline. The same is true if you
    have a 4-node cluster and simultaneously lose 2 nodes. Etc...

    But all is not lost. The node is still there, and if you as a human
    being know it is the right thing to do, you can run a manual
    command to re-activate that node (such as the command given by
    Haris, or just a restart, etc...).


    There was a whole article on unknown commands and split-brain situations on the Percona site, but I can't seem to find it.
    To restore the cluster, execute the command below on the working node. It re-establishes that node as the primary component, and the previously crashed nodes should then rejoin the cluster when they are restarted.
    Code:
     mysql> SET GLOBAL wsrep_provider_options='pc.bootstrap=true';
    Variable wsrep_ready: this shows whether the node is ready to accept queries. If it is OFF, almost all queries will fail with ERROR 1047 (08S01) Unknown Command (unless the wsrep_on variable is set to 0).
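
    As a rough sketch of what to check before and after bootstrapping (the expected values in the comments are just what a healthy primary node normally reports):
    Code:
     mysql> SHOW GLOBAL STATUS WHERE Variable_name IN
         -> ('wsrep_cluster_status','wsrep_ready','wsrep_local_state_comment');
     -- a node stuck outside the primary component reports non-Primary / OFF here;
     -- after pc.bootstrap=true on the surviving node it should go back to
     -- Primary / ON / Synced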




    • #3
      Hey zmahomedy,
      Thanks for the reply. Sorry, I worded it incorrectly.

      Node 1 always keeps serving mysql queries (r/w) until I restart the 2 dead nodes. Restarting the dead nodes is what prompts the good node to temporarily go offline while it sends the sync to the bad nodes. After a full SST is sent, I have a 3-node cluster once again.
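
      For what it's worth, a minimal sketch of the setting that decides how blocking the donor is during that SST (assuming an otherwise default PXC setup):
      Code:
       mysql> SHOW GLOBAL VARIABLES LIKE 'wsrep_sst_method';
       -- with xtrabackup-v2 the donor stays readable for most of the transfer;
       -- with rsync the donor is blocked completely while the SST runs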



      • #4
        The question is why node2 and node3 are crashing. Usually when a node is restarted, a full SST is not needed, just a fast IST. So perhaps your nodes are shutting down due to some inconsistency. The answer should be in their error logs.
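
        A rough way to check, from the SQL side, whether IST was even possible (both are standard Galera status variables; the interpretation in the comments assumes a typical setup):
        Code:
         mysql> SHOW GLOBAL STATUS LIKE 'wsrep_local_cached_downto';  -- oldest write-set still held in node1's gcache
         mysql> SHOW GLOBAL STATUS LIKE 'wsrep_last_committed';       -- newest committed write-set
         -- a restarted node can use IST only if its last known seqno (grastate.dat)
         -- is not older than wsrep_local_cached_downto on the donor; otherwise it
         -- falls back to a full SST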



        • #5
          OK I think I have located the culprit. It looks like converting the MyISAM tables to InnoDB has caused the other nodes to crash. We have haproxy sending traffic primarily to node 1, so the conversion runs on node 1 and the changes then sync to nodes 2 and 3. Nodes 2 and 3 break after this and require a full SST. I made a new post here detailing the issue.
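
          For anyone hitting the same thing, a sketch of what I mean (the table name is hypothetical; the second statement just shows the setting that governs MyISAM replication):
          Code:
           mysql> ALTER TABLE mydb.some_table ENGINE=InnoDB;            -- DDL like this is replicated to all nodes
           mysql> SHOW GLOBAL VARIABLES LIKE 'wsrep_replicate_myisam';  -- MyISAM row changes are not replicated unless this is ON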
          Last edited by shockwavecs; 07-01-2014, 04:14 PM.
