cluster crashes on Node Crash


    Hi

    I set up the latest version of XtraDB Cluster on 4 servers last night.

    The first machine contained an existing 2 GB database and was started with gcomm://

    I added two more machines to the cluster, and both synced (wsrep_sst_method=xtrabackup) without any issues.

    This setup worked fine under a decent read/write load for a couple of hours. All writes were still being sent to the original DB server (the master node).

    I then added another node, which also synced with the cluster seamlessly.

    Since an even number of nodes is not recommended, and because I wanted to test how the cluster responds when a node crashes, I killed the mysqld process on the 4th node.
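
As a rough illustration of the quorum arithmetic behind the "even number of nodes" advice, here is a minimal sketch. It assumes every node carries equal weight and that a group of nodes stays primary only if it holds strictly more than half of the previous membership. `has_quorum` is a hypothetical helper written for this post, not part of Galera.

```python
def has_quorum(surviving_nodes: int, previous_cluster_size: int) -> bool:
    """Return True if the survivors hold a strict majority of the previous membership.

    Assumption: all nodes have equal weight and the rule is a simple majority.
    """
    if previous_cluster_size <= 0:
        raise ValueError("previous_cluster_size must be positive")
    # Strict majority: exactly half is NOT enough, which is why even splits hurt.
    return 2 * surviving_nodes > previous_cluster_size


if __name__ == "__main__":
    # One node of four crashes: three survivors out of four.
    print(has_quorum(3, 4))
    # A four-node cluster split evenly in two: neither half has a majority.
    print(has_quorum(2, 4))
    # A node that can see only itself out of four.
    print(has_quorum(1, 4))
```

Under these assumptions, losing one node out of four should still leave a majority, so a node reporting itself alone and non-Primary points to the survivors losing contact with each other as well.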

    Unfortunately, this caused all sorts of havoc.
    The other three nodes went into a weird state and most SQL commands started failing.
    Even the "use dbname" query returned a "command not found" error.


    Looking at the logs on all the nodes, I found that they had all lost connectivity to each other and were stuck trying to reconnect indefinitely.
    Logs from node 3 are pasted below:

    121019 4:18:29 [Note] WSREP: Flow-control interval: [8, 16]
    121019 4:18:29 [Note] WSREP: Received NON-PRIMARY.
    121019 4:18:29 [Note] WSREP: New cluster view: global state: eb1d8efb-1967-11e2-0800-d9cf063d7dbe:86293, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 2
    121019 4:18:29 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
    121019 4:18:31 [Note] WSREP: (27914b3f-1969-11e2-0800-946a9ab9a970, 'tcp://0.0.0.0:4567') reconnecting to 4a29ea43-1968-11e2-0800-ef510190974f (tcp://xx.xx.xx.x2:4567), attempt 0
    121019 4:18:48 [Note] WSREP: (27914b3f-1969-11e2-0800-946a9ab9a970, 'tcp://0.0.0.0:4567') reconnecting to d784b124-1970-11e2-0800-073823812a38 (tcp://xx.xx.xx.x2:4567), attempt 1380
    121019 4:19:16 [Note] WSREP: (27914b3f-1969-11e2-0800-946a9ab9a970, 'tcp://0.0.0.0:4567') reconnecting to 4a29ea43-1968-11e2-0800-ef510190974f (tcp://xx.xx.xx.x2:4567), attempt 30
    121019 4:19:33 [Note] WSREP: (27914b3f-1969-11e2-0800-946a9ab9a970, 'tcp://0.0.0.0:4567') reconnecting to d784b124-1970-11e2-0800-073823812a38 (tcp://xx.xx.xx.x2:4567), attempt 1410
    121019 4:19:57 [Note] WSREP: (27914b3f-1969-11e2-0800-946a9ab9a970, 'tcp://0.0.0.0:4567') reconnecting to 3911cf04-1975-11e2-0800-0427b9d45b0b (tcp://xx.xx.xx.x1.212:4567), attempt 90
    121019 4:20:01 [Note] WSREP: (27914b3f-1969-11e2-0800-946a9ab9a970, 'tcp://0.0.0.0:4567') reconnecting to 4a29ea43-1968-11e2-0800-ef510190974f (tcp://xx.xx.xx.x2:4567), attempt 60
    121019 4:20:18 [Note] WSREP: (27914b3f-1969-11e2-0800-946a9ab9a970, 'tcp://0.0.0.0:4567') reconnecting to d784b124-1970-11e2-0800-073823812a38 (tcp://xx.xx.xx.x4:4567), attempt 1440
    121019 4:20:46 [Note] WSREP: (27914b3f-1969-11e2-0800-946a9ab9a970, 'tcp://0.0.0.0:4567') reconnecting to 4a29ea43-1968-11e2-0800-ef510190974f (tcp://xx.xx.xx.x2:4567), attempt 90
    121019 4:21:03 [Note] WSREP: (27914b3f-1969-11e2-0800-946a9ab9a970, 'tcp://0.0.0.0:4567') reconnecting to d784b124-1970-11e2-0800-073823812a38 (tcp://xx.xx.xx.x4:4567), attempt 1470
    121019 4:21:31 [Note] WSREP: (27914b3f-1969-11e2-0800-946a9ab9a970, 'tcp://0.0.0.0:4567') reconnecting to 4a29ea43-1968-11e2-0800-ef510190974f (tcp://xx.xx.xx.x2:4567), attempt 120
    121019 4:21:48 [Note] WSREP: (27914b3f-1969-11e2-0800-946a9ab9a970, 'tcp://0.0.0.0:4567') reconnecting to d784b124-1970-11e2-0800-073823812a38 (tcp://xx.xx.xx.x4:4567), attempt 1500
    121019 4:21:57 [Note] WSREP: (27914b3f-1969-11e2-0800-946a9ab9a970, 'tcp://0.0.0.0:4567') reconnecting to 3911cf04-1975-11e2-0800-0427b9d45b0b (tcp://xx.xx.xx.x1:4567), attempt 120
    121019 4:22:16 [Note] WSREP: (27914b3f-1969-11e2-0800-946a9ab9a970, 'tcp://0.0.0.0:4567') reconnecting to 4a29ea43-1968-11e2-0800-ef510190974f (tcp://xx.xx.xx.x2:4567), attempt 150
    121019 4:22:33 [Note] WSREP: (27914b3f-1969-11e2-0800-946a9ab9a970, 'tcp://0.0.0.0:4567') reconnecting to d784b124-1970-11e2-0800-073823812a38 (tcp://xx.xx.xx.x4:4567), attempt 1530
    121019 4:23:01 [Note] WSREP: (27914b3f-1969-11e2-0800-946a9ab9a970, 'tcp://0.0.0.0:4567') reconnecting to 4a29ea43-1968-11e2-0800-ef510190974f (tcp://xx.xx.xx.x2:4567), attempt 180
    121019 4:23:18 [Note] WSREP: (27914b3f-1969-11e2-0800-946a9ab9a970, 'tcp://0.0.0.0:4567') reconnecting to d784b124-1970-
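
To see at a glance which peers node 3 was retrying, the reconnect lines can be summarised per peer UUID. The following is a minimal sketch; the regular expression is an assumption based only on the excerpt above, the helper names are hypothetical, and the unwrapping step assumes every real log entry starts with a six-digit date and that line breaks fall in the middle of a token.

```python
import re

# Pattern inferred from the pasted excerpt; other Galera versions may log differently.
RECONNECT_RE = re.compile(
    r"reconnecting to (?P<peer>[0-9a-f-]+) \(tcp://(?P<addr>[^)]+)\), attempt (?P<attempt>\d+)"
)


def unwrap(log_text: str) -> list:
    """Join hard-wrapped continuation lines back onto the entry they belong to."""
    entries = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue
        # A new entry starts with a date such as "121019 "; anything else is a continuation.
        if re.match(r"\d{6} ", line) or not entries:
            entries.append(line)
        else:
            entries[-1] += line
    return entries


def max_attempts_per_peer(log_text: str) -> dict:
    """Map each peer UUID to the highest reconnect attempt number seen."""
    result = {}
    for entry in unwrap(log_text):
        match = RECONNECT_RE.search(entry)
        if match:
            peer = match.group("peer")
            attempt = int(match.group("attempt"))
            result[peer] = max(result.get(peer, 0), attempt)
    return result


# Three entries copied from the log above, still hard-wrapped.
SAMPLE = """
121019 4:18:31 [Note] WSREP: (27914b3f-1969-11e2-0800-946a9ab9a970, 'tcp://0.0.
0.0:4567') reconnecting to 4a29ea43-1968-11e2-0800-ef510190974f (tcp://xx.xx.xx.x2
:4567), attempt 0
121019 4:18:48 [Note] WSREP: (27914b3f-1969-11e2-0800-946a9ab9a970, 'tcp://0.0.
0.0:4567') reconnecting to d784b124-1970-11e2-0800-073823812a38 (tcp://xx.xx.xx.x2
:4567), attempt 1380
121019 4:19:16 [Note] WSREP: (27914b3f-1969-11e2-0800-946a9ab9a970, 'tcp://0.0.
0.0:4567') reconnecting to 4a29ea43-1968-11e2-0800-ef510190974f (tcp://xx.xx.xx.x2
:4567), attempt 30
"""

if __name__ == "__main__":
    for peer, attempts in max_attempts_per_peer(SAMPLE).items():
        print(peer, attempts)
```

Run over the full excerpt, this kind of summary would list three distinct peer UUIDs, which matches the observation that node 3 could not reach any of the other nodes.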

    I tried restarting all nodes but that did not help.

    In the end I had to reinitialize the original master using gcomm://

    This isn't something we would expect from a cluster. If the crash of one node brings down the whole cluster, it defeats the purpose of having a cluster.

    What could be the cause of this?

    I use Ubuntu 12 and installed all the software using apt-get directly from the Percona repository.
    All 4 servers were set up within the course of 1-2 hours.


    Copy of my.cnf:

    [mysqld]
    datadir=/var/lib/mysql
    wsrep_provider=/usr/lib/libgalera_smm.so
    wsrep_cluster_address=gcomm://xx.xx.xx.x2
    wsrep_slave_threads=8
    #wsrep_sst_method=rsync
    wsrep_sst_method=xtrabackup
    wsrep_replicate_myisam=1
    wsrep_cluster_name=my_db_cluster
    wsrep_node_name=web1
    binlog_format=ROW
    default_storage_engine=InnoDB
    innodb_autoinc_lock_mode=2
    innodb_locks_unsafe_for_binlog=1
    wsrep_sst_auth=root:XXXXXddXX
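
To make explicit what the `wsrep_cluster_address` line tells a node, here is a minimal sketch that splits a `gcomm://` value into the peers it names. An empty list corresponds to the bare `gcomm://` used to start the first node. `gcomm_peers` is a hypothetical helper, and the handling of a comma-separated peer list and a trailing `?option=value` part is an assumption about the address format; the config above names a single peer.

```python
def gcomm_peers(cluster_address: str) -> list:
    """Return the peer addresses listed in a gcomm:// value.

    An empty list means no peers were named (the bootstrap form "gcomm://").
    """
    prefix = "gcomm://"
    if not cluster_address.startswith(prefix):
        raise ValueError("expected an address starting with gcomm://")
    rest = cluster_address[len(prefix):]
    # Assumption: anything after '?' is an option string, not a peer.
    rest = rest.split("?", 1)[0]
    # Assumption: multiple peers, if present, are separated by commas.
    return [peer.strip() for peer in rest.split(",") if peer.strip()]


if __name__ == "__main__":
    # The value from the my.cnf above: a single peer.
    print(gcomm_peers("gcomm://xx.xx.xx.x2"))
    # The bootstrap form used on the first machine: no peers.
    print(gcomm_peers("gcomm://"))
```

With the posted configuration, a restarting node has exactly one address to try when rejoining, which may be worth keeping in mind when reading the reconnect loop in the logs.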