
Cluster down with 1/3 node down


  • Cluster down with 1/3 node down

    Hi,

    We installed and configured a three-node cluster. Synchronization works, but when I stop mysql on one node, the remaining nodes become desynchronized and stop accepting new connections.

    ==================== Configuration of galera: ====================
    wsrep_provider=/usr/lib/libgalera_smm.so
    wsrep_cluster_name="db_cluster"
    wsrep_slave_threads=12
    wsrep_certify_nonPK=1
    wsrep_max_ws_rows=131072
    wsrep_max_ws_size=1073741824
    wsrep_debug=0
    wsrep_convert_LOCK_to_trx=0
    wsrep_retry_autocommit=1
    wsrep_auto_increment_control=1
    wsrep_replicate_myisam=1
    wsrep_drupal_282555_workaround=0
    wsrep_causal_reads=0
    wsrep_sst_method=rsync

    server-id=3
    wsrep_node_address=192.168.10.3
    wsrep_cluster_address="gcomm://"
    wsrep_provider_options="pc.weight=0; gcache.size=8G; evs.keepalive_period=PT3S; evs.inactive_check_period=PT10S; evs.suspect_timeout=PT30S; evs.inactive_timeout=PT1M; evs.consensus_timeout=PT1M; evs.send_window=1024; evs.user_send_window=512;"

    ===========================================================
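
When comparing settings between nodes, it can help to split the wsrep_provider_options string into individual key/value pairs. The following is only a minimal sketch: the function names are made up for illustration, the option string is the one posted above for node 3, and the values for any other node would have to be pasted in from that node's own my.cnf.

```python
def parse_provider_options(raw):
    """Split a wsrep_provider_options string into a {key: value} dict."""
    options = {}
    for item in raw.split(";"):
        item = item.strip()
        if not item:
            continue  # ignore the empty piece left by a trailing ';'
        key, _, value = item.partition("=")
        options[key.strip()] = value.strip()
    return options


def diff_options(a, b):
    """Return {key: (value_in_a, value_in_b)} for every key that differs."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# Option string copied from the configuration posted above (node 3).
NODE3_OPTIONS = parse_provider_options(
    "pc.weight=0; gcache.size=8G; evs.keepalive_period=PT3S; "
    "evs.inactive_check_period=PT10S; evs.suspect_timeout=PT30S; "
    "evs.inactive_timeout=PT1M; evs.consensus_timeout=PT1M; "
    "evs.send_window=1024; evs.user_send_window=512;"
)

for name, value in NODE3_OPTIONS.items():
    print(f"{name} = {value}")
```

Calling diff_options(NODE3_OPTIONS, other_node_options) would then list only the options that are set differently on the two nodes.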

    Can you help us, please?

    EDIT:

    To add some information, here is the log I get on one of the desynchronized nodes (mysql still running):

    2014-02-05 16:02:05 19183 [Note] WSREP: view(view_id(NON_PRIM,e7516d17-8e6a-11e3-b85c-6a6eb0de5350,2) memb {
    e7516d17-8e6a-11e3-b85c-6a6eb0de5350,0
    } joined {
    } left {
    } partitioned {
    fc04cf52-8e6a-11e3-b0f9-93a4b1f2a1d9,0
    })
    2014-02-05 16:02:05 19183 [Note] WSREP: view(view_id(NON_PRIM,e7516d17-8e6a-11e3-b85c-6a6eb0de5350,3) memb {
    e7516d17-8e6a-11e3-b85c-6a6eb0de5350,0
    } joined {
    } left {
    } partitioned {
    fc04cf52-8e6a-11e3-b0f9-93a4b1f2a1d9,0
    })
    2014-02-05 16:02:05 19183 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
    2014-02-05 16:02:05 19183 [Note] WSREP: Flow-control interval: [16, 16]
    2014-02-05 16:02:05 19183 [Note] WSREP: Received NON-PRIMARY.
    2014-02-05 16:02:05 19183 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 192992574)
    2014-02-05 16:02:05 19183 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
    2014-02-05 16:02:05 19183 [Note] WSREP: Flow-control interval: [16, 16]
    2014-02-05 16:02:05 19183 [Note] WSREP: Received NON-PRIMARY.
    2014-02-05 16:02:05 19183 [Note] WSREP: New cluster view: global state: 03b25294-7b07-11e3-ac2e-362fc6d31d98:192992574, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 2
    2014-02-05 16:02:05 19183 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
    2014-02-05 16:02:05 19183 [Note] WSREP: New cluster view: global state: 03b25294-7b07-11e3-ac2e-362fc6d31d98:192992574, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 2
    2014-02-05 16:02:05 19183 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
    2014-02-05 16:02:06 19183 [Note] WSREP: (e7516d17-8e6a-11e3-b85c-6a6eb0de5350, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://192.168.10.1:4567
    2014-02-05 16:02:07 19183 [Note] WSREP: (e7516d17-8e6a-11e3-b85c-6a6eb0de5350, 'tcp://0.0.0.0:4567') reconnecting to fc04cf52-8e6a-11e3-b0f9-93a4b1f2a1d9 (tcp://192.168.10.1:4567), attempt 0
    2014-02-05 16:02:52 19183 [Note] WSREP: (e7516d17-8e6a-11e3-b85c-6a6eb0de5350, 'tcp://0.0.0.0:4567') reconnecting to fc04cf52-8e6a-11e3-b0f9-93a4b1f2a1d9 (tcp://192.168.10.1:4567), attempt 30
    2014-02-05 16:03:37 19183 [Note] WSREP: (e7516d17-8e6a-11e3-b85c-6a6eb0de5350, 'tcp://0.0.0.0:4567') reconnecting to fc04cf52-8e6a-11e3-b0f9-93a4b1f2a1d9 (tcp://192.168.10.1:4567), attempt 60
    2014-02-05 16:04:22 19183 [Note] WSREP: (e7516d17-8e6a-11e3-b85c-6a6eb0de5350, 'tcp://0.0.0.0:4567') reconnecting to fc04cf52-8e6a-11e3-b0f9-93a4b1f2a1d9 (tcp://192.168.10.1:4567), attempt 90

    So this node keeps trying to connect to a node which is down instead of staying in the cluster on its own.
    To make it form a synchronized one-node cluster by itself, I have to force it by issuing:
    mysql> set global wsrep_cluster_address="gcomm://";
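
To spot the moment a node loses primary status without reading the whole error log, the "New COMPONENT" lines shown in the excerpt above can be extracted with a small script. This is a rough sketch only: the regular expression is based on the line format visible in this post, the function name is hypothetical, and the assumption that a healthy node logs "primary = yes" in the same position is not shown in the excerpt.

```python
import re

# Pattern modelled on the "New COMPONENT" lines in the log excerpt above.
# "yes" is assumed to be the healthy counterpart of the "no" seen there.
COMPONENT_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}).*?"
    r"New COMPONENT: primary = (?P<primary>yes|no),.*?"
    r"memb_num = (?P<memb>\d+)"
)


def component_changes(lines):
    """Return (timestamp, is_primary, member_count) for each COMPONENT line."""
    events = []
    for line in lines:
        match = COMPONENT_RE.search(line)
        if match:
            events.append(
                (match["ts"], match["primary"] == "yes", int(match["memb"]))
            )
    return events


# Three lines copied from the log excerpt above.
SAMPLE = [
    "2014-02-05 16:02:05 19183 [Note] WSREP: New COMPONENT: primary = no, "
    "bootstrap = no, my_idx = 0, memb_num = 1",
    "2014-02-05 16:02:05 19183 [Note] WSREP: Received NON-PRIMARY.",
    "2014-02-05 16:02:05 19183 [Note] WSREP: Shifting SYNCED -> OPEN "
    "(TO: 192992574)",
]

for ts, is_primary, members in component_changes(SAMPLE):
    state = "primary" if is_primary else "NON-PRIMARY"
    print(f"{ts}: {state} component with {members} member(s)")
```

On a real node the lines would come from the MySQL error log file rather than from a hard-coded list.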
    Last edited by Delard; 02-05-2014, 10:38 AM.

  • #2
    Up

    To add some more information: I found a way to work around the problem by adding pc.ignore_sb = yes to wsrep_provider_options.

    Does anybody have an idea about this, please?



    • #3
      Do not ignore split brain (pc.ignore_sb) unless it's an emergency.
      How did you set up the cluster? Did you follow the standard procedure? http://www.percona.com/doc/percona-x...tallation.html

      Try this:
      Disable pc.ignore_sb by commenting it out.
      Double-check the my.cnf configuration on all nodes and set the gcomm values accordingly (replace node1, node2, node3 with their IPs):
      node1 -> gcomm://
      node2 -> gcomm://node1,node2,node3
      node3 -> gcomm://node1,node2,node3


      Then, once all nodes have synced, change the gcomm value of node1 to gcomm://node1,node2,node3 and restart mysql on node1.
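
The two phases described above (bootstrap with an empty gcomm:// on node1, then the full member list everywhere) can be written out as a small helper so the values are not typed by hand on each node. This is just an illustrative sketch: the function name is invented, and 192.168.10.2 is an assumed address for the node whose IP does not appear in this thread.

```python
def cluster_addresses(nodes, bootstrap=None):
    """Map each node IP to the wsrep_cluster_address value it should use.

    nodes:     list of node IPs, in any order
    bootstrap: IP of the node that starts with an empty gcomm://, or None
               once the cluster is up and every node lists all members
    """
    full = "gcomm://" + ",".join(nodes)
    return {ip: ("gcomm://" if ip == bootstrap else full) for ip in nodes}


# 192.168.10.1 and 192.168.10.3 appear in the first post;
# 192.168.10.2 is an assumption made for this example.
NODES = ["192.168.10.1", "192.168.10.2", "192.168.10.3"]

print("Initial start (node1 bootstraps):")
for ip, address in cluster_addresses(NODES, bootstrap=NODES[0]).items():
    print(f'  {ip}: wsrep_cluster_address="{address}"')

print("After all nodes have synced:")
for ip, address in cluster_addresses(NODES).items():
    print(f'  {ip}: wsrep_cluster_address="{address}"')
```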

      To check whether the nodes are synced, log in to the mysql prompt on any node and run this command:

      show status like 'wsrep%';
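
The output of that command is long, so here is a hedged sketch of which rows are usually worth looking at. It assumes the rows have already been collected into a dict of variable name to value (how they are fetched is left out), the function name is hypothetical, and the sample values below are invented for illustration rather than taken from this cluster.

```python
def check_wsrep_status(status, expected_size):
    """Return a list of human-readable problems found in wsrep status rows.

    status:        dict built from the rows of: show status like 'wsrep%';
    expected_size: number of nodes the cluster is supposed to have
    """
    problems = []
    if status.get("wsrep_cluster_status") != "Primary":
        problems.append("node is not part of a primary component")
    if int(status.get("wsrep_cluster_size", 0)) != expected_size:
        problems.append("cluster size differs from the expected node count")
    if status.get("wsrep_local_state_comment") != "Synced":
        problems.append("node is not in the Synced state")
    if status.get("wsrep_ready") != "ON":
        problems.append("node is not ready to accept queries")
    return problems


# Invented sample values, for illustration only.
healthy = {
    "wsrep_cluster_status": "Primary",
    "wsrep_cluster_size": "3",
    "wsrep_local_state_comment": "Synced",
    "wsrep_ready": "ON",
}
isolated = {
    "wsrep_cluster_status": "non-Primary",
    "wsrep_cluster_size": "1",
    "wsrep_local_state_comment": "Initialized",
    "wsrep_ready": "OFF",
}

for label, rows in (("healthy", healthy), ("isolated", isolated)):
    found = check_wsrep_status(rows, expected_size=3)
    print(label, "->", found if found else "no problems found")
```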
      Last edited by madhusudan; 02-10-2014, 07:22 AM.



      • #4
        Yeah, I'm not using pc.ignore_sb. I only mentioned it to be more explicit.

        The thing is, I used to leave each node's own IP out of its gcomm:// list, like this:

        node1 -> gcomm://
        node2 -> gcomm://node1,node3
        node3 -> gcomm://node1,node2

        And yes, the nodes were synced with this configuration; I checked via show status like 'wsrep%';
        I will give your config a try to see whether anything changes.
        I also upgraded to the latest stable release, and the problem is the same.

        # dpkg -l | grep percona
        ii percona-toolkit 2.2.6 all Advanced MySQL and system command-line tools
        ii percona-xtrabackup 2.1.7-721-1.wheezy amd64 Open source backup tool for InnoDB and XtraDB
        ii percona-xtradb-cluster-client-5.6 5.6.15-25.3-711.wheezy amd64 Percona Server database client binaries
        ii percona-xtradb-cluster-common-5.6 5.6.15-25.3-711.wheezy amd64 Percona Server database common files (e.g. /etc/mysql/my.cnf)
        ii percona-xtradb-cluster-galera-3.x 189.wheezy amd64 Galera components of Percona XtraDB Cluster
        ii percona-xtradb-cluster-server-5.6 5.6.15-25.3-711.wheezy amd64 Percona Server database server binaries



        • #5
          The reason I suggested using IPs is that no DNS lookup is needed; if DNS fails, the nodes cannot see each other. The only thing you have to make sure of is that the IPs are static.
          Also check for any firewall or other network issue that prevents these nodes from connecting to each other.
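
A quick way to rule out a firewall problem is to try a plain TCP connection from each node to its peers on the group communication port (4567 in the log earlier in this thread). The snippet below is only a sketch with made-up function names; a successful connection shows that the port is reachable, not that Galera itself is healthy.

```python
import socket


def can_connect(host, port=4567, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def unreachable_peers(hosts, port=4567, timeout=3.0):
    """Return the subset of hosts that do not accept connections on port."""
    return [h for h in hosts if not can_connect(h, port, timeout)]
```

For example, running unreachable_peers(["192.168.10.1", "192.168.10.3"]) on one node would return the peers that it cannot reach on port 4567; an empty list means every listed peer accepted the connection.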
