GET 24/7 LIVE HELP NOW

Announcement

Announcement Module
Collapse
No announcement yet.

3 member cluster, 1 node keeps having to rejoin and IST

Page Title Module
Move Remove Collapse
X
Conversation Detail Module
Collapse
  • Filter
  • Time
  • Show
Clear All
new posts

  • 3 member cluster, 1 node keeps having to rejoin and IST

    We have been running a 3 node cluster in production for about 2 months now, upgraded from 5.5 to 5.6 latest and Galera 3x with out issues. We use the third node (call it C) as an async master to a 5.6 slave running in our offsite location over an ssh tunnel. This node (C) does not serve any traffic. Also, we send all writes to Node A, and use Node A and B for reads.

    The problem that we are seeing is that the third node, in this case C always seems to disconnect from the cluster, and rejoins performing an IST. I have no idea why this is happening but it seems also to corrupt the async slave when it happens.

    Here is the full transaction of the cluster member C, leaving and joining the cluster. We have reinstalled this member on new hardware, renamed it, no matter what it always disconnects. The other members are humming along just fine. It seems to be limited to the third joining node.

    ERROR LOG - attachments don't work 2014-05-18 14:31:43 3167 [Note] WSREP: (c008a8e1-de1c-11e3-913f-d6d7b3e7ba76, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://XXX.XXX:4567
    2014-05-18 14:31:43 3167 [Note] WSREP: (c008a8e1-de1c-11e3-913f-d6d7b3e7ba76, 'tcp://0.0.0.0:4567') turning message relay requesting off
    2014-05-18 14:31:43 3167 [Warning] WSREP: last inactive check more than PT1.5S ago (PT5.55984S), skipping check
    2014-05-18 14:31:44 3167 [Note] WSREP: view(view_id(NON_PRIM,1469ea44-daff-11e3-97a5-8bd469d0b8fc,376) memb {
    c008a8e1-de1c-11e3-913f-d6d7b3e7ba76,0
    } joined {
    } left {
    } partitioned {
    1469ea44-daff-11e3-97a5-8bd469d0b8fc,0
    d62ce8fe-dad9-11e3-a835-d29c74758152,0
    })
    2014-05-18 14:31:44 3167 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
    2014-05-18 14:31:44 3167 [Note] WSREP: Flow-control interval: [500, 500]
    2014-05-18 14:31:44 3167 [Note] WSREP: Received NON-PRIMARY.
    2014-05-18 14:31:44 3167 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 58520762)
    2014-05-18 14:31:44 3167 [Note] WSREP: New cluster view: global state: 95890958-8d33-11e3-aea7-ff8c6255f30b:58520762, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 2
    2014-05-18 14:31:44 3167 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
    2014-05-18 14:31:44 3167 [Note] WSREP: view(view_id(NON_PRIM,c008a8e1-de1c-11e3-913f-d6d7b3e7ba76,377) memb {
    c008a8e1-de1c-11e3-913f-d6d7b3e7ba76,0
    } joined {
    } left {
    } partitioned {
    1469ea44-daff-11e3-97a5-8bd469d0b8fc,0
    d62ce8fe-dad9-11e3-a835-d29c74758152,0
    })
    2014-05-18 14:31:44 3167 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
    2014-05-18 14:31:44 3167 [Note] WSREP: Flow-control interval: [500, 500]
    2014-05-18 14:31:44 3167 [Note] WSREP: Received NON-PRIMARY.
    2014-05-18 14:31:44 3167 [Note] WSREP: New cluster view: global state: 95890958-8d33-11e3-aea7-ff8c6255f30b:58520762, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 2
    2014-05-18 14:31:44 3167 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
    2014-05-18 14:31:45 3167 [Note] WSREP: declaring 1469ea44-daff-11e3-97a5-8bd469d0b8fc stable
    2014-05-18 14:31:45 3167 [Note] WSREP: declaring d62ce8fe-dad9-11e3-a835-d29c74758152 stable
    2014-05-18 14:31:46 3167 [Note] WSREP: Node 1469ea44-daff-11e3-97a5-8bd469d0b8fc state prim
    2014-05-18 14:31:46 3167 [Note] WSREP: view(view_id(PRIM,1469ea44-daff-11e3-97a5-8bd469d0b8fc,378) memb {
    1469ea44-daff-11e3-97a5-8bd469d0b8fc,0
    c008a8e1-de1c-11e3-913f-d6d7b3e7ba76,0
    d62ce8fe-dad9-11e3-a835-d29c74758152,0
    } joined {
    } left {
    } partitioned {
    })
    2014-05-18 14:31:46 3167 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 3
    2014-05-18 14:31:46 3167 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
    2014-05-18 14:31:46 3167 [Note] WSREP: STATE EXCHANGE: sent state msg: 23c6bd5c-de99-11e3-9cd7-bb91f8f93683
    2014-05-18 14:31:46 3167 [Note] WSREP: STATE EXCHANGE: got state msg: 23c6bd5c-de99-11e3-9cd7-bb91f8f93683 from 0 (pxc-01)
    2014-05-18 14:31:46 3167 [Note] WSREP: STATE EXCHANGE: got state msg: 23c6bd5c-de99-11e3-9cd7-bb91f8f93683 from 1 (pxc-04)
    2014-05-18 14:31:46 3167 [Note] WSREP: STATE EXCHANGE: got state msg: 23c6bd5c-de99-11e3-9cd7-bb91f8f93683 from 2 (pxc-02)
    2014-05-18 14:31:46 3167 [Note] WSREP: Quorum results:
    version = 3,
    component = PRIMARY,
    conf_id = 295,
    members = 2/3 (joined/total),
    act_id = 58520767,
    last_appl. = 58520707,
    protocols = 0/5/2 (gcs/repl/appl),
    group UUID = 95890958-8d33-11e3-aea7-ff8c6255f30b
    2014-05-18 14:31:46 3167 [Note] WSREP: Flow-control interval: [500, 500]
    2014-05-18 14:31:46 3167 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 58520767)
    2014-05-18 14:31:46 3167 [Note] WSREP: State transfer required:
    Group state: 95890958-8d33-11e3-aea7-ff8c6255f30b:58520767
    Local state: 95890958-8d33-11e3-aea7-ff8c6255f30b:58520762
    2014-05-18 14:31:46 3167 [Note] WSREP: New cluster view: global state: 95890958-8d33-11e3-aea7-ff8c6255f30b:58520767, view# 296: Primary, number of nodes: 3, my index: 1, protocol version 2
    2014-05-18 14:31:46 3167 [Warning] WSREP: Gap in state sequence. Need state transfer.
    2014-05-18 14:31:48 3167 [Note] WSREP: You have configured 'xtrabackup' state snapshot transfer method which cannot be performed on a running server. Wsrep provider won't be able to fall back to it if other means of state transfer are unavailable. In that case you will need to restart the server.
    2014-05-18 14:31:48 3167 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
    2014-05-18 14:31:48 3167 [Note] WSREP: REPL Protocols: 5 (3, 1)
    2014-05-18 14:31:48 3167 [Note] WSREP: Service thread queue flushed.
    2014-05-18 14:31:48 3167 [Note] WSREP: Assign initial position for certification: 58520767, protocol version: 3
    2014-05-18 14:31:48 3167 [Note] WSREP: Service thread queue flushed.
    2014-05-18 14:31:48 3167 [Warning] WSREP: Releasing seqno 58520762 before 58520763 was assigned.
    2014-05-18 14:31:48 3167 [Note] WSREP: Prepared IST receiver, listening at: tcp://XXX.XXX:4568
    2014-05-18 14:31:48 3167 [Note] WSREP: Member 1.0 (pxc-04) requested state transfer from 'pxc-02'. Selected 2.0 (pxc-02)(SYNCED) as donor.
    2014-05-18 14:31:48 3167 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 58520784)
    2014-05-18 14:31:48 3167 [Note] WSREP: Requesting state transfer: success, donor: 2
    2014-05-18 14:31:48 3167 [Note] WSREP: GCache DEBUG: RingBuffer::seqno_reset(): discarded 0 bytes
    2014-05-18 14:31:48 3167 [Note] WSREP: GCache DEBUG: RingBuffer::seqno_reset(): found 1/17 locked buffers
    2014-05-18 14:31:48 3167 [Note] WSREP: Receiving IST: 5 writesets, seqnos 58520762-58520767
    2014-05-18 14:31:48 3167 [Note] WSREP: IST received: 95890958-8d33-11e3-aea7-ff8c6255f30b:58520767
    2014-05-18 14:31:48 3167 [Note] WSREP: 1.0 (pxc-04): State transfer from 2.0 (pxc-02) complete.
    2014-05-18 14:31:48 3167 [Note] WSREP: Shifting JOINER -> JOINED (TO: 58520784)
    2014-05-18 14:31:48 3167 [Note] WSREP: 2.0 (pxc-02): State transfer to 1.0 (pxc-04) complete.
    2014-05-18 14:31:48 3167 [Note] WSREP: Member 1.0 (pxc-04) synced with group.
    2014-05-18 14:31:48 3167 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 58520784)
    2014-05-18 14:31:48 3167 [Note] WSREP: Member 2.0 (pxc-02) synced with group.
    2014-05-18 14:31:48 3167 [Note] WSREP: Synchronized with group, ready for connections
    2014-05-18 14:31:48 3167 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.




  • #2
    This is my.cnf on member C
    [mysql]

    port = 3306
    socket = /var/run/mysqld/mysqld.sock

    [client]

    user = redacted
    password = redacted

    [mysqld_safe]

    [mysqld]
    bind-address = 192.168.xxx.xxx
    port = 3306

    # CLUSTER #
    server_id = redacted
    binlog_format = ROW
    log_bin = mysql-bin
    wsrep_cluster_address = gcomm://192.168.xxx.xxx,192.168.xxx.xxx
    # wsrep_cluster_address = gcomm:// # do this for the first one only
    wsrep_sst_donor = pxc-02
    wsrep_node_address = 192.168.xxx.xxx
    wsrep_provider = /usr/lib/libgalera_smm.so
    wsrep_provider_options = "gcache.size=3G; gcs.fc_limit=500; gcs.fc_master_slave=YES; gcs.fc_factor=1.0"
    wsrep_slave_threads = 16
    wsrep_cluster_name = REDACTED
    wsrep_sst_method = xtrabackup
    wsrep_sst_auth = redacted:redacted
    wsrep_node_name = pxc-04
    log_slave_updates
    innodb_locks_unsafe_for_binlog = 1
    innodb_autoinc_lock_mode = 2

    # GENERAL #
    user = mysql
    default-storage-engine = InnoDB
    socket = /var/run/mysqld/mysqld.sock
    pid-file = /var/run/mysqld/mysql.pid

    # MyISAM #
    key-buffer-size = 32M
    myisam-recover = FORCE,BACKUP

    # SAFETY #
    max-allowed-packet = 16M
    max-connect-errors = 1000000
    skip-name-resolve
    sql-mode = STRICT_TRANS_TABLES,ERROR_FOR_DIVISION_BY_ZERO,NO_ AUTO_CREATE_USER,NO_AUTO_VALUE_ON_ZERO,NO_ENGINE_S UBSTITUTION,NO_ZERO_DATE,NO_ZERO_IN_DATE
    sysdate-is-now = 1
    #innodb = FORCE

    # DATA STORAGE #
    datadir = /data/mysql/

    # BINARY LOGGING #
    log-bin = /data/mysql/mysql-bin
    expire-logs-days = 10
    sync-binlog = 0
    gtid_mode = on #Global Transaction ID for async slaves
    enforce_gtid_consistency = 1
    # CACHES AND LIMITS #
    tmp-table-size = 32M
    max-heap-table-size = 32M
    query-cache-type = 0
    query-cache-size = 0
    max-connections = 1000
    thread-cache-size = 50
    open-files-limit = 65535
    table-definition-cache = 4096
    table-open-cache = 4096

    # INNODB #
    innodb-flush-method = O_DIRECT
    innodb-log-files-in-group = 2
    innodb-log-file-size = 512M
    innodb-flush-log-at-trx-commit = 1
    innodb-file-per-table = 1
    innodb-buffer-pool-size = 26G

    # LOGGING #
    log-error = /var/log/mysql/mysql-error.log
    log-queries-not-using-indexes = 1
    slow-query-log = 1
    slow-query-log-file = /var/log/mysql/mysql-slow.log


    Comment


    • #3
      This is a corresponding connection from the donor side of the transaction.
      2014-05-18 19:36:45 2897 [Note] WSREP: (d62ce8fe-dad9-11e3-a835-d29c74758152, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://192.168.X.X:4567
      2014-05-18 19:36:46 2897 [Note] WSREP: (d62ce8fe-dad9-11e3-a835-d29c74758152, 'tcp://0.0.0.0:4567') reconnecting to c008a8e1-de1c-11e3-913f-d6d7b3e7ba76 (tcp://192.168.X.X:4567), attempt 0
      2014-05-18 19:36:48 2897 [Note] WSREP: (d62ce8fe-dad9-11e3-a835-d29c74758152, 'tcp://0.0.0.0:4567') turning message relay requesting off
      2014-05-18 19:36:48 2897 [Note] WSREP: evs:roto(d62ce8fe-dad9-11e3-a835-d29c74758152, GATHER, view_id(REG,1469ea44-daff-11e3-97a5-8bd469d0b8fc,394)) suspecting node: c008a8e1-de1c-11e3-913f-d6d7b3e7ba76
      2014-05-18 19:36:48 2897 [Note] WSREP: declaring 1469ea44-daff-11e3-97a5-8bd469d0b8fc stable
      2014-05-18 19:36:48 2897 [Note] WSREP: Node 1469ea44-daff-11e3-97a5-8bd469d0b8fc state prim
      2014-05-18 19:36:48 2897 [Note] WSREP: view(view_id(PRIM,1469ea44-daff-11e3-97a5-8bd469d0b8fc,395) memb {
      1469ea44-daff-11e3-97a5-8bd469d0b8fc,0
      d62ce8fe-dad9-11e3-a835-d29c74758152,0
      } joined {
      } left {
      } partitioned {
      c008a8e1-de1c-11e3-913f-d6d7b3e7ba76,0
      })
      2014-05-18 19:36:48 2897 [Note] WSREP: forgetting c008a8e1-de1c-11e3-913f-d6d7b3e7ba76 (tcp://192.168.31.58:4567)
      2014-05-18 19:36:48 2897 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 2
      2014-05-18 19:36:48 2897 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
      2014-05-18 19:36:48 2897 [Note] WSREP: STATE EXCHANGE: sent state msg: c0c6d62a-dec3-11e3-8037-5f9383edae98
      2014-05-18 19:36:48 2897 [Note] WSREP: STATE EXCHANGE: got state msg: c0c6d62a-dec3-11e3-8037-5f9383edae98 from 0 (pxc-01)
      2014-05-18 19:36:48 2897 [Note] WSREP: STATE EXCHANGE: got state msg: c0c6d62a-dec3-11e3-8037-5f9383edae98 from 1 (pxc-02)
      2014-05-18 19:36:48 2897 [Note] WSREP: Quorum results:
      version = 3,
      component = PRIMARY,
      conf_id = 307,
      members = 2/2 (joined/total),
      act_id = 58625222,
      last_appl. = 58625196,
      protocols = 0/5/2 (gcs/repl/appl),
      group UUID = 95890958-8d33-11e3-aea7-ff8c6255f30b
      2014-05-18 19:36:48 2897 [Note] WSREP: Flow-control interval: [23, 23]
      2014-05-18 19:36:48 2897 [Note] WSREP: New cluster view: global state: 95890958-8d33-11e3-aea7-ff8c6255f30b:58625222, view# 308: Primary, number of nodes: 2, my index: 1, protocol version 2
      2014-05-18 19:36:48 2897 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
      2014-05-18 19:36:48 2897 [Note] WSREP: REPL Protocols: 5 (3, 1)
      2014-05-18 19:36:48 2897 [Note] WSREP: Assign initial position for certification: 58625222, protocol version: 3
      2014-05-18 19:36:48 2897 [Note] WSREP: Service thread queue flushed.
      2014-05-18 19:36:50 2897 [Warning] WSREP: discarding established (time wait) c008a8e1-de1c-11e3-913f-d6d7b3e7ba76 (tcp://192.168.31.58:4567)
      2014-05-18 19:36:51 2897 [Warning] WSREP: discarding established (time wait) c008a8e1-de1c-11e3-913f-d6d7b3e7ba76 (tcp://192.168.31.58:4567)
      2014-05-18 19:36:52 2897 [Warning] WSREP: discarding established (time wait) c008a8e1-de1c-11e3-913f-d6d7b3e7ba76 (tcp://192.168.31.58:4567)
      2014-05-18 19:36:53 2897 [Note] WSREP: cleaning up c008a8e1-de1c-11e3-913f-d6d7b3e7ba76 (tcp://192.168.31.58:4567)
      2014-05-18 19:36:54 2897 [Note] WSREP: (d62ce8fe-dad9-11e3-a835-d29c74758152, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://192.168.31.58:4567
      2014-05-18 19:36:54 2897 [Note] WSREP: (d62ce8fe-dad9-11e3-a835-d29c74758152, 'tcp://0.0.0.0:4567') turning message relay requesting off
      2014-05-18 19:36:55 2897 [Note] WSREP: declaring 1469ea44-daff-11e3-97a5-8bd469d0b8fc stable
      2014-05-18 19:36:55 2897 [Note] WSREP: declaring c008a8e1-de1c-11e3-913f-d6d7b3e7ba76 stable
      2014-05-18 19:36:55 2897 [Note] WSREP: Node 1469ea44-daff-11e3-97a5-8bd469d0b8fc state prim
      2014-05-18 19:36:55 2897 [Note] WSREP: view(view_id(PRIM,1469ea44-daff-11e3-97a5-8bd469d0b8fc,396) memb {
      1469ea44-daff-11e3-97a5-8bd469d0b8fc,0
      c008a8e1-de1c-11e3-913f-d6d7b3e7ba76,0
      d62ce8fe-dad9-11e3-a835-d29c74758152,0
      } joined {
      } left {
      } partitioned {
      })
      2014-05-18 19:36:55 2897 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 2, memb_num = 3
      2014-05-18 19:36:55 2897 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
      2014-05-18 19:36:55 2897 [Note] WSREP: STATE EXCHANGE: sent state msg: c4e9cfd1-dec3-11e3-a389-3b18183d7405
      2014-05-18 19:36:55 2897 [Note] WSREP: STATE EXCHANGE: got state msg: c4e9cfd1-dec3-11e3-a389-3b18183d7405 from 0 (pxc-01)
      2014-05-18 19:36:55 2897 [Note] WSREP: STATE EXCHANGE: got state msg: c4e9cfd1-dec3-11e3-a389-3b18183d7405 from 1 (pxc-04)
      2014-05-18 19:36:55 2897 [Note] WSREP: STATE EXCHANGE: got state msg: c4e9cfd1-dec3-11e3-a389-3b18183d7405 from 2 (pxc-02)
      2014-05-18 19:36:55 2897 [Note] WSREP: Quorum results:
      version = 3,
      component = PRIMARY,
      conf_id = 308,
      members = 2/3 (joined/total),
      act_id = 58625267,
      last_appl. = 58625196,
      protocols = 0/5/2 (gcs/repl/appl),
      group UUID = 95890958-8d33-11e3-aea7-ff8c6255f30b
      2014-05-18 19:36:55 2897 [Note] WSREP: Flow-control interval: [28, 28]
      2014-05-18 19:36:55 2897 [Note] WSREP: New cluster view: global state: 95890958-8d33-11e3-aea7-ff8c6255f30b:58625267, view# 309: Primary, number of nodes: 3, my index: 2, protocol version 2
      2014-05-18 19:36:55 2897 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
      2014-05-18 19:36:55 2897 [Note] WSREP: REPL Protocols: 5 (3, 1)
      2014-05-18 19:36:55 2897 [Note] WSREP: Assign initial position for certification: 58625267, protocol version: 3
      2014-05-18 19:36:55 2897 [Note] WSREP: Service thread queue flushed.
      2014-05-18 19:36:57 2897 [Note] WSREP: Node 1.0 (pxc-04) requested state transfer from 'pxc-02'. Selected 2.0 (pxc-02)(SYNCED) as donor.
      2014-05-18 19:36:57 2897 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 58625283)
      2014-05-18 19:36:57 2897 [Note] WSREP: IST request: 95890958-8d33-11e3-aea7-ff8c6255f30b:58625215-58625267|tcp://192.168.X.X:4568
      2014-05-18 19:36:57 2897 [Note] WSREP: async IST sender starting to serve tcp://192.168.X.X:4568 sending 58625216-58625267
      2014-05-18 19:36:57 2897 [Note] WSREP: async IST sender served
      2014-05-18 19:36:57 2897 [Note] WSREP: 2.0 (pxc-02): State transfer to 1.0 (pxc-04) complete.
      2014-05-18 19:36:57 2897 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 58625283)
      2014-05-18 19:36:57 2897 [Note] WSREP: Member 2 (pxc-02) synced with group.
      2014-05-18 19:36:57 2897 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 58625283)
      2014-05-18 19:36:57 2897 [Note] WSREP: Synchronized with group, ready for connections
      2014-05-18 19:36:57 2897 [Note] WSREP: 1.0 (pxc-04): State transfer from 2.0 (pxc-02) complete.
      2014-05-18 19:36:57 2897 [Note] WSREP: Member 1 (pxc-04) synced with group.

      Comment


      • #4
        The slave end of failing:
        Last_IO_Errno: 1593
        Last_IO_Error: The slave I/O thread stops because a fatal error is encountered when it tried to SELECT @master_binlog_checksum. Error: Unknown command

        Comment

        Working...
        X