Announcement

Announcement Module
Collapse
No announcement yet.

Percona XtraDB Cluster Node Crashing Once a Day

Page Title Module
Move Remove Collapse
X
Conversation Detail Module
Collapse
  • Filter
  • Time
  • Show
Clear All
new posts

  • Percona XtraDB Cluster Node Crashing Once a Day

    I have a 3 node Percona cluster setup on EC2 instances.

    Node 1 of the cluster (orginaly the primary node) is now failing once a day, and I can't figure out from the error log what is causing it.

    Log exert:

    Code:
    130604  7:32:18 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 2, memb_num = 3
    130604  7:32:18 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
    130604  7:32:18 [Note] WSREP: STATE EXCHANGE: sent state msg: e2e2cfa1-cce8-11e2-0800-e0694f428b46
    130604  7:32:18 [Note] WSREP: STATE EXCHANGE: got state msg: e2e2cfa1-cce8-11e2-0800-e0694f428b46 from 0 (node2)
    130604  7:32:18 [Note] WSREP: STATE EXCHANGE: got state msg: e2e2cfa1-cce8-11e2-0800-e0694f428b46 from 2 (node1)
    130604  7:32:18 [Note] WSREP: STATE EXCHANGE: got state msg: e2e2cfa1-cce8-11e2-0800-e0694f428b46 from 1 (node3)
    130604  7:32:18 [Note] WSREP: Quorum results:
        version    = 2,
        component  = PRIMARY,
        conf_id    = 359,
        members    = 3/3 (joined/total),
        act_id     = 433339,
        last_appl. = 0,
        protocols  = 0/4/2 (gcs/repl/appl),
        group UUID = e38d9bfc-7d4e-11e2-0800-9446319d18cf
    130604  7:32:18 [Note] WSREP: Flow-control interval: [28, 28]
    130604  7:32:18 [Note] WSREP: Restored state OPEN -> SYNCED (433339)
    130604  7:32:18 [Note] WSREP: New cluster view: global state: e38d9bfc-7d4e-11e2-0800-9446319d18cf:433339, view# 360: Primary, number of nodes: 3, my index: 2, protocol version 2
    130604  7:32:18 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
    130604  7:32:18 [Note] WSREP: Assign initial position for certification: 433339, protocol version: 2
    130604  7:32:18 [Note] WSREP: Synchronized with group, ready for connections
    130604  7:32:18 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
    130604  7:36:57 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
    130604  7:37:04 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
    130604  8:07:09 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
    130604  8:07:09 [Note] WSREP: view(view_id(NON_PRIM,0ccb04a4-c970-11e2-0800-838856078bd7,477) memb {
        d44ec52a-cca5-11e2-0800-6a48c84c66c6,
    } joined {
    } left {
    } partitioned {
        0ccb04a4-c970-11e2-0800-838856078bd7,
        11f53dbb-c971-11e2-0800-987abecafd36,
    })
    130604  8:07:09 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
    130604  8:07:09 [Note] WSREP: Flow-control interval: [16, 16]
    130604  8:07:09 [Note] WSREP: Received NON-PRIMARY.
    130604  8:07:09 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 433353)
    130604  8:07:09 [Note] WSREP: New cluster view: global state: e38d9bfc-7d4e-11e2-0800-9446319d18cf:433353, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 2
    130604  8:07:09 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
    130604  8:07:09 [Note] WSREP: view(view_id(NON_PRIM,d44ec52a-cca5-11e2-0800-6a48c84c66c6,478) memb {
        d44ec52a-cca5-11e2-0800-6a48c84c66c6,
    } joined {
    } left {
    } partitioned {
        0ccb04a4-c970-11e2-0800-838856078bd7,
        11f53dbb-c971-11e2-0800-987abecafd36,
    })
    130604  8:07:09 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
    130604  8:07:09 [Note] WSREP: Flow-control interval: [16, 16]
    130604  8:07:09 [Note] WSREP: Received NON-PRIMARY.
    130604  8:07:09 [Note] WSREP: New cluster view: global state: e38d9bfc-7d4e-11e2-0800-9446319d18cf:433353, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 2
    130604  8:07:09 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
    130604  8:07:10 [Note] WSREP: declaring 0ccb04a4-c970-11e2-0800-838856078bd7 stable
    130604  8:07:10 [Note] WSREP: declaring 11f53dbb-c971-11e2-0800-987abecafd36 stable
    130604  8:07:10 [Note] WSREP: Node 0ccb04a4-c970-11e2-0800-838856078bd7 state prim
    130604  8:07:10 [Note] WSREP: view(view_id(PRIM,0ccb04a4-c970-11e2-0800-838856078bd7,479) memb {
        0ccb04a4-c970-11e2-0800-838856078bd7,
        11f53dbb-c971-11e2-0800-987abecafd36,
        d44ec52a-cca5-11e2-0800-6a48c84c66c6,
    } joined {
    } left {
    } partitioned {
    })
    130604  8:07:10 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 2, memb_num = 3
    130604  8:07:10 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
    130604  8:07:10 [Note] WSREP: STATE EXCHANGE: sent state msg: c19d7175-cced-11e2-0800-74f173924529
    130604  8:07:10 [Note] WSREP: STATE EXCHANGE: got state msg: c19d7175-cced-11e2-0800-74f173924529 from 0 (node2)
    130604  8:07:10 [Note] WSREP: STATE EXCHANGE: got state msg: c19d7175-cced-11e2-0800-74f173924529 from 1 (node3)
    130604  8:07:10 [Note] WSREP: STATE EXCHANGE: got state msg: c19d7175-cced-11e2-0800-74f173924529 from 2 (node1)
    130604  8:07:10 [Note] WSREP: Quorum results:
        version    = 2,
        component  = PRIMARY,
        conf_id    = 360,
        members    = 2/3 (joined/total),
        act_id     = 433354,
        last_appl. = 0,
        protocols  = 0/4/2 (gcs/repl/appl),
        group UUID = e38d9bfc-7d4e-11e2-0800-9446319d18cf
    130604  8:07:10 [Note] WSREP: Flow-control interval: [28, 28]
    130604  8:07:10 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 433354)
    130604  8:07:10 [Note] WSREP: State transfer required:
        Group state: e38d9bfc-7d4e-11e2-0800-9446319d18cf:433354
        Local state: e38d9bfc-7d4e-11e2-0800-9446319d18cf:433353
    130604  8:07:10 [Note] WSREP: New cluster view: global state: e38d9bfc-7d4e-11e2-0800-9446319d18cf:433354, view# 361: Primary, number of nodes: 3, my index: 2, protocol version 2
    130604  8:07:10 [Warning] WSREP: Gap in state sequence. Need state transfer.
    130604  8:07:16 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
    130604  8:07:16 [Note] WSREP: You have configured 'rsync' state snapshot transfer method which cannot be performed on a running server. Wsrep provider won't be able to fall back to it if other means of state transfer are unavailable. In that case you will need to restart the server.
    130604  8:07:16 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
    130604  8:07:16 [Note] WSREP: Assign initial position for certification: 433354, protocol version: 2
    130604  8:07:16 [Note] WSREP: Prepared IST receiver, listening at: tcp://10.58.249.142:4568
    130604  8:07:17 [Note] WSREP: (d44ec52a-cca5-11e2-0800-6a48c84c66c6, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://192.168.0.3:4567 tcp://192.168.0.2:4567
    130604  8:07:17 [Note] WSREP: (d44ec52a-cca5-11e2-0800-6a48c84c66c6, 'tcp://0.0.0.0:4567') reconnecting to 11f53dbb-c971-11e2-0800-987abecafd36 (tcp://192.168.0.3:4567), attempt 0
    130604  8:07:17 [Note] WSREP: (d44ec52a-cca5-11e2-0800-6a48c84c66c6, 'tcp://0.0.0.0:4567') reconnecting to 0ccb04a4-c970-11e2-0800-838856078bd7 (tcp://192.168.0.2:4567), attempt 0
    130604  8:07:20 [Note] WSREP: (d44ec52a-cca5-11e2-0800-6a48c84c66c6, 'tcp://0.0.0.0:4567') reconnecting to 11f53dbb-c971-11e2-0800-987abecafd36 (tcp://192.168.0.3:4567), attempt 0
    130604  8:07:20 [Note] WSREP: (d44ec52a-cca5-11e2-0800-6a48c84c66c6, 'tcp://0.0.0.0:4567') reconnecting to 0ccb04a4-c970-11e2-0800-838856078bd7 (tcp://192.168.0.2:4567), attempt 0
    130604  8:07:20 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
    130604  8:07:21 [Note] WSREP: (d44ec52a-cca5-11e2-0800-6a48c84c66c6, 'tcp://0.0.0.0:4567') reconnecting to 11f53dbb-c971-11e2-0800-987abecafd36 (tcp://192.168.0.3:4567), attempt 0
    130604  8:07:21 [Note] WSREP: (d44ec52a-cca5-11e2-0800-6a48c84c66c6, 'tcp://0.0.0.0:4567') reconnecting to 0ccb04a4-c970-11e2-0800-838856078bd7 (tcp://192.168.0.2:4567), attempt 0
    130604  8:07:21 [Note] WSREP: (d44ec52a-cca5-11e2-0800-6a48c84c66c6, 'tcp://0.0.0.0:4567') turning message relay requesting off
    130604  8:07:22 [Note] WSREP: evs::proto(d44ec52a-cca5-11e2-0800-6a48c84c66c6, OPERATIONAL, view_id(REG,0ccb04a4-c970-11e2-0800-838856078bd7,479)) suspecting node: 0ccb04a4-c970-11e2-0800-838856078bd7
    130604  8:07:22 [Note] WSREP: evs::proto(d44ec52a-cca5-11e2-0800-6a48c84c66c6, OPERATIONAL, view_id(REG,0ccb04a4-c970-11e2-0800-838856078bd7,479)) suspecting node: 11f53dbb-c971-11e2-0800-987abecafd36
    130604  8:07:23 [Note] WSREP: view(view_id(NON_PRIM,0ccb04a4-c970-11e2-0800-838856078bd7,479) memb {
        d44ec52a-cca5-11e2-0800-6a48c84c66c6,
    } joined {
    } left {
    } partitioned {
        0ccb04a4-c970-11e2-0800-838856078bd7,
        11f53dbb-c971-11e2-0800-987abecafd36,
    })
    130604  8:07:23 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
    130604 8:07:23 [Note] WSREP: Flow-control interval: [16, 16]
    130604 8:07:23 [Note] WSREP: Received NON-PRIMARY.
    130604 8:07:23 [Note] WSREP: Shifting PRIMARY -> OPEN (TO: 433354)
    130604 8:07:23 [ERROR] WSREP: gcs/src/gcs_group.c:gcs_group_handle_state_request():950: Requesting state transfer while in NON-PRIMARY. Ignoring.
    130604 8:07:23 [ERROR] WSREP: Requesting state transfer failed: -125(Operation canceled)
    130604 8:07:23 [ERROR] WSREP: State transfer request failed unrecoverably: 125 (Operation canceled). Most likely it is due to inability to communicate with the cluster primary$
    130604 8:07:23 [Note] WSREP: Closing send monitor...
    130604 8:07:23 [Note] WSREP: Closed send monitor.
    130604 8:07:23 [Note] WSREP: gcomm: terminating thread
    130604 8:07:23 [Note] WSREP: gcomm: joining thread
    130604 8:07:23 [Note] WSREP: view(view_id(NON_PRIM,d44ec52a-cca5-11e2-0800-6a48c84c66c6,480) memb {
    d44ec52a-cca5-11e2-0800-6a48c84c66c6,
    } joined {
    } left {
    } partitioned {
    0ccb04a4-c970-11e2-0800-838856078bd7,
    11f53dbb-c971-11e2-0800-987abecafd36,
    })
    130604 8:07:23 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
    130604 8:07:23 [Note] WSREP: gcomm: closing backend
    130604 8:07:23 [Note] WSREP: view((empty))
    130604 8:07:23 [Note] WSREP: gcomm: closed
    130604 8:07:23 [Note] WSREP: Flow-control interval: [16, 16]
    130604 8:07:23 [Note] WSREP: Received NON-PRIMARY.
    130604 8:07:23 [Note] WSREP: Received self-leave message.
    130604 8:07:23 [Note] WSREP: Flow-control interval: [0, 0]
    130604 8:07:23 [Note] WSREP: Received SELF-LEAVE. Closing connection.
    130604 8:07:23 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 433354)
    130604 8:07:23 [Note] WSREP: RECV thread exiting 0: Success
    130604 8:07:23 [Note] WSREP: recv_thread() joined.
    130604 8:07:23 [Note] WSREP: Closing slave action queue.
    130604 8:07:23 [Note] WSREP: /usr/sbin/mysqld: Terminated.
    If i restart the node, it re-joins the cluster fine and runs without errors until the next crash at approximately the same amount of time lapse from the restart as the last crash (hope that makes sense!).

    Any help or suggestions as to why this is occurring appreciated.

    ​Eli
    Last edited by elijahpaul; 06-06-2013, 09:01 AM.

  • #2
    It'd be good to see what is in the log before this (if anything). It looks to me like the node lost connection tot he cluster, tried to reconnect, couldn't get an IST, then shutdown as a result. Doesn't seem like a "crash" per-se.

    Comment


    • #3
      Yep. You're right. 'Crash' is not really the right word. I should really re-title it to 'Cluster Node Losing Connection Once a Day'.

      I've updated my original post with some of the log, before the node loses its connection.

      Hope that helps.

      Comment

      Working...
      X