  • "exception in PC" on node 1 -> whole 3 node cluster froze

    I have a 3-node setup: nodes 1 and 2 in datacenter A, node 3 in datacenter B.

    Today, after a lot of partitioning and resyncing of the cluster, node 1 failed with the error shown below under "ERROR ON NODE 1".

    Thereafter, the whole cluster froze, with nodes 2 and 3 repeatedly logging messages like the following:
    2014-02-16 12:57:47 2595 [Note] WSREP: Nodes 62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d are still in unknown state, unable to rebootstrap new prim

    Does anyone have ideas on how to solve this? I was wondering whether forcing a new primary component on the surviving nodes would help (see the sketch after the log below), but I'm not sure whether that is safe in this state.

    Thanks!

    Frank.


    ERROR ON NODE 1:

    2014-02-16 12:41:02 2504 [ERROR] WSREP: caught exception in PC, state dump to stderr follows:
    pc::Proto{uuid=62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d,start_prim=0,npvo=0,ignore_sb=0,ignore_quorum=0,state=1,last_sent_seq=4,checksum=0,instances=
    62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d,prim=1,un=1,last_seq=4,last_prim=view_id(PRIM,62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d,46),to_seq=151253,weight=1,segment=1
    6811f0c9-9367-11e3-9044-cb32ea280bb8,prim=0,un=0,last_seq=1,last_prim=view_id(PRIM,62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d,40),to_seq=151199,weight=1,segment=2
    bbce851d-9367-11e3-8a0e-9a5107ef8b9f,prim=1,un=1,last_seq=43,last_prim=view_id(PRIM,62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d,46),to_seq=151253,weight=1,segment=1
    ,state_msgs=
    62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d,pcmsg{ type=STATE, seq=0, flags= 0, node_map { 62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d,prim=1,un=0,last_seq=4,last_prim=view_id(PRIM,62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d,46),to_seq=151253,weight=1,segment=1
    bbce851d-9367-11e3-8a0e-9a5107ef8b9f,prim=1,un=0,last_seq=43,last_prim=view_id(PRIM,62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d,46),to_seq=151253,weight=1,segment=1
    }}
    6811f0c9-9367-11e3-9044-cb32ea280bb8,pcmsg{ type=STATE, seq=0, flags= 0, node_map { 62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d,prim=1,un=1,last_seq=4,last_prim=view_id(PRIM,62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d,40),to_seq=151199,weight=1,segment=1
    6811f0c9-9367-11e3-9044-cb32ea280bb8,prim=0,un=0,last_seq=1,last_prim=view_id(PRIM,62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d,40),to_seq=151199,weight=1,segment=2
    bbce851d-9367-11e3-8a0e-9a5107ef8b9f,prim=1,un=1,last_seq=2,last_prim=view_id(PRIM,62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d,40),to_seq=151199,weight=1,segment=1
    }}
    bbce851d-9367-11e3-8a0e-9a5107ef8b9f,pcmsg{ type=STATE, seq=0, flags= 0, node_map { 62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d,prim=1,un=0,last_seq=4,last_prim=view_id(PRIM,62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d,46),to_seq=151253,weight=1,segment=1
    bbce851d-9367-11e3-8a0e-9a5107ef8b9f,prim=1,un=0,last_seq=43,last_prim=view_id(PRIM,62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d,46),to_seq=151253,weight=1,segment=1
    }}
    ,current_view=view(view_id(REG,62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d,49) memb {
    62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d,0
    6811f0c9-9367-11e3-9044-cb32ea280bb8,0
    bbce851d-9367-11e3-8a0e-9a5107ef8b9f,0
    } joined {
    6811f0c9-9367-11e3-9044-cb32ea280bb8,0
    } left {
    } partitioned {
    }),pc_view=view(view_id(PRIM,62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d,46) memb {
    62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d,1
    bbce851d-9367-11e3-8a0e-9a5107ef8b9f,1
    } joined {
    } left {
    } partitioned {
    }),mtu=32636}
    2014-02-16 12:41:02 2504 [Note] WSREP: evs::msg{version=0,type=1,user_type=255,order=4,seq=0,seq_range=0,aru_seq=-1,flags=4,source=bbce851d-9367-11e3-8a0e-9a5107ef8b9f,source_view_id=view_id(REG,62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d,49),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=3202335,node_list=()
    } 116
    2014-02-16 12:41:02 2504 [ERROR] WSREP: exception caused by message: evs::msg{version=0,type=3,user_type=255,order=1,seq=0,seq_range=-1,aru_seq=0,flags=4,source=6811f0c9-9367-11e3-9044-cb32ea280bb8,source_view_id=view_id(REG,62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d,49),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=3202238,node_list=()
    }
    state after handling message: evs::proto(evs::proto(62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d, OPERATIONAL, view_id(REG,62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d,49)), OPERATIONAL) {
    current_view=view(view_id(REG,62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d,49) memb {
    62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d,0
    6811f0c9-9367-11e3-9044-cb32ea280bb8,0
    bbce851d-9367-11e3-8a0e-9a5107ef8b9f,0
    } joined {
    } left {
    } partitioned {
    }),
    input_map=evs::input_map: {aru_seq=0,safe_seq=0,node_index=node: {idx=0,range=[1,0],safe_seq=0} node: {idx=1,range=[1,0],safe_seq=0} node: {idx=2,range=[1,0],safe_seq=0} },
    fifo_seq=3203297,
    last_sent=0,
    known={
    62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d,evs::node{operational=1,suspected=0,installed=1,fifo_seq=-1,}
    6811f0c9-9367-11e3-9044-cb32ea280bb8,evs::node{operational=1,suspected=0,installed=1,fifo_seq=3202238,}
    bbce851d-9367-11e3-8a0e-9a5107ef8b9f,evs::node{operational=1,suspected=0,installed=1,fifo_seq=3202337,}
    }
    }2014-02-16 12:41:02 2504 [ERROR] WSREP: exception from gcomm, backend must be restarted: msg_state == local_state: 62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d node 62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d prim state message and local states not consistent: msg node prim=1,un=0,last_seq=4,last_prim=view_id(PRIM,62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d,46),to_seq=151253,weight=1,segment=1 local state prim=1,un=1,last_seq=4,last_prim=view_id(PRIM,62b5e5f0-9367-11e3-a0ac-abaf5f8dac6d,46),to_seq=151253,weight=1,segment=1 (FATAL)
    at gcomm/src/pc_proto.cpp:validate_state_msgs():607
    2014-02-16 12:41:02 2504 [Note] WSREP: Received self-leave message.
    2014-02-16 12:41:02 2504 [Note] WSREP: Flow-control interval: [0, 0]
    2014-02-16 12:41:02 2504 [Note] WSREP: Received SELF-LEAVE. Closing connection.
    2014-02-16 12:41:02 2504 [Note] WSREP: Shifting SYNCED -> CLOSED (TO: 2988725)
    2014-02-16 12:41:02 2504 [Note] WSREP: RECV thread exiting 0: Success
    2014-02-16 12:41:02 2504 [Note] WSREP: New cluster view: global state: 5dd126ae-2944-11e3-9d8e-a65147a95bff:2988725, view# -1: non-Primary, number of nodes: 0, my index: -1, protocol version 2
    2014-02-16 12:41:17 2504 [Note] WSREP: applier thread exiting (code:0)
    2014-02-16 16:17:15 2504 [Warning] WSREP: gcs_caused() returned -103 (Software caused connection abort)
    2014-02-16 16:17:15 2504 [Warning] WSREP: gcs_caused() returned -103 (Software caused connection abort)
    2014-02-16 16:25:03 2504 [Note] /usr/sbin/mysqld: Normal shutdown
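
    A follow-up on the question above: the recovery I had in mind, but have not dared to run yet, is forcing a new primary component from one of the surviving nodes. This is only a sketch based on the Galera documentation for the pc.bootstrap provider option; whether it is safe while the nodes are reported to be in an "unknown state" is exactly what I am unsure about.

    -- On node 2 or node 3 (the nodes that are still running), check the component state first:
    SHOW STATUS LIKE 'wsrep_cluster_status';   -- expected to report 'non-Primary' here

    -- If it is non-Primary, tell this node to bootstrap a new primary component:
    SET GLOBAL wsrep_provider_options = 'pc.bootstrap=YES';

    -- The cluster should then report Primary again, and node 1 can be restarted to rejoin:
    SHOW STATUS LIKE 'wsrep_cluster_status';
    SHOW STATUS LIKE 'wsrep_cluster_size';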

  • #2
    Well, I moved my third node to the same datacenter as nodes 1 and 2 and hope this works as a workaround. It is not a real solution, though; I thought XtraDB Cluster was supposed to be capable of working with nodes distributed across datacenters.
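
    For the record, if I ever move node 3 back to datacenter B, my plan is to keep it in its own network segment and relax the evs timeouts to tolerate the WAN latency. The snippet below is only a sketch assembled from the Galera parameter documentation (gmcast.segment, evs.suspect_timeout, evs.inactive_timeout, evs.keepalive_period); the actual values would need tuning against our link quality, and the log above suggests the segments themselves were already set (segment=1 vs. segment=2).

    # my.cnf on node 3 (datacenter B); nodes 1 and 2 keep gmcast.segment=1
    [mysqld]
    wsrep_provider_options = "gmcast.segment=2; evs.suspect_timeout=PT30S; evs.inactive_timeout=PT1M; evs.keepalive_period=PT3S"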
