Emergency

weird behavior of IST syncronization

Collapse
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • weird behavior of IST syncronization

    Hello,

    I have a xtradb cluster distributed between 3 nodes and one of the datacenters recently had couple internet outages which were causing node there to run out of sync. For a few days I were trying to figure out why IST sync were failing back to SST.
    After bunch of tries, errors and starts problem were resolved on its own. Any ideas what it could be? The way I was doing a test were to block incoming and outgoing requests to rest of the cluster nodes with iptables

    Failed IST sync:
    018-01-10T04:34:07.381530Z 0 [Note] WSREP: Flow-control interval: [173, 173]
    2018-01-10T04:34:07.381538Z 0 [Note] WSREP: Trying to continue unpaused monitor
    2018-01-10T04:34:07.381546Z 0 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 1687849)
    2018-01-10T04:34:07.381688Z 1 [Note] WSREP: State transfer required:
    Group state: ceb51ea0-5a1e-11e7-accd-42d4587e27e4:1687849
    Local state: ceb51ea0-5a1e-11e7-accd-42d4587e27e4:1687846
    2018-01-10T04:34:07.381722Z 1 [Note] WSREP: New cluster view: global state: ceb51ea0-5a1e-11e7-accd-42d4587e27e4:1687849, view# 243: Primary, number of nodes: 3, my index: 2, protocol version 3
    2018-01-10T04:34:07.381732Z 1 [Note] WSREP: Setting wsrep_ready to true
    2018-01-10T04:34:07.381740Z 1 [Warning] WSREP: Gap in state sequence. Need state transfer.
    2018-01-10T04:34:07.381746Z 1 [Note] WSREP: Setting wsrep_ready to false
    2018-01-10T04:34:07.381755Z 1 [Note] WSREP: You have configured 'xtrabackup-v2' state snapshot transfer method which cannot be performed on a running server. Wsrep provider won't be able to fall back to it if other means of state transfer are unavailable. In that case you will need to restart the server.
    2018-01-10T04:34:07.381764Z 1 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
    2018-01-10T04:34:07.381777Z 1 [Note] WSREP: REPL Protocols: 7 (3, 2)
    2018-01-10T04:34:07.381831Z 1 [Note] WSREP: Assign initial position for certification: 1687849, protocol version: 3
    2018-01-10T04:34:07.382003Z 0 [Note] WSREP: Service thread queue flushed.
    2018-01-10T04:34:07.382121Z 1 [Note] WSREP: Check if state gap can be serviced using IST
    2018-01-10T04:34:07.382261Z 1 [Note] WSREP: IST receiver addr using tcp://publicip.com:4568
    2018-01-10T04:34:07.382358Z 1 [Note] WSREP: IST receiver bind using tcp://dbb:4568
    2018-01-10T04:34:07.382594Z 1 [Note] WSREP: Prepared IST receiver, listening at: tcp://10.0.0.51:4568
    2018-01-10T04:34:07.382624Z 1 [Note] WSREP: State gap can be likely serviced using IST. SST request though present would be void.
    2018-01-10T04:34:07.426046Z 0 [Note] WSREP: Member 2.0 (publicip.com) requested state transfer from '*any*'. Selected 1.0 (dbC)(SYNCED) as donor.
    2018-01-10T04:34:07.426086Z 0 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 1687849)
    2018-01-10T04:34:07.426226Z 1 [Note] WSREP: Requesting state transfer: success, donor: 1
    2018-01-10T04:34:07.426270Z 1 [Note] WSREP: GCache history reset: ceb51ea0-5a1e-11e7-accd-42d4587e27e4:1687846 -> ceb51ea0-5a1e-11e7-accd-42d4587e27e4:1687849
    2018-01-10T04:34:07.432080Z 1 [Note] WSREP: GCache DEBUG: RingBuffer::seqno_reset(): full reset
    2018-01-10T04:34:07.440393Z 1 [Note] WSREP: Receiving IST: 3 writesets, seqnos 1687846-1687849
    2018-01-10T04:34:07.445561Z 0 [Note] WSREP: (91457fd6, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://dbcpublicip.com:4567 timed out, no messages seen in PT3S
    2018-01-10T04:34:07.576675Z 0 [ERROR] WSREP: got error while reading ist stream: asio.system:104
    2018-01-10T04:34:07.576725Z 0 [ERROR] WSREP: IST didn't contain all write sets, expected last: 1687849 last received: 1687846
    2018-01-10T04:34:07.576876Z 1 [ERROR] WSREP: receiving IST failed, node restart required: IST receiver reported error: 71 (Protocol error)
    at galera/src/ist.cpp:recv():558
    2018-01-10T04:34:07.579924Z 0 [Note] WSREP: 1.0 (dbC): State transfer to 2.0 (publicip.com) complete.
    2018-01-10T04:34:07.623296Z 0 [Note] WSREP: Member 1.0 (dbC) synced with group.
    2018-01-10T04:34:07.627104Z 1 [Note] WSREP: Closing send monitor...
    2018-01-10T04:34:07.627132Z 1 [Note] WSREP: Closed send monitor.
    2018-01-10T04:34:07.627147Z 1 [Note] WSREP: gcomm: terminating thread
    2018-01-10T04:34:07.627163Z 1 [Note] WSREP: gcomm: joining thread
    2018-01-10T04:34:07.627300Z 1 [Note] WSREP: gcomm: closing backend




    Successful IST sync:

    2018-01-10T04:59:10.932164Z 1 [Warning] WSREP: Gap in state sequence. Need state transfer.
    2018-01-10T04:59:10.932171Z 1 [Note] WSREP: Setting wsrep_ready to false
    2018-01-10T04:59:10.932181Z 1 [Note] WSREP: You have configured 'xtrabackup-v2' state snapshot transfer method which cannot be performed on a running server. Wsrep provider won't be able to fall back to it if other means of state transfer are unavailable. In that case you will need to restart the server.
    2018-01-10T04:59:10.932205Z 1 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
    2018-01-10T04:59:10.932220Z 1 [Note] WSREP: REPL Protocols: 7 (3, 2)
    2018-01-10T04:59:10.932251Z 1 [Note] WSREP: Assign initial position for certification: 1687902, protocol version: 3
    2018-01-10T04:59:10.932387Z 0 [Note] WSREP: Service thread queue flushed.
    2018-01-10T04:59:10.932506Z 1 [Note] WSREP: Check if state gap can be serviced using IST
    2018-01-10T04:59:10.932630Z 1 [Note] WSREP: IST receiver addr using tcp://publicdomain.com:4568
    2018-01-10T04:59:10.932706Z 1 [Note] WSREP: IST receiver bind using tcp://dbb:4568
    2018-01-10T04:59:10.932903Z 1 [Note] WSREP: Prepared IST receiver, listening at: tcp://10.0.0.51:4568
    2018-01-10T04:59:10.932931Z 1 [Note] WSREP: State gap can be likely serviced using IST. SST request though present would be void.
    2018-01-10T04:59:10.974374Z 0 [Note] WSREP: Member 2.0 (publicdomain.com) requested state transfer from '*any*'. Selected 1.0 (dbC)(SYNCED) as donor.
    2018-01-10T04:59:10.974404Z 0 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 1687902)
    2018-01-10T04:59:10.974510Z 1 [Note] WSREP: Requesting state transfer: success, donor: 1
    2018-01-10T04:59:10.974550Z 1 [Note] WSREP: GCache history reset: ceb51ea0-5a1e-11e7-accd-42d4587e27e4:1687899 -> ceb51ea0-5a1e-11e7-accd-42d4587e27e4:1687902
    2018-01-10T04:59:10.978535Z 1 [Note] WSREP: GCache DEBUG: RingBuffer::seqno_reset(): full reset
    2018-01-10T04:59:10.986850Z 1 [Note] WSREP: Receiving IST: 3 writesets, seqnos 1687899-1687902
    2018-01-10T04:59:11.123203Z 0 [Note] WSREP: Receiving IST... 0.0% (0/3 events) complete.
    2018-01-10T04:59:11.123483Z 0 [Note] WSREP: 1.0 (dbC): State transfer to 2.0 (publicdomain.com) complete.
    2018-01-10T04:59:11.165109Z 0 [Note] WSREP: Member 1.0 (dbC) synced with group.
    2018-01-10T04:59:11.203793Z 0 [Note] WSREP: Receiving IST...100.0% (3/3 events) complete.
    2018-01-10T04:59:11.254440Z 1 [Note] WSREP: IST received: ceb51ea0-5a1e-11e7-accd-42d4587e27e4:1687902
    2018-01-10T04:59:11.296028Z 0 [Note] WSREP: 2.0 (publicdomain.com): State transfer from 1.0 (dbC) complete.
    2018-01-10T04:59:11.296061Z 0 [Note] WSREP: SST leaving flow control
    2018-01-10T04:59:11.296075Z 0 [Note] WSREP: Shifting JOINER -> JOINED (TO: 1687902)
    2018-01-10T04:59:11.337483Z 0 [Note] WSREP: Member 2.0 (publicdomain.com) synced with group.
    2018-01-10T04:59:11.337514Z 0 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 1687902)
    2018-01-10T04:59:11.337644Z 1 [Note] WSREP: Synchronized with group, ready for connections
    2018-01-10T04:59:11.337671Z 1 [Note] WSREP: This node is synced, setting wsrep_ready to true












  • #2
    PXC needs stable n/w connectivity. Failing which PXC can recover but may fall back to SST if IST is abruptly interrupted.
    It seems like you are likely going through the same scenario.

    2018-01-10T04:34:07.576675Z 0 [ERROR] WSREP: got error while reading ist stream: asio.system:104
    2018-01-10T04:34:07.576725Z 0 [ERROR] WSREP: IST didn't contain all write sets, expected last: 1687849 last received: 1687846
    2018-01-10T04:34:07.576876Z 1 [ERROR] WSREP: receiving IST failed, node restart required: IST receiver reported error: 71 (Protocol error)
    at galera/src/ist.cpp:recv():558

    Comment

    Working...
    X