Confusion regarding node health and nodes acting bizarrely

  • Confusion regarding node health and nodes acting bizarrely

    I'm a bit confused about Percona XtraDB Cluster.

    Everything I've read says that if a node isn't healthy it won't run queries, so how do you define healthy?

    I am also having a weird problem...

    Every time I restart one of the nodes, whether by a reboot or just
    by running 'service mysql restart', the first attempt to start the
    server fails like this:

    120601 16:20:00 [Note] WSREP: Flow-control interval: [14, 28]
    120601 16:20:00 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 44436552)
    120601 16:20:00 [Note] WSREP: State transfer required:
    Group state: 09ca5817-a4d4-11e1-0800-63fd2729e5d0:44436552
    Local state: 09ca5817-a4d4-11e1-0800-63fd2729e5d0:44414916
    120601 16:20:00 [Note] WSREP: New cluster view: global state: 09ca5817-
    a4d4-11e1-0800-63fd2729e5d0:44436552, view# 45: Primary, number of
    nodes: 3, my index: 1, protocol version 1
    120601 16:20:00 [Warning] WSREP: Gap in state sequence. Need state
    transfer.
    120601 16:20:02 [Note] WSREP: Running: 'wsrep_sst_rsync 'joiner'
    '10.1.0.3' '' '/var/lib/mysql/' '/etc/my.cnf' '11636' 2>sst.err'
    120601 16:20:03 [Note] WSREP: Prepared SST request: rsync|
    10.1.0.3:4444/rsync_sst
    120601 16:20:03 [Note] WSREP: wsrep_notify_cmd is not defined,
    skipping notification.
    120601 16:20:03 [Note] WSREP: Assign initial position for
    certification: 44436552, protocol version: 2
    120601 16:20:03 [Note] WSREP: Prepared IST receiver, listening at:
    tcp://[::1]:4568
    120601 16:20:03 [Note] WSREP: Node 1 (node2) requested state transfer
    from '*any*'. Selected 0 (node3)(SYNCED) as donor.
    120601 16:20:03 [Note] WSREP: Shifting PRIMARY -> JOINER (TO:
    44437207)
    120601 16:20:03 [Note] WSREP: Requesting state transfer: success,
    donor: 0
    120601 16:20:03 [Warning] WSREP: 0 (node3): State transfer to 1
    (node2) failed: -111 (Connection refused)
    120601 16:20:03 [ERROR] WSREP: gcs/src/
    gcs_group.c:gcs_group_handle_join_msg():712: Will never receive state.
    Need to abort.
    120601 16:20:03 [Note] WSREP: gcomm: terminating thread
    120601 16:20:03 [Note] WSREP: gcomm: joining thread
    120601 16:20:03 [Note] WSREP: gcomm: closing backend
    120601 16:20:03 [Note] WSREP: view(view_id(NON_PRIM,0ab93977-
    ac20-11e1-0800-d704543b47b1,47) memb {
    29899d3e-ac27-11e1-0800-cd972382d31d,
    } joined {
    } left {
    } partitioned {
    0ab93977-ac20-11e1-0800-d704543b47b1,
    ea097a16-ac1f-11e1-0800-2c7896169250,
    })
    120601 16:20:03 [Note] WSREP: view((empty))
    120601 16:20:03 [Note] WSREP: gcomm: closed
    120601 16:20:03 [Note] WSREP: /usr/sbin/mysqld: Terminated.
    120601 16:20:03 mysqld_safe mysqld from pid file /var/lib/mysql/
    node2.cluster.net.pid ended

    On another cluster member it looks like this:

    120601 16:20:00 [Note] WSREP: New COMPONENT: primary = yes, bootstrap
    = no, my_idx = 0, memb_num = 3
    120601 16:20:00 [Note] WSREP: STATE_EXCHANGE: sent state UUID:
    29d90095-ac27-11e1-0800-155f0701382a
    120601 16:20:00 [Note] WSREP: STATE EXCHANGE: sent state msg: 29d90095-
    ac27-11e1-0800-155f0701382a
    120601 16:20:00 [Note] WSREP: STATE EXCHANGE: got state msg: 29d90095-
    ac27-11e1-0800-155f0701382a from 2 (node1)
    120601 16:20:00 [Note] WSREP: STATE EXCHANGE: got state msg: 29d90095-
    ac27-11e1-0800-155f0701382a from 0 (node3)
    120601 16:20:00 [Note] WSREP: STATE EXCHANGE: got state msg: 29d90095-
    ac27-11e1-0800-155f0701382a from 1 (node2)
    120601 16:20:00 [Note] WSREP: Quorum results:
    version = 2,
    component = PRIMARY,
    conf_id = 44,
    members = 2/3 (joined/total),
    act_id = 44436552,
    last_appl. = 44436263,
    protocols = 0/4/1 (gcs/repl/appl),
    group UUID = 09ca5817-a4d4-11e1-0800-63fd2729e5d0
    120601 16:20:00 [Note] WSREP: Flow-control interval: [14, 28]
    120601 16:20:00 [Note] WSREP: New cluster view: global state: 09ca5817-
    a4d4-11e1-0800-63fd2729e5d0:44436552, view# 45: Primary, number of
    nodes: 3, my index: 0, protocol version 1
    120601 16:20:00 [Note] WSREP: wsrep_notify_cmd is not defined,
    skipping notification.
    120601 16:20:00 [Note] WSREP: Assign initial position for
    certification: 44436552, protocol version: 2
    120601 16:20:03 [Note] WSREP: Node 1 (node2) requested state transfer
    from '*any*'. Selected 0 (node3)(SYNCED) as donor.
    120601 16:20:03 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO:
    44437207)
    120601 16:20:03 [Note] WSREP: IST request: 09ca5817-
    a4d4-11e1-0800-63fd2729e5d0:4...4436552|tcp://[::1]:4568
    120601 16:20:03 [Note] WSREP: wsrep_notify_cmd is not defined,
    skipping notification.
    120601 16:20:03 [Note] WSREP: Running: 'wsrep_sst_rsync 'donor'
    '10.1.0.3:4444/rsync_sst' '(null)' '/var/lib/mysql/' '/etc/my.cnf'
    '09ca5817-a4d4-11e1-0800-63fd2729e5d0' '44414916' '1''
    120601 16:20:03 [Note] WSREP: sst_donor_thread signaled with 0
    120601 16:20:03 [ERROR] WSREP: IST failed: IST sender, failed to
    connect 'tcp://[::1]:4568': Connection refused: 111 (Connection
    refused)
    at galera/src/ist.cpp:Sender():622
    120601 16:20:03 [Warning] WSREP: 0 (node3): State transfer to 1
    (node2) failed: -111 (Connection refused)
    120601 16:20:03 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO:
    44437207)
    120601 16:20:03 [Note] WSREP: Member 0 (node3) synced with group.
    120601 16:20:03 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 44437207)
    120601 16:20:03 [Note] WSREP: declaring ea097a16-
    ac1f-11e1-0800-2c7896169250 stable
    120601 16:20:03 [Note] WSREP: Synchronized with group, ready for
    connections
    120601 16:20:03 [Note] WSREP: wsrep_notify_cmd is not defined,
    skipping notification.
    120601 16:20:03 [Note] WSREP: (0ab93977-ac20-11e1-0800-d704543b47b1,
    'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive
    peers: tcp://10.1.0.3:4567
    120601 16:20:03 [Note] WSREP: view(view_id(PRIM,0ab93977-
    ac20-11e1-0800-d704543b47b1,48) memb {
    0ab93977-ac20-11e1-0800-d704543b47b1,
    ea097a16-ac1f-11e1-0800-2c7896169250,
    } joined {
    } left {
    } partitioned {
    29899d3e-ac27-11e1-0800-cd972382d31d,
    })
    120601 16:20:03 [Note] WSREP: forgetting 29899d3e-ac27-11e1-0800-
    cd972382d31d (tcp://10.1.0.3:4567)
    120601 16:20:03 [Note] WSREP: (0ab93977-ac20-11e1-0800-d704543b47b1,
    'tcp://0.0.0.0:4567') turning message relay requesting off
    120601 16:20:03 [Note] WSREP: New COMPONENT: primary = yes, bootstrap
    = no, my_idx = 0, memb_num = 2
    120601 16:20:03 [Note] WSREP: STATE_EXCHANGE: sent state UUID:
    2b6fc342-ac27-11e1-0800-6abab72f748a
    120601 16:20:03 [Note] WSREP: STATE EXCHANGE: sent state msg: 2b6fc342-
    ac27-11e1-0800-6abab72f748a
    120601 16:20:03 [Note] WSREP: STATE EXCHANGE: got state msg: 2b6fc342-
    ac27-11e1-0800-6abab72f748a from 0 (node3)
    120601 16:20:03 [Note] WSREP: STATE EXCHANGE: got state msg: 2b6fc342-
    ac27-11e1-0800-6abab72f748a from 1 (node1)
    120601 16:20:03 [Note] WSREP: Quorum results:
    version = 2,
    component = PRIMARY,
    conf_id = 45,
    members = 2/2 (joined/total),
    act_id = 44437208,
    last_appl. = 44436263,
    protocols = 0/4/1 (gcs/repl/appl),
    group UUID = 09ca5817-a4d4-11e1-0800-63fd2729e5d0
    120601 16:20:03 [Note] WSREP: Flow-control interval: [12, 23]
    120601 16:20:03 [Note] WSREP: New cluster view: global state: 09ca5817-
    a4d4-11e1-0800-63fd2729e5d0:44437208, view# 46: Primary, number of
    nodes: 2, my index: 0, protocol version 1
    120601 16:20:03 [Note] WSREP: wsrep_notify_cmd is not defined,
    skipping notification.
    120601 16:20:03 [Note] WSREP: Assign initial position for
    certification: 44437208, protocol version: 2
    120601 16:20:03 [ERROR] WSREP: sst sent called when not SST donor,
    state SYNCED
    120601 16:20:08 [Note] WSREP: cleaning up 29899d3e-ac27-11e1-0800-
    cd972382d31d (tcp://10.1.0.3:4567)

    If I then issue 'service mysql start' again, the service always starts
    on the second try, but that is where it gets really odd.

    The node that I restarted does synchronize with the cluster, but then
    another node gets stuck in the 'DONOR (+)' state forever, like this:

    120601 16:37:14 [Note] WSREP: declaring 9213f9bf-ac29-11e1-0800-
    c51e2924d1f3 stable
    120601 16:37:14 [Note] WSREP: declaring ea097a16-
    ac1f-11e1-0800-2c7896169250 stable
    120601 16:37:14 [Note] WSREP: view(view_id(PRIM,0ab93977-
    ac20-11e1-0800-d704543b47b1,49) memb {
    0ab93977-ac20-11e1-0800-d704543b47b1,
    9213f9bf-ac29-11e1-0800-c51e2924d1f3,
    ea097a16-ac1f-11e1-0800-2c7896169250,
    } joined {
    } left {
    } partitioned {
    })
    120601 16:37:14 [Note] WSREP: New COMPONENT: primary = yes, bootstrap
    = no, my_idx = 0, memb_num = 3
    120601 16:37:14 [Note] WSREP: STATE_EXCHANGE: sent state UUID:
    92645987-ac29-11e1-0800-d98cd11aaec5
    120601 16:37:14 [Note] WSREP: STATE EXCHANGE: sent state msg: 92645987-
    ac29-11e1-0800-d98cd11aaec5
    120601 16:37:14 [Note] WSREP: STATE EXCHANGE: got state msg: 92645987-
    ac29-11e1-0800-d98cd11aaec5 from 0 (node3)
    120601 16:37:14 [Note] WSREP: STATE EXCHANGE: got state msg: 92645987-
    ac29-11e1-0800-d98cd11aaec5 from 2 (node1)
    120601 16:37:15 [Note] WSREP: STATE EXCHANGE: got state msg: 92645987-
    ac29-11e1-0800-d98cd11aaec5 from 1 (node2)
    120601 16:37:15 [Note] WSREP: Quorum results:
    version = 2,
    component = PRIMARY,
    conf_id = 46,
    members = 2/3 (joined/total),
    act_id = 44654372,
    last_appl. = 44654248,
    protocols = 0/4/1 (gcs/repl/appl),
    group UUID = 09ca5817-a4d4-11e1-0800-63fd2729e5d0
    120601 16:37:15 [Note] WSREP: Flow-control interval: [14, 28]
    120601 16:37:15 [Note] WSREP: New cluster view: global state: 09ca5817-
    a4d4-11e1-0800-63fd2729e5d0:44654372, view# 47: Primary, number of
    nodes: 3, my index: 0, protocol version 1
    120601 16:37:15 [Note] WSREP: wsrep_notify_cmd is not defined,
    skipping notification.
    120601 16:37:15 [Note] WSREP: Assign initial position for
    certification: 44654372, protocol version: 2
    120601 16:37:17 [Note] WSREP: Node 1 (node2) requested state transfer
    from '*any*'. Selected 0 (node3)(SYNCED) as donor.
    120601 16:37:17 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO:
    44655067)
    120601 16:37:17 [Note] WSREP: wsrep_notify_cmd is not defined,
    skipping notification.
    120601 16:37:17 [Note] WSREP: Running: 'wsrep_sst_rsync 'donor'
    '10.1.0.3:4444/rsync_sst' '(null)' '/var/lib/mysql/' '/etc/my.cnf'
    '09ca5817-a4d4-11e1-0800-63fd2729e5d0' '44655067' '0''
    120601 16:37:17 [Note] WSREP: sst_donor_thread signaled with 0
    120601 16:37:17 [Note] WSREP: Flushing tables for SST...
    120601 16:37:17 [Note] WSREP: Provider paused at 09ca5817-
    a4d4-11e1-0800-63fd2729e5d0:44655070
    120601 16:37:17 [Note] WSREP: Tables flushed.
    120601 16:37:40 [Note] WSREP: Provider resumed.
    120601 16:37:44 [Note] WSREP: 1 (node2): State transfer from 0 (node3)
    complete.
    120601 16:37:49 [Note] WSREP: Member 1 (node2) synced with group.

    So then, if I restart this node, it fails, and if I start it again
    it turns another node in the cluster into a permanent donor, and so on
    =) Essentially it seems to be impossible for me to get all 3 synced.

    configs:

    node 1
    [mysqld]
    datadir=/var/lib/mysql
    user=mysql
    binlog_format=ROW
    wsrep_provider=/usr/lib64/libgalera_smm.so
    wsrep_cluster_address=gcomm://10.1.0.3
    wsrep_slave_threads=2
    wsrep_cluster_name=cluster
    wsrep_sst_method=rsync
    wsrep_node_name=node1
    wsrep_sst_receive_address=10.1.0.2
    default_storage_engine=InnoDB
    innodb_locks_unsafe_for_binlog=1
    innodb_autoinc_lock_mode=2
    max_connections=2048
    max_allowed_packet=32M
    table_cache=2048
    thread_cache_size=32
    query_cache_size=256M
    innodb_buffer_pool_size=2048M
    sort_buffer_size=64M
    read_rnd_buffer_size=2M

    node 2

    [mysqld]
    datadir=/var/lib/mysql
    user=mysql
    binlog_format=ROW
    wsrep_provider=/usr/lib64/libgalera_smm.so
    wsrep_cluster_address=gcomm://10.1.0.4
    wsrep_sst_receive_address=10.1.0.3
    wsrep_slave_threads=2
    wsrep_cluster_name=cluster
    wsrep_sst_method=rsync
    wsrep_node_name=node2
    default_storage_engine=InnoDB
    innodb_locks_unsafe_for_binlog=1
    innodb_autoinc_lock_mode=2
    max_connections=2048
    max_allowed_packet=32M
    table_cache=2048
    thread_cache_size=32
    query_cache_size=256M
    innodb_buffer_pool_size=2048M
    sort_buffer_size=64M
    read_rnd_buffer_size=2M

    node 3

    [mysqld]
    datadir=/var/lib/mysql
    user=mysql
    binlog_format=ROW
    wsrep_provider=/usr/lib64/libgalera_smm.so
    wsrep_cluster_address=gcomm://10.1.0.2
    wsrep_sst_receive_address=10.1.0.4
    wsrep_slave_threads=2
    wsrep_cluster_name=cluster
    wsrep_sst_method=rsync
    wsrep_node_name=node3
    default_storage_engine=InnoDB
    innodb_locks_unsafe_for_binlog=1
    innodb_autoinc_lock_mode=2
    max_connections=2048
    max_allowed_packet=32M
    table_cache=2048
    thread_cache_size=32
    query_cache_size=256M
    innodb_buffer_pool_size=2048M
    sort_buffer_size=64M
    read_rnd_buffer_size=2M

    Does anyone have any idea what might be going on?

    I noticed the 'connection refused' messages and figured it was a firewall problem, but there is no firewall running on EM2 (that's the interface name; this is a Dell server).

    Does anyone have any clue what's going on? =)

  • #2
    Here's your problem:

    120601 16:20:03 [Note] WSREP: Prepared IST receiver, listening at: tcp://[::1]:4568

    What did you do to get IST listening at localhost? Perhaps it is the lack of bind_address in config. What happens if you add bind_address=0.0.0.0 to my.cnf?
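To check a joiner's log for this symptom, a small sketch along these lines may help. It is an illustration only: the helper names `ist_listen_address` and `is_loopback` are made up for this example, and the regular expression assumes the log wording shown in this thread ("Prepared IST receiver, listening at: tcp://host:port").

```python
import ipaddress
import re

# Matches the host and port in a "Prepared IST receiver" log line.
# The host may be a bracketed IPv6 literal such as [::1] or a plain IPv4/hostname.
IST_LINE = re.compile(
    r"Prepared IST receiver, listening at:\s*tcp://(\[[^\]]+\]|[^:\s]+):(\d+)"
)


def ist_listen_address(log_line):
    """Return (host, port) from an IST receiver log line, or None if absent."""
    match = IST_LINE.search(log_line)
    if match is None:
        return None
    host = match.group(1).strip("[]")  # drop IPv6 brackets if present
    return host, int(match.group(2))


def is_loopback(host):
    """True if host is a loopback IP address or the literal name 'localhost'."""
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal; only treat the conventional name as loopback.
        return host == "localhost"


line = ("120601 16:20:03 [Note] WSREP: Prepared IST receiver, "
        "listening at: tcp://[::1]:4568")
host, port = ist_listen_address(line)
if is_loopback(host):
    print(f"IST receiver is bound to loopback ({host}:{port})")
```

When the joiner advertises a loopback address like this, the donor tries to connect to port 4568 on its own loopback interface, which is consistent with the "Connection refused" errors in the donor log.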



    • #3
      Hmm. As you can see above, I'm not setting the bind address anywhere, and it's not in the instructions I followed, but I can try that. If I wanted all of the cluster traffic to go over the interface EM2, would I set bind_address to the IP of EM2?

      That is why I changed these lines:

      wsrep_sst_receive_address=10.1.0.4

      Also, I don't think I even have IPv6 enabled on these boxes, so the fact that it's specifically binding to ::1 is a bit weird. I wonder if it is because Galera is hard-coded to look for eth[0-7], while Dell servers use EM[0-7] for some crazy reason?

      120601 16:37:14 [ERROR] WSREP: Failed to read output of: '/sbin/ifconfig | grep -m1 -1 -E '^[a-z]?eth[0-9]' | tail -n 1 | awk '{ print $2 }' | awk -F : '{ print $2 }''

      It would be neat if it checked the value of biosdevname (if it is not 0, it should look at EM1 - EM9).

      Here is info on that: http://linux.dell.com/biosdevname/

      Or it could have a NIC-prefix setting like em or eth, plus a start/end number?

      Sorry, I meant to ask: is there some way to get it to look for the NICs on EM1-8 instead of eth0-7?



      • #4
        Quote:
        Hmm, As you can see above I'm not setting the bind address anywhere and it's not in the instructions I followed but I can try that.
        Please do.

        Quote:
        If I wanted all of the cluster traffic to go over the interface EM2 would I set bind_address to be the IP of EM2?
        Well, bind_address is for client traffic. If you're talking about replication traffic, the simplest way is to set wsrep_node_address to the IP of EM2. But please try bind_address=0.0.0.0 first.

        Quote:
        Sorry I meant to ask, is there some way to get it to look for the NICS on EM1-8 instead of eth0-7
        Only by editing the code and rebuilding. However, you should not rely on node address autodetection. It is only a best-effort guess, and in general there is no guarantee it will do the right thing. For best results, configure addresses manually.
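As a rough illustration of why autodetection finds nothing on these hosts, the sketch below applies the same pattern that appears in the logged command (`grep -E '^[a-z]?eth[0-9]'`) to a few interface names. The function name `matches_autodetect_pattern` is hypothetical, and the lowercase names `em1`/`em2` are an assumption about how the Dell interfaces appear in ifconfig output. It mirrors only the grep step, not the whole pipeline.

```python
import re

# Same extended regular expression as the grep step in the logged command.
AUTODETECT_PATTERN = re.compile(r"^[a-z]?eth[0-9]")


def matches_autodetect_pattern(interface_name):
    """True if an ifconfig line starting with this name would pass the grep."""
    return AUTODETECT_PATTERN.search(interface_name) is not None


# Interface names are examples; em1/em2 are assumed biosdevname-style names.
for name in ("eth0", "peth1", "em1", "em2", "lo"):
    status = "matched" if matches_autodetect_pattern(name) else "skipped"
    print(f"{name}: {status}")
```

Because no `em*` name passes the pattern, the pipeline produces no output, which fits the "Failed to read output of" and "Failed to autoguess base node address" messages.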



        • #5
          Quote:
          If I wanted all of the cluster traffic to go over the interface EM2 would I set bind_address to be the IP of EM2?
          Well, bind_address is for the client traffic, if you're talking about replication traffic, then the simplest way is to set wsrep_node_address to the IP of EM2. But please, try first with bind_address=0.0.0.0.
          Clients connect to 'localhost' on the nodes. Does bind-address affect IST?

          thanks!



          • #6
            Hi, I added bind_address=0.0.0.0 to the node that is currently in state DONOR(+) and restarted it. As before, it failed, with this in the log:

            120603 16:34:27 [Note] /usr/sbin/mysqld: Normal shutdown

            120603 16:34:27 [Note] WSREP: Stop replication
            120603 16:34:27 [Note] WSREP: Closing send monitor...
            120603 16:34:27 [Note] WSREP: Closed send monitor.
            120603 16:34:27 [Note] WSREP: gcomm: terminating thread
            120603 16:34:27 [Note] WSREP: gcomm: joining thread
            120603 16:34:27 [Note] WSREP: gcomm: closing backend
            120603 16:34:27 [Note] WSREP: view(view_id(NON_PRIM,697bc732-aceb-11e1-0800-8447e7f33848,57) memb {
            6f6e635f-ad97-11e1-0800-9d7ed619e4ab,
            } joined {
            } left {
            } partitioned {
            697bc732-aceb-11e1-0800-8447e7f33848,
            ea097a16-ac1f-11e1-0800-2c7896169250,
            })
            120603 16:34:27 [Note] WSREP: view((empty))
            120603 16:34:27 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
            120603 16:34:27 [Note] WSREP: gcomm: closed
            120603 16:34:27 [Note] WSREP: Flow-control interval: [8, 16]
            120603 16:34:27 [Note] WSREP: Received NON-PRIMARY.
            120603 16:34:27 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 76830460)
            120603 16:34:27 [Note] WSREP: Received self-leave message.
            120603 16:34:27 [Note] WSREP: Flow-control interval: [0, 0]
            120603 16:34:27 [Note] WSREP: Received SELF-LEAVE. Closing connection.
            120603 16:34:27 [Note] WSREP: New cluster view: global state: 09ca5817-a4d4-11e1-0800-63fd2729e5d0:76830460, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 1
            120603 16:34:27 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
            120603 16:34:27 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 76830460)
            120603 16:34:27 [Note] WSREP: RECV thread exiting 0: Success
            120603 16:34:27 [Note] WSREP: New cluster view: global state: 09ca5817-a4d4-11e1-0800-63fd2729e5d0:76830460, view# -1: non-Primary, number of nodes: 0, my index: -1, protocol version 1
            120603 16:34:27 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
            120603 16:34:27 [Note] WSREP: applier thread exiting (code:0)
            120603 16:34:27 [Note] WSREP: recv_thread() joined.
            120603 16:34:27 [Note] WSREP: Closing slave action queue.
            120603 16:34:27 [Note] WSREP: applier thread exiting (code:5)
            120603 16:34:29 [Note] WSREP: SST kill local trx: 10
            120603 16:34:29 [Note] WSREP: SST kill local trx: 9
            120603 16:34:29 [Note] WSREP: SST kill local trx: 8
            120603 16:34:29 [Note] WSREP: SST kill local trx: 7
            120603 16:34:29 [Note] WSREP: SST kill local trx: 6
            120603 16:34:29 [Note] WSREP: rollbacker thread exiting
            120603 16:34:29 [Note] Event Scheduler: Purging the queue. 0 events
            120603 16:34:29 [Note] WSREP: dtor state: CLOSED
            120603 16:34:29 [Note] WSREP: mon: entered 3577200 oooe fraction 0 oool fraction 0
            120603 16:34:29 [Note] WSREP: mon: entered 3577200 oooe fraction 0.193202 oool fraction 0
            120603 16:34:29 [Note] WSREP: mon: entered 3598479 oooe fraction 0 oool fraction 2.77895e-07
            120603 16:34:29 [Note] WSREP: cert index usage at exit 497
            120603 16:34:29 [Note] WSREP: cert trx map usage at exit 613
            120603 16:34:29 [Note] WSREP: deps set usage at exit 0
            120603 16:34:29 [Note] WSREP: avg deps dist 486.106
            120603 16:34:29 [Note] WSREP: wsdb trx map usage 0 conn query map usage 0
            120603 16:34:29 [Note] WSREP: Shifting CLOSED -> DESTROYED (TO: 76830460)
            120603 16:34:29 [Note] WSREP: Flushing memory map to disk...
            120603 16:34:29 InnoDB: Starting shutdown...
            120603 16:34:30 InnoDB: Waiting for 200 pages to be flushed
            120603 16:34:35 InnoDB: Shutdown completed; log sequence number 75136912616
            120603 16:34:35 [Note] /usr/sbin/mysqld: Shutdown complete

            120603 16:34:35 mysqld_safe mysqld from pid file /var/lib/mysql/node2.cluster.net.pid ended
            120603 16:34:35 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
            120603 16:34:35 [Note] Flashcache bypass: disabled
            120603 16:34:35 [Note] Flashcache setup error is : ioctl failed

            120603 16:34:35 [Note] WSREP: Read nil XID from storage engines, skipping position init
            120603 16:34:35 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/libgalera_smm.so'
            120603 16:34:35 [Note] WSREP: wsrep_load(): Galera 2.1dev(r112) by Codership Oy loaded succesfully.
            120603 16:34:36 [ERROR] WSREP: Failed to read output of: '/sbin/ifconfig | grep -m1 -1 -E '^[a-z]?eth[0-9]' | tail -n 1 | awk '{ print $2 }' | awk -F : '{ print $2 }''
            120603 16:34:36 [ERROR] WSREP: Failed to read output of: '/sbin/ifconfig | grep -m1 -1 -E '^[a-z]?eth[0-9]' | tail -n 1 | awk '{ print $2 }' | awk -F : '{ print $2 }''
            120603 16:34:36 [Warning] WSREP: Failed to autoguess base node address
            120603 16:34:36 [Note] WSREP: Found saved state: 09ca5817-a4d4-11e1-0800-63fd2729e5d0:76830460
            120603 16:34:36 [Note] WSREP: Reusing existing '/var/lib/mysql//galera.cache'.
            120603 16:34:36 [Note] WSREP: Passing config to GCS: gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.size = 128M; gcs.fc_debug = 0; gcs.fc_factor = 0.5; gcs.fc_limit = 16; gcs.fc_master_slave = NO; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = NO; replicator.causal_read_timeout = PT30S; replicator.commit_order = 3
            120603 16:34:36 [Note] WSREP: Assign initial position for certification: 76830460, protocol version: -1
            120603 16:34:36 [Note] WSREP: wsrep_sst_grab()
            120603 16:34:36 [Note] WSREP: Start replication
            120603 16:34:36 [Note] WSREP: Setting initial position to 09ca5817-a4d4-11e1-0800-63fd2729e5d0:76830460
            120603 16:34:36 [Note] WSREP: protonet asio version 0
            120603 16:34:36 [Note] WSREP: backend: asio
            120603 16:34:36 [Note] WSREP: GMCast version 0
            120603 16:34:36 [Note] WSREP: (889c91d1-adbb-11e1-0800-9f2ade06365a, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
            120603 16:34:36 [Note] WSREP: (889c91d1-adbb-11e1-0800-9f2ade06365a, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
            120603 16:34:36 [Note] WSREP: EVS version 0
            120603 16:34:36 [Note] WSREP: PC version 0
            120603 16:34:36 [Note] WSREP: gcomm: connecting to group 'cluster', peer '10.1.0.4:'
            120603 16:34:36 [Note] WSREP: (889c91d1-adbb-11e1-0800-9f2ade06365a, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://10.1.0.2:4567
            120603 16:34:36 [Note] WSREP: (889c91d1-adbb-11e1-0800-9f2ade06365a, 'tcp://0.0.0.0:4567') turning message relay requesting off
            120603 16:34:36 [Note] WSREP: declaring 697bc732-aceb-11e1-0800-8447e7f33848 stable
            120603 16:34:36 [Note] WSREP: declaring ea097a16-ac1f-11e1-0800-2c7896169250 stable
            120603 16:34:36 [Note] WSREP: view(view_id(PRIM,697bc732-aceb-11e1-0800-8447e7f33848,59) memb {
            697bc732-aceb-11e1-0800-8447e7f33848,
            889c91d1-adbb-11e1-0800-9f2ade06365a,
            ea097a16-ac1f-11e1-0800-2c7896169250,
            } joined {
            } left {
            } partitioned {
            })
            120603 16:34:37 [Note] WSREP: gcomm: connected
            120603 16:34:37 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
            120603 16:34:37 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
            120603 16:34:37 [Note] WSREP: Opened channel 'cluster'
            120603 16:34:37 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 3
            120603 16:34:37 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
            120603 16:34:37 [Note] WSREP: Waiting for SST to complete.
            120603 16:34:37 [Note] WSREP: STATE EXCHANGE: sent state msg: 88ea43ab-adbb-11e1-0800-304e79743bcc
            120603 16:34:37 [Note] WSREP: STATE EXCHANGE: got state msg: 88ea43ab-adbb-11e1-0800-304e79743bcc from 0 (node3)
            120603 16:34:37 [Note] WSREP: STATE EXCHANGE: got state msg: 88ea43ab-adbb-11e1-0800-304e79743bcc from 2 (node1)
            120603 16:34:37 [Note] WSREP: STATE EXCHANGE: got state msg: 88ea43ab-adbb-11e1-0800-304e79743bcc from 1 (node2)
            120603 16:34:37 [Note] WSREP: Quorum results:
            version = 2,
            component = PRIMARY,
            conf_id = 56,
            members = 2/3 (joined/total),
            act_id = 76832306,
            last_appl. = -1,
            protocols = 0/4/1 (gcs/repl/appl),
            group UUID = 09ca5817-a4d4-11e1-0800-63fd2729e5d0
            120603 16:34:37 [Note] WSREP: Flow-control interval: [14, 28]
            120603 16:34:37 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 76832306)
            120603 16:34:37 [Note] WSREP: State transfer required:
            Group state: 09ca5817-a4d4-11e1-0800-63fd2729e5d0:76832306
            Local state: 09ca5817-a4d4-11e1-0800-63fd2729e5d0:76830460
            120603 16:34:37 [Note] WSREP: New cluster view: global state: 09ca5817-a4d4-11e1-0800-63fd2729e5d0:76832306, view# 57: Primary, number of nodes: 3, my index: 1, protocol version 1
            120603 16:34:37 [Warning] WSREP: Gap in state sequence. Need state transfer.
            120603 16:34:39 [Note] WSREP: Running: 'wsrep_sst_rsync 'joiner' '10.1.0.3' '' '/var/lib/mysql/' '/etc/my.cnf' '23526' 2>sst.err'
            120603 16:34:39 [Note] WSREP: Prepared SST request: rsync|10.1.0.3:4444/rsync_sst
            120603 16:34:39 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
            120603 16:34:39 [Note] WSREP: Assign initial position for certification: 76832306, protocol version: 2
            120603 16:34:39 [Note] WSREP: Prepared IST receiver, listening at: tcp://[::1]:4568
            120603 16:34:39 [Note] WSREP: Node 1 (node2) requested state transfer from '*any*'. Selected 2 (node1)(SYNCED) as donor.
            120603 16:34:39 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 76833005)
            120603 16:34:39 [Note] WSREP: Requesting state transfer: success, donor: 2
            120603 16:34:39 [Warning] WSREP: 2 (node1): State transfer to 1 (node2) failed: -111 (Connection refused)
            120603 16:34:39 [ERROR] WSREP: gcs/src/gcs_group.c:gcs_group_handle_join_msg():712: Will never receive state. Need to abort.
            120603 16:34:39 [Note] WSREP: gcomm: terminating thread
            120603 16:34:39 [Note] WSREP: gcomm: joining thread
            120603 16:34:39 [Note] WSREP: gcomm: closing backend
            120603 16:34:39 [Note] WSREP: view(view_id(NON_PRIM,697bc732-aceb-11e1-0800-8447e7f33848,59) memb {
            889c91d1-adbb-11e1-0800-9f2ade06365a,
            } joined {
            } left {
            } partitioned {
            697bc732-aceb-11e1-0800-8447e7f33848,
            ea097a16-ac1f-11e1-0800-2c7896169250,
            })
            120603 16:34:39 [Note] WSREP: view((empty))
            120603 16:34:39 [Note] WSREP: gcomm: closed
            120603 16:34:39 [Note] WSREP: /usr/sbin/mysqld: Terminated.



            • #7
              Thanks. This looks like a bug in guessing the IST receiver address. Please set wsrep_node_address to the IP of the interface you want to use for replication. You also need to open port 4568 in the firewall, if there is one.

              No, bind-address does not affect IST itself, but it does affect the guess of the address at which the joiner will listen for IST.
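Applied to the configs posted at the start of the thread, that suggestion might look like the fragment below for node2. This is a sketch, not a tested configuration: the address 10.1.0.3 is taken from node2's existing wsrep_sst_receive_address and is assumed to be the IP of EM2. On the same assumption, node1 and node3 would use 10.1.0.2 and 10.1.0.4.

```ini
[mysqld]
# Existing node2 settings stay as posted; only this line is added.
# Assumption: 10.1.0.3 is the address of the replication interface (EM2) on node2.
wsrep_node_address=10.1.0.3
```

After a restart, the "Prepared IST receiver, listening at:" log line is the place to confirm that the joiner no longer advertises tcp://[::1]:4568.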
