Trouble starting the cluster after a crash


    Hello,

    My donor node crashed, so I wanted to make one of the other machines in the cluster the donor.

    The configuration file for the node that should become the donor is below:
    # The MySQL server
    [mysqld]
    port = 3306
    socket = /var/lib/mysql/mysql.sock
    #skip-external-locking
    key_buffer_size = 384M
    max_allowed_packet = 1M
    table_open_cache = 512
    sort_buffer_size = 2M
    read_buffer_size = 2M
    read_rnd_buffer_size = 8M
    myisam_sort_buffer_size = 64M
    thread_cache_size = 8
    query_cache_size = 32M
    # Try number of CPU's*2 for thread_concurrency
    thread_concurrency = 8
    max_connections=10000
    max_connect_errors=10000


    ####################################################### Percona configuration
    wsrep_provider_options=gmcast.listen_addr=tcp://0.0.0.0:4567

    wsrep_cluster_address=gcomm://node1,node2,node3

    datadir=/var/lib/mysql
    user=mysql
    # Path to Galera library
    wsrep_provider=/usr/lib64/libgalera_smm.so


    # In order for Galera to work correctly binlog format should be ROW
    binlog_format=ROW

    # MyISAM storage engine has only experimental support
    default_storage_engine=InnoDB
    # This is a recommended tuning variable for performance
    innodb_locks_unsafe_for_binlog=1

    # This changes how InnoDB autoincrement locks are managed and is a requirement for Galera
    innodb_autoinc_lock_mode=2

    # Node #3 address
    wsrep_node_address=@ip_node3

    # SST method
    wsrep_sst_method=xtrabackup


    # Cluster name
    wsrep_cluster_name=cluster_name

    # Authentication for SST method
    wsrep_sst_auth="sstuser:PWD"

    wsrep_sst_donor= node3

    server-id = 3
    When I restart the new donor node with an empty cluster address, using /etc/init.d/mysql restart --wsrep-cluster-address=gcomm://, MySQL starts without a problem.
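
For reference, an empty gcomm:// address is what makes a node bootstrap a new cluster rather than join an existing one. Here is a minimal Python sketch of that distinction. The function names parse_cluster_address and startup_mode are hypothetical helpers of mine, not part of Galera or Percona, and the parsing is simplified (it only strips an optional ?options suffix).

```python
def parse_cluster_address(address):
    """Return the list of peers named in a wsrep_cluster_address value.

    Hypothetical helper: simplified parsing, for illustration only.
    """
    prefix = "gcomm://"
    if not address.startswith(prefix):
        raise ValueError("expected an address starting with gcomm://")
    # Drop any trailing "?option=value" part, then split the peer list.
    peers_part = address[len(prefix):].split("?", 1)[0]
    return [peer.strip() for peer in peers_part.split(",") if peer.strip()]


def startup_mode(address):
    """Classify the address: no peers means bootstrap, otherwise join."""
    return "join" if parse_cluster_address(address) else "bootstrap"


if __name__ == "__main__":
    for value in ("gcomm://", "gcomm://node1,node2,node3", "gcomm://node3:4567"):
        print(value, "->", startup_mode(value), parse_cluster_address(value))
```

Only the node that is started first should use the empty form; the others need at least one reachable peer in the list.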

    Case 1: Now, when I try to add the other machines to the cluster, I get the following error:

    140123 11:26:27 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
    140123 11:26:27 mysqld_safe WSREP: Running position recovery with --log_error=/tmp/tmp.cOxHIxHCdo
    140123 11:26:33 mysqld_safe WSREP: Recovered position c9a557c5-7eb7-11e3-0800-b30c642d54fb:631798
    140123 11:26:33 [Note] WSREP: wsrep_start_position var submitted: 'c9a557c5-7eb7-11e3-0800-b30c642d54fb:631798'
    140123 11:26:33 [Note] WSREP: Read nil XID from storage engines, skipping position init
    140123 11:26:33 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/libgalera_smm.so'
    140123 11:26:33 [Note] WSREP: wsrep_load(): Galera 2.5(r150) by Codership Oy <info@codership.com> loaded succesfully.
    140123 11:26:33 [Note] WSREP: Found saved state: c9a557c5-7eb7-11e3-0800-b30c642d54fb:-1
    140123 11:26:33 [Note] WSREP: Reusing existing '/var/lib/mysql//galera.cache'.
    140123 11:26:33 [Note] WSREP: Passing config to GCS: base_host = 10.xx.xx.xx; base_port = 4567; cert.log_conflicts = no; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.size = 128M; gcs.fc_debug = 0; gcs.fc_factor = 1; gcs.fc_limit = 16; gcs.fc_master_slave = NO; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = NO; gmcast.listen_addr = tcp://0.0.0.0:4567; replicator.causal_read_timeout = PT30S; replicator.commit_order = 3
    140123 11:26:33 [Note] WSREP: Assign initial position for certification: 631798, protocol version: -1
    140123 11:26:33 [Note] WSREP: wsrep_sst_grab()
    140123 11:26:33 [Note] WSREP: Start replication
    140123 11:26:33 [Note] WSREP: Setting initial position to c9a557c5-7eb7-11e3-0800-b30c642d54fb:631798
    140123 11:26:33 [Note] WSREP: protonet asio version 0
    140123 11:26:33 [Note] WSREP: backend: asio
    140123 11:26:33 [Note] WSREP: GMCast version 0
    140123 11:26:33 [Note] WSREP: (d4fd9a88-8418-11e3-0800-e6447370e946, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
    140123 11:26:33 [Note] WSREP: (d4fd9a88-8418-11e3-0800-e6447370e946, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
    140123 11:26:33 [Note] WSREP: EVS version 0
    140123 11:26:33 [Note] WSREP: PC version 0
    140123 11:26:33 [Note] WSREP: gcomm: connecting to group 'cluster_name', peer 'node3:4567'
    140123 11:26:33 [Warning] WSREP: (d4fd9a88-8418-11e3-0800-e6447370e946, 'tcp://0.0.0.0:4567') address 'tcp://node3:4567' points to own listening address, blacklisting
    140123 11:26:36 [Warning] WSREP: no nodes coming from prim view, prim not possible
    140123 11:26:36 [Note] WSREP: view(view_id(NON_PRIM,d4fd9a88-8418-11e3-0800-e6447370e946,1) memb {
    d4fd9a88-8418-11e3-0800-e6447370e946,
    } joined {
    } left {
    } partitioned {
    })
    140123 11:26:37 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
    140123 11:27:06 [Note] WSREP: view((empty))
    140123 11:27:06 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
    at gcomm/src/pc.cpp:connect():139
    140123 11:27:06 [ERROR] WSREP: gcs/src/gcs_core.c:gcs_core_open():195: Failed to open backend connection: -110 (Connection timed out)
    140123 11:27:06 [ERROR] WSREP: gcs/src/gcs.c:gcs_open():1290: Failed to open channel 'cluster_name' at 'gcomm://node3:4567': -110 (Connection timed out)
    140123 11:27:06 [ERROR] WSREP: gcs connect failed: Connection timed out
    140123 11:27:06 [ERROR] WSREP: wsrep::connect() failed: 6
    140123 11:27:06 [ERROR] Aborting

    140123 11:27:06 [Note] WSREP: Service disconnected.
    140123 11:27:07 [Note] WSREP: Some threads may fail to exit.
    140123 11:27:07 [Note] /usr/sbin/mysqld: Shutdown complete

    140123 11:27:07 mysqld_safe mysqld from pid file /var/lib/mysql/node3.pid ended
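
The two lines that seem to matter most in the log above are the "points to own listening address" warning and the "failed to reach primary view" error: the only peer this node tries to contact is itself, so it never finds a primary component. Below is a rough Python sketch that pulls those two messages out of an error log. The diagnose function and the finding labels are my own naming (assumptions), and the patterns are copied from this log only, so other Galera versions may word things differently.

```python
import re

# Patterns taken from the Case 1 log; the labels are hypothetical.
SELF_ADDRESS = re.compile(r"address '([^']+)' points to own listening address")
NO_PRIMARY = re.compile(r"failed to reach primary view")


def diagnose(log_text):
    """Return a list of (label, detail) findings for the two known messages."""
    findings = []
    match = SELF_ADDRESS.search(log_text)
    if match:
        findings.append(("self_address", match.group(1)))
    if NO_PRIMARY.search(log_text):
        findings.append(("no_primary", None))
    return findings


SAMPLE = """\
140123 11:26:33 [Warning] WSREP: (d4fd9a88-8418-11e3-0800-e6447370e946, 'tcp://0.0.0.0:4567') address 'tcp://node3:4567' points to own listening address, blacklisting
140123 11:27:06 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
"""

if __name__ == "__main__":
    for label, detail in diagnose(SAMPLE):
        print(label, detail)
```

If both findings show up together, the cluster address on that node probably lists only the node itself.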

    Case 2: I start the donor machine and the 2 other machines, with wsrep-cluster-address=gcomm://donor_node set on all machines, but I get the following error:

    140123 14:40:49 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
    140123 14:40:49 mysqld_safe WSREP: Running position recovery with --log_error=/tmp/tmp.IxqyTq9fv5
    140123 14:40:55 mysqld_safe WSREP: Recovered position 693ba945-81de-11e3-0800-277c5cbcb35b:0
    140123 14:40:55 [Note] WSREP: wsrep_start_position var submitted: '693ba945-81de-11e3-0800-277c5cbcb35b:0'
    140123 14:40:55 [Note] WSREP: Read nil XID from storage engines, skipping position init
    140123 14:40:55 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/libgalera_smm.so'
    140123 14:40:55 [Note] WSREP: wsrep_load(): Galera 2.5(r150) by Codership Oy <info@codership.com> loaded succesfully.
    140123 14:40:55 [Note] WSREP: Found saved state: 693ba945-81de-11e3-0800-277c5cbcb35b:-1
    140123 14:40:55 [Note] WSREP: Reusing existing '/var/lib/mysql//galera.cache'.
    140123 14:40:55 [Note] WSREP: Passing config to GCS: base_host = node1; base_port = 4567; cert.log_conflicts = no; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.size = 128M; gcs.fc_debug = 0; gcs.fc_factor = 1; gcs.fc_limit = 16; gcs.fc_master_slave = NO; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = NO; gmcast.listen_addr = tcp://0.0.0.0:4567; replicator.causal_read_timeout = PT30S; replicator.commit_order = 3
    140123 14:40:55 [Note] WSREP: Assign initial position for certification: 0, protocol version: -1
    140123 14:40:55 [Note] WSREP: wsrep_sst_grab()
    140123 14:40:55 [Note] WSREP: Start replication
    140123 14:40:55 [Note] WSREP: Setting initial position to 693ba945-81de-11e3-0800-277c5cbcb35b:0
    140123 14:40:55 [Note] WSREP: protonet asio version 0
    140123 14:40:55 [Note] WSREP: backend: asio
    140123 14:40:55 [Note] WSREP: GMCast version 0
    140123 14:40:55 [Note] WSREP: (fbbe9aa8-8433-11e3-0800-cbe33796b957, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
    140123 14:40:55 [Note] WSREP: (fbbe9aa8-8433-11e3-0800-cbe33796b957, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
    140123 14:40:55 [Note] WSREP: EVS version 0
    140123 14:40:55 [Note] WSREP: PC version 0
    140123 14:40:55 [Note] WSREP: gcomm: connecting to group 'cluster_name', peer '10.128.26.154:4567'
    140123 14:40:55 [Note] WSREP: declaring 3c8938b5-841d-11e3-0800-fac3799a7d7e stable
    140123 14:40:55 [Note] WSREP: Node 3c8938b5-841d-11e3-0800-fac3799a7d7e state prim
    140123 14:40:55 [Note] WSREP: view(view_id(PRIM,3c8938b5-841d-11e3-0800-fac3799a7d7e,2) memb {
    3c8938b5-841d-11e3-0800-fac3799a7d7e,
    fbbe9aa8-8433-11e3-0800-cbe33796b957,
    } joined {
    } left {
    } partitioned {
    })
    140123 14:40:56 [Note] WSREP: gcomm: connected
    140123 14:40:56 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
    140123 14:40:56 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
    140123 14:40:56 [Note] WSREP: Opened channel 'cluster_name'
    140123 14:40:56 [Note] WSREP: Waiting for SST to complete.
    140123 14:40:56 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 2
    140123 14:40:56 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
    140123 14:40:56 [Note] WSREP: STATE EXCHANGE: sent state msg: fc0c0787-8433-11e3-0800-6c8d004124e1
    140123 14:40:56 [Note] WSREP: STATE EXCHANGE: got state msg: fc0c0787-8433-11e3-0800-6c8d004124e1 from 0 (xts-priv-xtsinf-pp-percona3)
    140123 14:40:56 [Note] WSREP: STATE EXCHANGE: got state msg: fc0c0787-8433-11e3-0800-6c8d004124e1 from 1 (node1)
    140123 14:40:56 [Note] WSREP: Quorum results:
    version = 2,
    component = PRIMARY,
    conf_id = 1,
    members = 1/2 (joined/total),
    act_id = 669294,
    last_appl. = -1,
    protocols = 0/4/2 (gcs/repl/appl),
    group UUID = c9a557c5-7eb7-11e3-0800-b30c642d54fb
    140123 14:40:56 [Note] WSREP: Flow-control interval: [23, 23]
    140123 14:40:56 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 669294)
    140123 14:40:56 [Note] WSREP: State transfer required:
    Group state: c9a557c5-7eb7-11e3-0800-b30c642d54fb:669294
    Local state: 693ba945-81de-11e3-0800-277c5cbcb35b:0
    140123 14:40:56 [Note] WSREP: New cluster view: global state: c9a557c5-7eb7-11e3-0800-b30c642d54fb:669294, view# 2: Primary, number of nodes: 2, my index: 1, protocol version 2
    140123 14:40:56 [Warning] WSREP: Gap in state sequence. Need state transfer.
    140123 14:40:58 [Note] WSREP: Running: 'wsrep_sst_xtrabackup --role 'joiner' --address 'node1' --auth 'sstuser:PWD' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --parent '30106''
    nc: Address already in use
    tar: This does not look like a tar archive
    tar: Exiting with failure status due to previous errors
    140123 14:40:58 [Note] WSREP: Prepared SST request: xtrabackup|node1:4444/xtrabackup_sst
    140123 14:40:58 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
    140123 14:40:58 [Note] WSREP: Assign initial position for certification: 669294, protocol version: 2
    140123 14:40:58 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (693ba945-81de-11e3-0800-277c5cbcb35b) does not match group state UUID (c9a557c5-7eb7-11e3-0800-b30c642d54fb): 1 (Operation not permitted)
    at galera/src/replicator_str.cpp:prepare_for_IST():436. IST will be unavailable.
    140123 14:40:58 [ERROR] WSREP: Requesting state transfer failed: -113(No route to host)
    140123 14:40:58 [ERROR] WSREP: State transfer request failed unrecoverably: 113 (No route to host). Most likely it is due to inability to communicate with the cluster primary component. Restart required.
    140123 14:40:58 [Note] WSREP: Closing send monitor...
    140123 14:40:58 [Note] WSREP: Closed send monitor.
    140123 14:40:58 [Note] WSREP: gcomm: terminating thread
    140123 14:40:58 [Note] WSREP: gcomm: joining thread
    140123 14:40:58 [Note] WSREP: gcomm: closing backend
    WSREP_SST: [ERROR] Error while getting st data from donor node: 1, 2 (20140123 14:40:58.423)
    140123 14:40:58 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup --role 'joiner' --address 'node1' --auth 'sstuser:PWD' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --parent '30106': 32 (Broken pipe)
    140123 14:40:58 [ERROR] WSREP: Failed to read uuid:seqno from joiner script.
    140123 14:40:58 [ERROR] WSREP: SST failed: 32 (Broken pipe)
    140123 14:40:58 [ERROR] Aborting

    140123 14:40:59 [Note] WSREP: view(view_id(NON_PRIM,3c8938b5-841d-11e3-0800-fac3799a7d7e,2) memb {
    fbbe9aa8-8433-11e3-0800-cbe33796b957,
    } joined {
    } left {
    } partitioned {
    3c8938b5-841d-11e3-0800-fac3799a7d7e,
    })
    140123 14:40:59 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
    140123 14:40:59 [Note] WSREP: view((empty))
    140123 14:40:59 [Note] WSREP: gcomm: closed
    140123 14:40:59 [Note] WSREP: Flow-control interval: [16, 16]
    140123 14:40:59 [Note] WSREP: Received NON-PRIMARY.
    140123 14:40:59 [Note] WSREP: Shifting PRIMARY -> OPEN (TO: 669298)
    140123 14:40:59 [Note] WSREP: Received self-leave message.
    140123 14:40:59 [Note] WSREP: Flow-control interval: [0, 0]
    140123 14:40:59 [Note] WSREP: Received SELF-LEAVE. Closing connection.
    140123 14:40:59 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 669298)
    140123 14:40:59 [Note] WSREP: RECV thread exiting 0: Success
    140123 14:40:59 [Note] WSREP: recv_thread() joined.
    140123 14:40:59 [Note] WSREP: Closing slave action queue.
    140123 14:40:59 [Note] WSREP: /usr/sbin/mysqld: Terminated.
    140123 14:40:59 mysqld_safe mysqld from pid file /var/lib/mysql/node1.pid ended
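
In the log above, the joiner's local state UUID differs from the group state UUID, which is why IST is reported as unavailable and a full SST is attempted. The "nc: Address already in use" line also suggests that the port the joiner script tried to listen on was already taken. Here is a small Python sketch that extracts both facts from a log. The summarize_sst function and its result keys are hypothetical names of mine, and the patterns are based on this log only.

```python
import re

# Matches the "Group state:" and "Local state:" lines as printed in the Case 2 log.
STATE_RE = re.compile(r"(Group|Local) state:\s*([0-9a-f-]{36}):(-?\d+)")


def summarize_sst(log_text):
    """Compare group and local state and flag a busy listener port.

    Returns None when either state line is missing.
    """
    states = {kind: (uuid, int(seqno)) for kind, uuid, seqno in STATE_RE.findall(log_text)}
    if "Group" not in states or "Local" not in states:
        return None
    group_uuid, group_seqno = states["Group"]
    local_uuid, local_seqno = states["Local"]
    return {
        "same_history": group_uuid == local_uuid,  # False means IST cannot be used
        "group_seqno": group_seqno,
        "local_seqno": local_seqno,
        "listener_port_busy": "nc: Address already in use" in log_text,
    }


SAMPLE = """\
140123 14:40:56 [Note] WSREP: State transfer required:
Group state: c9a557c5-7eb7-11e3-0800-b30c642d54fb:669294
Local state: 693ba945-81de-11e3-0800-277c5cbcb35b:0
nc: Address already in use
"""

if __name__ == "__main__":
    print(summarize_sst(SAMPLE))
```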

    The configurations of the other two machines are as follows:


    wsrep_provider_options="gmcast.listen_addr=tcp://0.0.0.0:4567"

    wsrep_cluster_address=gcomm://node3:4567

    datadir=/var/lib/mysql

    user=mysql

    # Path to Galera library
    wsrep_provider=/usr/lib64/libgalera_smm.so

    # In order for Galera to work correctly binlog format should be ROW
    binlog_format=ROW

    # MyISAM storage engine has only experimental support
    default_storage_engine=InnoDB

    # This is a recommended tuning variable for performance
    innodb_locks_unsafe_for_binlog=1

    # This changes how InnoDB autoincrement locks are managed and is a requirement for Galera
    innodb_autoinc_lock_mode=2

    # Node #1 address
    wsrep_node_address=@ip_node

    # SST method
    wsrep_sst_method=xtrabackup

    # Cluster name
    wsrep_cluster_name=cluster_name
    # Authentication for SST method
    wsrep_sst_auth="sstuser:MDP"
    wsrep_sst_donor=node3

    # Don't listen on a TCP/IP port at all. This can be a security enhancement,
    # if all processes that need to connect to mysqld run on the same host.
    # All interaction with mysqld must be made via Unix sockets or named pipes.
    # Note that using this option without enabling named pipes on Windows
    # (via the "enable-named-pipe" option) will render mysqld useless!
    #
    #skip-networking

    # Replication Master Server (default)
    # binary logging is required for replication
    log-bin=mysql-bin