two-node cluster hangs when primary node is gone (ignore-sb=true)

  • two-node cluster hangs when primary node is gone (ignore-sb=true)

    Hi,
    I have a two-node Percona XtraDB Cluster (v5.5.33) with Galera. Galera has been configured to ignore split-brain (pc.ignore_sb = yes).

    When I run failover tests on these nodes, I see strange behaviour that I can't get a grip on.

    The following is the scenario:
    Node2 was the cluster creator. At the start of this scenario it has a wsrep_cluster_address of "gcomm://", and node1 has "gcomm://10.0.100.2".
    Node1 (10.0.100.1) and node2 (10.0.100.2) are both running, and only node1 is receiving data.
    When I reboot node1, node2 detects this and happily receives data. No problem here. Node1 comes back up and performs IST to rejoin the cluster.
    All still well.
    Several minutes after the IST of node1 has finished and the node is ready, I stop node2. As soon as node2 starts shutting down Percona, all transactions on node1 hang.
    Running "show status like 'wsrep%';" shows that node1 'believes' it is still part of the cluster and does not seem to detect that the second node is gone.
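
    To make "still believes it is part of the cluster" concrete, here is a minimal Python sketch of how the wsrep status output could be read. It assumes the rows from "show status like 'wsrep%';" have already been loaded into a dict of strings. The function name, the returned labels and the expected_size parameter are made up for illustration; only the status variable names (wsrep_connected, wsrep_cluster_status, wsrep_ready, wsrep_cluster_size) come from Galera.

```python
def describe_wsrep_state(status, expected_size=2):
    """Summarise a node's view of the cluster from its wsrep status variables.

    `status` maps variable names to the string values returned by
    SHOW STATUS LIKE 'wsrep%'. The labels returned here are illustrative
    only, they are not Galera terminology.
    """
    connected = status.get("wsrep_connected", "OFF").upper() == "ON"
    cluster_status = status.get("wsrep_cluster_status", "")
    ready = status.get("wsrep_ready", "OFF").upper() == "ON"
    cluster_size = int(status.get("wsrep_cluster_size", "0"))

    if not connected:
        return "disconnected"        # no link to the group at all
    if cluster_status != "Primary":
        return "non-primary"         # node refuses writes in this state
    if not ready:
        return "primary-not-ready"   # in the primary component, not serving yet
    if cluster_size < expected_size:
        return "primary-degraded"    # node has noticed that a peer is gone
    return "primary-full"            # node still counts every peer as present
```

    In the situation described above, node1 would report "primary-full" even though node2 has been stopped; a node that had noticed the loss would be expected to report "primary-degraded" (or "non-primary" without pc.ignore_sb).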

    All tables are InnoDB and the server is under high load: several TB of data, with the InnoDB buffer pool size configured at 60 GB.

    I also tried "SET GLOBAL wsrep_cluster_address='gcomm://';" to force node2 to be the cluster creator, but that does not solve the issue described above.

    Why is the node hanging? And, more importantly, how can I fix this?

    Many Thanks!

    My my.cnf (config of node1; node2 is the same apart from the IP addresses, of course) looks like this:
    # GENERAL #
    user = mysql
    default-storage-engine = innodb
    socket = /data/mysql/mysql.sock
    pid-file = /data/mysql/mysql.pid

    slow-query-log = ON
    log-queries-not-using-indexes = ON
    innodb_print_all_deadlocks = ON

    max_allowed_packet = 120M
    max_connect_errors = 2000000000000
    skip-name-resolve

    sysdate-is-now = 1
    innodb = FORCE
    innodb-strict-mode = 1

    datadir = /data/mysql
    tmpdir = /data/mysql-tmp

    log-bin = /data/mysql/mysql-bin
    expire-logs-days = 5
    sync-binlog = 1

    log-slave-updates = 1
    relay-log = /data/mysql/relay-bin
    slave-net-timeout = 60
    sync-master-info = 1
    sync-relay-log = 1
    sync-relay-log-info = 1

    tmp-table-size = 32M
    max-heap-table-size = 32M
    query-cache-type = 0
    query-cache-size = 0

    max-connections = 1000
    thread-cache-size = 50
    open-files-limit = 65535
    table-definition-cache = 1024
    table-open-cache = 1000

    innodb-flush-method = O_DIRECT
    innodb-log-files-in-group = 2
    innodb-log-file-size = 512M
    innodb-flush-log-at-trx-commit = 1
    innodb-file-per-table = 1

    innodb-buffer-pool-size = 60G

    server-id = 1
    binlog_format=ROW

    innodb_autoinc_lock_mode=2
    innodb_locks_unsafe_for_binlog=1
    bind-address=0.0.0.0

    wsrep_provider="/usr/lib/libgalera_smm.so"
    wsrep_provider_options="pc.ignore_sb = yes; evs.keepalive_period = PT1S; evs.inactive_check_period = PT1S; evs.suspect_timeout = PT5S; evs.inactive_timeout = PT10S; evs.install_timeout = PT10S; gcache.size=32G"

    wsrep_cluster_name="percona_cluster"
    wsrep_cluster_address=gcomm://10.0.100.2

    wsrep_node_name=node1
    wsrep_node_address=10.0.100.1

    wsrep_slave_threads=16

    wsrep_certify_nonPK=1
    wsrep_max_ws_rows=131072
    wsrep_max_ws_size=1073741824
    wsrep_debug=0
    wsrep_convert_LOCK_to_trx=0
    wsrep_retry_autocommit=1
    wsrep_auto_increment_control=1
    wsrep_drupal_282555_workaround=0

    wsrep_causal_reads=0
    wsrep_notify_cmd=

    wsrep_sst_method=xtrabackup
    wsrep_sst_auth=mysql_sst:*********

    # Desired SST donor name.
    #wsrep_sst_donor=

    # Reject client queries when donating SST (false)
    #wsrep_sst_donor_rejects_queries=0

    # Protocol version to use
    # wsrep_protocol_version=
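
    When checking how long node1 should take to notice that node2 is gone, it can help to see the evs.* timeouts from wsrep_provider_options as plain seconds. The following is a rough Python sketch, not part of Galera: the helper names are hypothetical, and the duration parser only handles the simple PT…H…M…S form used in the config above.

```python
import re

# Matches simple ISO 8601 time durations such as PT1S, PT1M30S or PT0.5S.
_DURATION = re.compile(
    r"^PT(?:(\d+(?:\.\d+)?)H)?(?:(\d+(?:\.\d+)?)M)?(?:(\d+(?:\.\d+)?)S)?$"
)


def parse_provider_options(options):
    """Split a wsrep_provider_options string into a {key: value} dict."""
    result = {}
    for item in options.split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition("=")
        result[key.strip()] = value.strip()
    return result


def duration_seconds(value):
    """Convert a PT...H...M...S duration to seconds (as a float)."""
    match = _DURATION.match(value)
    if not match or not any(match.groups()):
        raise ValueError("unsupported duration: %r" % value)
    hours, minutes, seconds = (float(g) if g else 0.0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def evs_durations(options):
    """Return every evs.* option holding a PT duration, in seconds."""
    parsed = parse_provider_options(options)
    return {
        key: duration_seconds(value)
        for key, value in parsed.items()
        if key.startswith("evs.") and value.startswith("PT")
    }
```

    Fed the wsrep_provider_options string from the config above, evs_durations() gives 5.0 for evs.suspect_timeout and 10.0 for evs.inactive_timeout, so the failure-detection window configured here is on the order of seconds rather than minutes.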

  • #2
    If you bootstrapped node2, try starting node2 again first (let it start completely) and then node1.
    How did you come to the conclusion that it was hanging? You say it's under high load, and with InnoDB you can expect slowness.
    What does the MySQL error log say?
    Last edited by madhusudan; 11-21-2013, 07:14 AM.