Trouble Automating Snapshots & Restoration (EC2+RightScale)


  • Trouble Automating Snapshots & Restoration (EC2+RightScale)

    Now that I have Percona XtraDB up and working with CentOS 5.6, I've moved on to the stage of automating the cluster configuration within our RightScale environment on top of EC2. I need to be able to restore a node from a snapshot and have it rejoin the cluster.

    PROBLEM: If I restore from an EBS snapshot, having set wsrep_cluster_address to the address of the other node (in a 2 node cluster) before starting mysql, mysql always fails to start and gives the following error:

    Failed to open channel 'sentry' at 'gcomm://sentry2.ourdomain.com': -110 (Connection timed out)

    I can telnet to the address from sentry1 and a connection is established, so the error message is no doubt misleading and masking a different issue.


    I apologize in advance for the length of the email. I've tried to include everything relevant to the configuration.


    I'm following the advice presented in this forum topic (a shell sketch of the whole procedure follows the list):
    https://groups.google.com/forum/?fromgroups#!topic/codership-team/H1XqY5T8Cgo

    1) Lock all databases/tables & flush to disk
    2) Write grastate.dat from SHOW STATUS (wsrep_local_state_uuid, wsrep_last_committed)
    3) xfs_freeze the filesystem
    4) execute EBS snapshot
    5) unfreeze filesystem
    6) unlink grastate.dat file
    7) free all database locks
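
    A minimal shell sketch of that procedure, based only on the steps above. The datadir path, root password variable, EBS volume ID, and the ec2-create-snapshot tool are assumptions; the key subtlety is that FLUSH TABLES WITH READ LOCK only holds while its client session stays open:

    #!/bin/bash
    # Sketch only: paths, credentials, and the volume ID are assumptions.
    DATADIR=/var/lib/mysql        # assumed EBS-backed datadir mount point
    VOLUME_ID=vol-xxxxxxxx        # assumed EBS volume backing $DATADIR

    # 1) Lock & flush. The lock lives only as long as this session, so keep
    #    one mysql client open on fd 3 until step 7.
    exec 3> >(mysql -uroot -p"$MYSQL_ROOT_PW")
    echo "FLUSH TABLES WITH READ LOCK;" >&3
    sleep 2   # crude: give the lock time to be acquired

    # 2) Write grastate.dat from the current status so the snapshot carries it
    UUID=$(mysql -N -uroot -p"$MYSQL_ROOT_PW" \
      -e "SHOW STATUS LIKE 'wsrep_local_state_uuid'" | awk '{print $2}')
    SEQNO=$(mysql -N -uroot -p"$MYSQL_ROOT_PW" \
      -e "SHOW STATUS LIKE 'wsrep_last_committed'" | awk '{print $2}')
    printf '# GALERA saved state\nversion: 2.1\nuuid:    %s\nseqno:   %s\ncert_index:\n' \
      "$UUID" "$SEQNO" > "$DATADIR/grastate.dat"

    # 3) Freeze the filesystem so the block-level snapshot is consistent
    xfs_freeze -f "$DATADIR"

    # 4) Execute the EBS snapshot (assumes the EC2 API tools are installed)
    ec2-create-snapshot "$VOLUME_ID" -d "galera datadir snapshot"

    # 5) Unfreeze the filesystem
    xfs_freeze -u "$DATADIR"

    # 6) Unlink grastate.dat so the live node is not left with a stale copy
    unlink "$DATADIR/grastate.dat"

    # 7) Free all locks by ending the session that holds them
    echo "UNLOCK TABLES;" >&3
    exec 3>&-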

    The galera.cache file has been moved to ephemeral storage, not on the EBS volume:
    wsrep_provider_options="gcache.dir=/mnt/mysql-binlogs; gcache.size=2097152000"
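
    (A quick sanity check, assuming standard tools, that the ring buffer really landed on ephemeral storage:)

    ls -lh /mnt/mysql-binlogs/galera.cache   # the gcache ring buffer
    df -h /mnt/mysql-binlogs                 # should be the ephemeral mount, not EBS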


    It seems that this all works. I've confirmed that the contents of grastate.dat match the SHOW STATUS output before shutdown on an inactive test cluster.

    Example of generated grastate.dat:

    # GALERA saved state
    version: 2.1
    uuid:    d2d0ee82-b5ac-11e1-0800-b938963402d3
    seqno:   1
    cert_index:


    Here is the configuration I'm using to bootstrap a new galera cluster reference node:

    [mysqld]
    wsrep_provider=/usr/lib/libgalera_smm.so
    wsrep_provider_options="gcache.dir=/mnt/mysql-binlogs; gcache.size=2097152000"
    wsrep_cluster_address=gcomm://
    wsrep_slave_threads=2
    wsrep_cluster_name=sentry
    wsrep_node_address=sentry1.ourdomain.com
    wsrep_sst_method=rsync
    wsrep_node_name=sentry1
    binlog_format=ROW
    innodb_locks_unsafe_for_binlog=1
    innodb_autoinc_lock_mode=2


    Here is the configuration I'm using on the second node:


    [mysqld]
    wsrep_provider=/usr/lib/libgalera_smm.so
    wsrep_provider_options="gcache.dir=/mnt/mysql-binlogs; gcache.size=2097152000"
    #wsrep_cluster_address=gcomm://sentry1.ourdomain.com
    wsrep_slave_threads=2
    wsrep_cluster_name=sentry
    wsrep_node_address=sentry1.ourdomain.com
    wsrep_sst_method=rsync
    wsrep_node_name=sentry2
    binlog_format=ROW
    innodb_locks_unsafe_for_binlog=1
    innodb_autoinc_lock_mode=2


    At this point, the cluster is working fine.


    | wsrep_cluster_size   | 2       |
    | wsrep_cluster_status | Primary |
    | wsrep_connected      | ON      |
    | wsrep_ready          | ON      |




    To restore the "sentry1" node after I launch a new instance to replace it, I replace the wsrep_cluster_address with "gcomm://sentry2.ourdomain.com" in the mysql configuration before starting the server. This is where I run into problems, probably as a result of my lack of deep understanding of Galera clustering.
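
    For concreteness, the pre-start swap is roughly this (the /etc/my.cnf path is an assumption; adjust to your RightScale template):

    sed -i 's|^wsrep_cluster_address=.*|wsrep_cluster_address=gcomm://sentry2.ourdomain.com|' /etc/my.cnf
    service mysqld start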


    Error log below from sentry1 after a relaunch using EBS snapshot:


    120613 16:27:41 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
    120613 16:27:41 [Note] Flashcache bypass: disabled
    120613 16:27:41 [Note] Flashcache setup error is : ioctl failed
    120613 16:27:41 [Note] WSREP: Read nil XID from storage engines, skipping position init
    120613 16:27:41 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib/libgalera_smm.so'
    120613 16:27:41 [Note] WSREP: wsrep_load(): Galera 2.1dev(r112) by Codership Oy loaded succesfully.
    120613 16:27:41 [Note] WSREP: Found saved state: d2d0ee82-b5ac-11e1-0800-b938963402d3:1
    120613 16:27:41 [Note] WSREP: Preallocating 2097153312/2097153312 bytes in '/mnt/mysql-binlogs/galera.cache'...
    120613 16:28:47 [Note] WSREP: Passing config to GCS: base_host = sentry1.ourdomain.com; gcache.dir = /mnt/mysql-binlogs; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /mnt/mysql-binlogs/galera.cache; gcache.page_size = 128M; gcache.size = 2097152000; gcs.fc_debug = 0; gcs.fc_factor = 0.5; gcs.fc_limit = 16; gcs.fc_master_slave = NO; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 2147483647; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = NO; replicator.causal_read_timeout = PT30S; replicator.commit_order = 3
    120613 16:28:48 [Note] WSREP: Assign initial position for certification: 1, protocol version: -1
    120613 16:28:48 [Note] WSREP: wsrep_sst_grab()
    120613 16:28:48 [Note] WSREP: Start replication
    120613 16:28:48 [Note] WSREP: Setting initial position to d2d0ee82-b5ac-11e1-0800-b938963402d3:1
    120613 16:28:48 [Note] WSREP: protonet asio version 0
    120613 16:28:48 [Note] WSREP: backend: asio
    120613 16:28:48 [Note] WSREP: GMCast version 0
    120613 16:28:48 [Note] WSREP: (86f87d10-b5af-11e1-0800-f871301c8bc4, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
    120613 16:28:48 [Note] WSREP: (86f87d10-b5af-11e1-0800-f871301c8bc4, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
    120613 16:28:48 [Note] WSREP: EVS version 0
    120613 16:28:48 [Note] WSREP: PC version 0
    120613 16:28:48 [Note] WSREP: gcomm: connecting to group 'sentry', peer 'sentry2.ourdomain.com:'
    120613 16:28:50 [Note] WSREP: declaring bfd074ba-b5ac-11e1-0800-bb10122d629d stable
    120613 16:28:50 [Note] WSREP: view(view_id(NON_PRIM,86f87d10-b5af-11e1-0800-f871301c8bc4,5) memb {
    86f87d10-b5af-11e1-0800-f871301c8bc4,
    bfd074ba-b5ac-11e1-0800-bb10122d629d,
    } joined {
    } left {
    } partitioned {
    d2c537e0-b5ac-11e1-0800-ddbf4b2d9ba6,
    })
    120613 16:28:50 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
    120613 16:29:20 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
    at gcomm/src/pc.cpp:connect():148
    120613 16:29:20 [ERROR] WSREP: gcs/src/gcs_core.c:gcs_core_open():195: Failed to open backend connection: -110 (Connection timed out)
    120613 16:29:20 [ERROR] WSREP: gcs/src/gcs.c:gcs_open():1290: Failed to open channel 'sentry' at 'gcomm://sentry2.ourdomain.com': -110 (Connection timed out)
    120613 16:29:20 [ERROR] WSREP: gcs connect failed: Connection timed out
    120613 16:29:20 [ERROR] WSREP: wsrep::connect() failed: 6
    120613 16:29:20 [ERROR] Aborting
    120613 16:29:20 [Note] WSREP: Service disconnected.
    120613 16:29:21 [Note] WSREP: Some threads may fail to exit.
    120613 16:29:21 [Note] /usr/sbin/mysqld: Shutdown complete
    120613 16:29:21 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended


    Here I show that sentry1 has no problem connecting to sentry2:


    telnet sentry2.ourdomain.com 4567
    Trying 10.201.3.218...
    Connected to sentry2.ourdomain.com (10.201.3.218).
    Escape character is '^]'.
    $????t????-b?jFI???????]


    Meanwhile, on sentry2, I see the following in the logs:


    120613 16:34:39 [Note] WSREP: (bfd074ba-b5ac-11e1-0800-bb10122d629d, 'tcp://0.0.0.0:4567') reconnecting to 86f87d10-b5af-11e1-0800-f871301c8bc4 (tcp://10.206.246.66:4567), attempt 240
    120613 16:35:18 [Note] WSREP: (bfd074ba-b5ac-11e1-0800-bb10122d629d, 'tcp://0.0.0.0:4567') reconnecting to 86f87d10-b5af-11e1-0800-f871301c8bc4 (tcp://10.206.246.66:4567), attempt 270
    120613 16:35:57 [Note] WSREP: (bfd074ba-b5ac-11e1-0800-bb10122d629d, 'tcp://0.0.0.0:4567') reconnecting to 86f87d10-b5af-11e1-0800-f871301c8bc4 (tcp://10.206.246.66:4567), attempt 300


    This makes sense, however, since sentry1 will not start.


    Software packages:

    Percona-XtraDB-Cluster-shared-5.5.23-23.5.333.rhel5
    Percona-XtraDB-Cluster-server-5.5.23-23.5.333.rhel5
    Percona-XtraDB-Cluster-devel-5.5.23-23.5.333.rhel5
    Percona-XtraDB-Cluster-client-5.5.23-23.5.333.rhel5
    Percona-XtraDB-Cluster-galera-2.0-1.112.rhel5
    libstdc++44-devel-4.4.4-13.el5
    gcc-c++-4.1.2-50.el5
    gcc44-c++-4.4.4-13.el5
    libstdc++-4.1.2-50.el5
    libstdc++-devel-4.1.2-50.el5
    gcc-objc++-4.1.2-50.el5
    compat-libstdc++-296-2.96-138




    Any help is much appreciated!

    Thanks,

    Erik Osterman

  • #2
    I should also note that, through the process of relaunching, "sentry1" will obtain a new IP address. When shutting down "sentry1" to relaunch, nothing more than a "service mysqld stop" is executed. Nothing is done to leave the cluster in an orderly fashion.

    -Erik



    • #3
      Hi Erik,

      I think you have configured everything right (except wsrep_node_address, which is the same on both nodes) and the error you're getting is clearly unrelated to configuration. Do you have a firewall configured on either of the nodes? Does sentry2 correctly resolve sentry1's IP after sentry1 is restarted?
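
      (Quick ways to check both, assuming standard tools:)

      iptables -L -n | grep -E '4567|DROP|REJECT'   # on each node: anything dropping Galera traffic?
      host sentry1.ourdomain.com                    # on sentry2: does this resolve to sentry1's NEW IP?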

      BTW, "service mysqld stop" IS leaving the cluster in an orderly fashion.



      • #4
        I am getting a similar error.


        120917 19:59:04 [Note] WSREP: gcomm: connecting to group 'trimethylxanthine', peer '10.10.20.62:'
        120917 19:59:07 [Warning] WSREP: no nodes coming from prim view, prim not possible
        120917 19:59:07 [Note] WSREP: view(view_id(NON_PRIM,0852e93f-00d4-11e2-0800-98eb2abc5831,1) memb {
        0852e93f-00d4-11e2-0800-98eb2abc5831,
        } joined {
        } left {
        } partitioned {
        })
        120917 19:59:08 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
        120917 19:59:37 [Note] WSREP: view((empty))
        120917 19:59:37 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
        at gcomm/src/pc.cpp:connect():157
        120917 19:59:37 [ERROR] WSREP: gcs/src/gcs_core.c:gcs_core_open():195: Failed to open backend connection: -110 (Connection timed out)
        120917 19:59:37 [ERROR] WSREP: gcs/src/gcs.c:gcs_open():1290: Failed to open channel 'trimethylxanthine' at 'gcomm://10.10.20.62': -110 (Connection timed out)
        120917 19:59:37 [ERROR] WSREP: gcs connect failed: Connection timed out
        120917 19:59:37 [ERROR] WSREP: wsrep::connect() failed: 6
        120917 19:59:37 [ERROR] Aborting

        *************************************

        I have a 3-node cluster and the configuration is done correctly. Currently only the primary node is working properly; on the other two, when I try to start mysqld, it gives the error displayed above. Initially I got a lot of "conflicts" errors while installing Percona XtraDB Cluster on these two nodes. I also had to change some file permissions; after that, all 3 nodes started and replicated once. But once I rebooted one of the machines (to check whether a db created on the other nodes would replicate), the mysqld daemon started giving the error displayed above.

        On the primary node:


        wsrep_provider=/usr/lib/libgalera_smm.so
        wsrep_cluster_address=gcomm://
        wsrep_slave_threads=2

        On the other two nodes:


        wsrep_provider=/usr/lib/libgalera_smm.so
        wsrep_cluster_address=gcomm://10.10.20.62
        wsrep_provider_options="gmcast.listen_addr=tcp://0.0.0.0:4567"


        iptables/selinux are off, but I'm still getting the same error on the two nodes. Please assist.



        • #5
          1) There are no primary/secondary nodes in the cluster. It is single-rank.
          2) Once you restart your "primary" node, wsrep_cluster_address=gcomm:// makes it create a new cluster once again instead of connecting to the other nodes. Most likely this is the cause of the trouble. Never leave wsrep_cluster_address=gcomm:// in my.cnf after you have started the node. Set it to point to another node.
          3) Apparently nobody is listening at 10.10.20.62:4567, or rather you have a firewall dropping the packets (hence the connection timeout instead of connection refused). Check that you can connect to this address from the "other nodes" by telnet.
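
          For example (sketch, standard tools assumed):

          netstat -ltnp | grep 4567     # on 10.10.20.62: is mysqld listening on the Galera port?
          telnet 10.10.20.62 4567       # from each other node: connect, or time out?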



          • #6
            Thanks. Although the issue is resolved, I still have some doubts.

            1. There will be one node in the cluster to which the others connect, so that becomes the primary node, right? Unless this node is started, the other nodes' mysqld won't start and won't join the cluster.

            2. I tested by restarting all nodes (primary/secondary) and the issue seems to have been resolved. I didn't change wsrep_cluster_address=gcomm:// on the primary node 10.10.20.62. But it gave a log like this after restart:

            120924 13:05:40 [Note] WSREP: New cluster view: global state: 420da608-061a-11e2-0800-f990110805d6:0, view# -1: non-Primary, number of nodes: 0, my index: -1, protocol version 2
            120924 13:05:40 [Note] WSREP: recv_thread() joined.
            120924 13:05:40 [Note] WSREP: Closing slave action queue.

            You are right, it is trying to create a new cluster. I had to kill the process and then start it again; then it joined the existing cluster.

            Below is the my.cnf file from 10.10.20.62 (primary node).

            [mysqld]
            datadir=/usr/lib/mysql
            socket=/var/lib/mysql/mysql.sock
            user=mysql

            binlog-format=ROW
            wsrep_provider=/usr/lib/libgalera_smm.so
            wsrep_cluster_address=gcomm://
            wsrep_slave_threads=2
            wsrep_cluster_name=trimethylxanthine
            wsrep_sst_method=rsync
            wsrep_node_name=node1
            innodb_locks_unsafe_for_binlog=1
            innodb_autoinc_lock_mode=2


            server-id=1

            log-bin=Xeon

            # Disabling symbolic-links is recommended to prevent assorted security risks
            #symbolic-links=0

            [mysqld_safe]
            wsrep_url= gcomm://10.10.20.62:4567,gcomm://10.10.20.107:4567,gcomm://10.10.20.12:4567,gcomm://

            log-error=/var/log/mysqld.log
            pid-file=/var/run/mysqld/mysqld.pid
            ##################################
            What do you mean by setting the primary node's gcomm address to some other node, like this:

            wsrep_cluster_address=gcomm://10.10.20.12

            But when I set this and restart the primary node, mysqld gets terminated.

            120924 14:28:24 [ERROR] WSREP: gcs/src/gcs_group.c:gcs_group_handle_join_msg():712: Will never receive state. Need to abort.
            120924 14:28:24 [Note] WSREP: gcomm: terminating thread
            120924 14:28:24 [Note] WSREP: gcomm: joining thread
            120924 14:28:24 [Note] WSREP: gcomm: closing backend
            120924 14:28:24 [Note] WSREP: view(view_id(NON_PRIM,56249437-0618-11e2-0800-c65f1c3b2d27,17) memb {
            fd0c667e-0625-11e2-0800-7828ae445b92,
            } joined {
            } left {
            } partitioned {
            56249437-0618-11e2-0800-c65f1c3b2d27,
            d7963bd3-0617-11e2-0800-2286b9bc8378,
            })
            120924 14:28:24 [Note] WSREP: view((empty))
            120924 14:28:24 [Note] WSREP: gcomm: closed
            120924 14:28:24 [Note] WSREP: /usr/sbin/mysqld: Terminated.

            When I change it back to the previous state, i.e.

            wsrep_cluster_address=gcomm://

            only then does mysqld start, and the db also gets replicated.

            ##################################

            my.cnf file on other two nodes:

            [mysqld]
            datadir=/mnt/data
            socket=/var/lib/mysql/mysql.sock
            user=mysql

            binlog_format=ROW
            wsrep_provider=/usr/lib/libgalera_smm.so
            wsrep_cluster_address=gcomm://10.10.20.62
            wsrep_provider_options="gmcast.listen_addr=tcp://0.0.0.0:4567"

            wsrep_slave_threads=2
            wsrep_cluster_name=trimethylxanthine
            wsrep_sst_method=rsync
            wsrep_node_name=node3
            innodb_locks_unsafe_for_binlog=1
            innodb_autoinc_lock_mode=2

            server-id=3
            log-bin=asima

            # Disabling symbolic-links is recommended to prevent assorted security risks
            #symbolic-links=0

            [mysqld_safe]
            wsrep_urls= gcomm://10.10.20.62:4567,gcomm://10.10.20.107:4567,gcomm://10.10.20.12:4567,gcomm://


            log-error=/var/log/mysqld.log
            pid-file=/var/run/mysqld/mysqld.pid
            ########################################

            Can you provide me the ideal my.cnf configuration for the three cluster nodes?

            3. This was resolved. I checked with telnet, and ports 3306 and 4567 are open on all three nodes. Can you tell me if both 3306 and 4567 are mandatory on all nodes, or will only one do? I think 4567 is used for Galera communication within the cluster and 3306 is used to write to the db, so both are required.



            • #7
              mjp.career.08 wrote on Mon, 24 September 2012 12:07
              Thanks, although the issue is resolved , I still have some doubts.

              1. There will be one node in the cluster to which the others connect, so that becomes the primary node, right?
              In the cluster each node is connected to every other node.

              Quote:
              Unless this node is started, the other nodes' mysqld won't start and won't join the cluster.
              Of course some node has to be the first to start when there are no other nodes running. That's why for this node you have to set

              wsrep_cluster_address=gcomm://
              since you have no other nodes to connect to yet. It makes it the first node in a cluster. But it does not make it any different from the other cluster members.
              And that's why, when you restart it with the same setting, it won't connect to its old mates - because you're explicitly telling it not to connect to anybody.
              If you have 10 nodes in the cluster, you can connect to ANY of them.
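
              In other words, the lifecycle looks roughly like this (sketch; the peer address and config path are assumptions):

              # First-ever start of the very first node of a brand-new cluster:
              #   wsrep_cluster_address=gcomm://
              # Any time after the cluster exists, point the node at a running
              # member before restarting it, e.g.:
              sed -i 's|^wsrep_cluster_address=.*|wsrep_cluster_address=gcomm://10.10.20.107|' /etc/my.cnf
              service mysqld restart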

              Quote:
              2. I tested by restarting all nodes (primary/secondary) and the issue seems to have been resolved. I didn't change wsrep_cluster_address=gcomm:// on the primary node 10.10.20.62. But it gave a log like this after restart:

              120924 13:05:40 [Note] WSREP: New cluster view: global state: 420da608-061a-11e2-0800-f990110805d6:0, view# -1: non-Primary, number of nodes: 0, my index: -1, protocol version 2
              120924 13:05:40 [Note] WSREP: recv_thread() joined.
              120924 13:05:40 [Note] WSREP: Closing slave action queue.
              This is from the shutdown sequence. Every node prints that on shutdown.

              Quote:
              You are right, it is trying to create a new cluster. I had to kill the process and then start it again; then it joined the existing cluster.

              Below is the my.cnf file from 10.10.20.62 (primary node).

              [mysqld]
              datadir=/usr/lib/mysql
              socket=/var/lib/mysql/mysql.sock
              user=mysql

              binlog-format=ROW
              wsrep_provider=/usr/lib/libgalera_smm.so
              wsrep_cluster_address=gcomm://
              wsrep_slave_threads=2
              wsrep_cluster_name=trimethylxanthine
              wsrep_sst_method=rsync
              wsrep_node_name=node1
              innodb_locks_unsafe_for_binlog=1
              innodb_autoinc_lock_mode=2


              server-id=1

              log-bin=Xeon

              # Disabling symbolic-links is recommended to prevent assorted security risks
              #symbolic-links=0

              [mysqld_safe]
              wsrep_url= gcomm://10.10.20.62:4567,gcomm://10.10.20.107:4567,gcomm://10.10.20.12:4567,gcomm://

              log-error=/var/log/mysqld.log
              pid-file=/var/run/mysqld/mysqld.pid
              ##################################
              What do you mean by setting the primary node's gcomm address to some other node, like this:

              wsrep_cluster_address=gcomm://10.10.20.12
              Yes, or, like in the
              [mysqld_safe] wsrep_urls=...
              section above (it overrides wsrep_cluster_address).

              Quote:
              But when I set this and restart the primary node, mysqld gets terminated.
              Yes, you have a misconfiguration somewhere and the donor can't send a state snapshot to that node. You have to look in the logs on the donor node for clues. Which node was the donor is stated in the log above the snippet you posted below.
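
              (For example, on the donor, assuming the standard log location:)

              grep -iE 'donor|sst' /var/log/mysqld.log | tail -40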

              Quote:
              120924 14:28:24 [ERROR] WSREP: gcs/src/gcs_group.c:gcs_group_handle_join_msg():712: Will never receive state. Need to abort.
              120924 14:28:24 [Note] WSREP: gcomm: terminating thread
              120924 14:28:24 [Note] WSREP: gcomm: joining thread
              120924 14:28:24 [Note] WSREP: gcomm: closing backend
              120924 14:28:24 [Note] WSREP: view(view_id(NON_PRIM,56249437-0618-11e2-0800-c65f1c3b2d27,17) memb {
              fd0c667e-0625-11e2-0800-7828ae445b92,
              } joined {
              } left {
              } partitioned {
              56249437-0618-11e2-0800-c65f1c3b2d27,
              d7963bd3-0617-11e2-0800-2286b9bc8378,
              })
              120924 14:28:24 [Note] WSREP: view((empty))
              120924 14:28:24 [Note] WSREP: gcomm: closed
              120924 14:28:24 [Note] WSREP: /usr/sbin/mysqld: Terminated.

              When I change it back to the previous state, i.e.

              wsrep_cluster_address=gcomm://

              only then does mysqld start, and the db also gets replicated.
              This is because in that case this node totally ignores the previous state of the cluster.

              Quote:
              ##################################

              my.cnf file on other two nodes:

              [mysqld]
              datadir=/mnt/data
              socket=/var/lib/mysql/mysql.sock
              user=mysql

              binlog_format=ROW
              wsrep_provider=/usr/lib/libgalera_smm.so
              wsrep_cluster_address=gcomm://10.10.20.62
              wsrep_provider_options="gmcast.listen_addr=tcp://0.0.0.0:4567"

              wsrep_slave_threads=2
              wsrep_cluster_name=trimethylxanthine
              wsrep_sst_method=rsync
              wsrep_node_name=node3
              innodb_locks_unsafe_for_binlog=1
              innodb_autoinc_lock_mode=2

              server-id=3
              log-bin=asima

              # Disabling symbolic-links is recommended to prevent assorted security risks
              #symbolic-links=0

              [mysqld_safe]
              wsrep_urls= gcomm://10.10.20.62:4567,gcomm://10.10.20.107:4567,gcomm://10.10.20.12:4567,gcomm://


              log-error=/var/log/mysqld.log
              pid-file=/var/run/mysqld/mysqld.pid
              ########################################

              Can you provide me the ideal my.cnf configuration for the three cluster nodes?
              If there existed an "ideal configuration", then it would have been hardcoded into the code and no external configuration would be necessary.

              Quote:
              3. This was resolved. I checked with telnet, and ports 3306 and 4567 are open on all three nodes. Can you tell me if both 3306 and 4567 are mandatory on all nodes, or will only one do? I think 4567 is used for Galera communication within the cluster and 3306 is used to write to the db, so both are required.
              Yes, both are required.
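
              For example, with iptables (sketch; rsync SST may additionally need its own port, 4444 by default):

              iptables -A INPUT -p tcp --dport 3306 -j ACCEPT   # MySQL clients
              iptables -A INPUT -p tcp --dport 4567 -j ACCEPT   # Galera group communication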

              If you're still confused, I'd advise you to attend the Percona Live NY or Percona Live London events, where there will be hands-on training for Percona XtraDB Cluster.

