restart problem with XtraDB cluster nodes


  • restart problem with XtraDB cluster nodes

    Hi,

    I set up a cluster of six nodes and everything works fine, but when I shut down a node I can't start it again. Here is the log output that appears during startup:

    terminate called after throwing an instance of 'gu::NotFound'
    13:43:36 UTC - mysqld got signal 6 ;
    This could be because you hit a bug. It is also possible that this binary
    or one of the libraries it was linked against is corrupt, improperly built,
    or misconfigured. This error can also be caused by malfunctioning hardware.
    We will try our best to scrape up some info that will hopefully help
    diagnose the problem, but since we have already crashed,
    something is definitely wrong and this may fail.

    key_buffer_size=0
    read_buffer_size=262144
    max_used_connections=0
    max_threads=10000
    thread_count=2
    connection_count=2
    It is possible that mysqld could use up to
    key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 7806328 K bytes of memory
    Hope that's ok; if not, decrease some variables in the equation.

    Thread pointer: 0x1557c500
    Attempting backtrace. You can use the following information to find out
    where mysqld died. If you see no messages after this, something went
    terribly wrong...
    stack_bottom = 412ad0f8 thread_stack 0x40000
    /usr/sbin/mysqld(my_print_stacktrace+0x35)[0x7bb1c5]
    /usr/sbin/mysqld(handle_fatal_signal+0x4a4)[0x693af4]
    /lib64/libpthread.so.0[0x31f4a0ebe0]
    /lib64/libc.so.6(gsignal+0x35)[0x31f4230285]
    /lib64/libc.so.6(abort+0x110)[0x31f4231d30]
    /usr/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x114)[0x31f8ebed14]
    /usr/lib64/libstdc++.so.6[0x31f8ebce16]
    /usr/lib64/libstdc++.so.6[0x31f8ebce43]
    /usr/lib64/libstdc++.so.6[0x31f8ebce56]
    /usr/lib64/libstdc++.so.6(__cxa_call_unexpected+0x48)[0x31f8ebc8a8]
    /usr/lib64/libgalera_smm.so(_ZN6galera13ReplicatorSMM15prepare_for_ISTERPvRlRK10wsrep_uuidl+0x70b)[0x2aaaacf3c88b]
    /usr/lib64/libgalera_smm.so(_ZN6galera13ReplicatorSMM21prepare_state_requestEPKvlRK10wsrep_uuidl+0x13d)[0x2aaaacf3cb3d]
    /usr/lib64/libgalera_smm.so(_ZN6galera13ReplicatorSMM22request_state_transferEPvRK10wsrep_uuidlPKvl+0x35)[0x2aaaacf3cd95]
    /usr/lib64/libgalera_smm.so(_ZN6galera13ReplicatorSMM19process_conf_changeEPvRK15wsrep_view_infoiNS_10Replicator5StateEl+0x5cb)[0x2aaaacf2d60b]
    /usr/lib64/libgalera_smm.so(_ZN6galera15GcsActionSource8dispatchEPvRK10gcs_action+0x8ee)[0x2aaaacf0e5de]
    /usr/lib64/libgalera_smm.so(_ZN6galera15GcsActionSource7processEPv+0x58)[0x2aaaacf0e898]
    /usr/lib64/libgalera_smm.so(_ZN6galera13ReplicatorSMM10async_recvEPv+0xfd)[0x2aaaacf2caed]
    /usr/lib64/libgalera_smm.so(galera_recv+0x23)[0x2aaaacf41be3]
    /usr/sbin/mysqld(_Z25wsrep_replication_processP3THD+0x6b)[0x58d12b]
    /usr/sbin/mysqld(start_wsrep_THD+0x3f3)[0x51ffa3]
    /lib64/libpthread.so.0[0x31f4a0677d]
    /lib64/libc.so.6(clone+0x6d)[0x31f42d325d]

    Trying to get some variables.
    Some pointers may be invalid and cause the dump to abort.
    Query (0): is an invalid pointer
    Connection ID (thread ID): 2
    Status: NOT_KILLED

    The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
    information that should help you find out what is causing the crash.
    120411 15:43:36 mysqld_safe mysqld from pid file /var/lib/mysql/srv2.pid ended


    I found a workaround to restart the node: I delete the grastate.dat file in the datadir and then it starts working again.

    Why can't this node start if I don't delete this grastate.dat file?

    Thank you in advance for the help

  • #2
    Hi,

    I see that a new forum called Percona XtraDB Cluster has been opened; can anyone move this thread into it?

    After many tests, I found that this problem was related to one of the nodes, which is a replication slave. A lot of transactions arrive on this node through the replication channel. If I remove this node from the cluster, stopping and starting all the other nodes works perfectly, but in that case no data is being inserted.

    Initially I had configured the replication slave node as the first node of the cluster, so I removed it from the cluster and added it back so that it is no longer the first cluster node. Even then, when I stop and start a node, I see the same error message in the mysqld.log file.

    I don't understand and I don't know how to avoid this.



    • #3
      Edit:

      It's apparently not a replication problem. I excluded the replication slave node from the cluster, launched a sysbench run against one node, and tried to restart another node during the benchmark: same issue.



      • #4
        So after some research on Launchpad, I finally managed to fix my problem.

        In most tutorials, no values are given for the wsrep_provider_options parameter, so I had not used this parameter when configuring my nodes.
        In this bug report: https://bugs.launchpad.net/percona-xtradb-cluster/+bug/914976
        We can see that this parameter is used and the backtrace is almost the same as mine, so I decided to set this parameter with the syntax recommended in the responses on that bug, and now my problem is solved.
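
        As an illustration only, the setting looks something like the sketch below; this is not my exact my.cnf, the IP and port are placeholders, and I am assuming the relevant provider option is the IST receive address, as discussed in that bug:

          [mysqld]
          # Sketch only: explicitly tell the Galera provider where this node
          # receives IST; replace 192.168.0.12:4568 with the node's own
          # address and IST port.
          wsrep_provider_options="ist.recv_addr=192.168.0.12:4568"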

        I can now insert large amounts of data on one node and restart another node without any problems or errors.

        Now I am wondering why, in some contexts, I get this error in mysqld.log and why the node refuses to start:

        [ERROR] WSREP: Local state seqno (20769849) is greater than group seqno (20176725): states diverged. Aborting to avoid potential data loss. Remove '/var/lib/mysql//grastate.dat' file and restart if you wish to continue. (FATAL)
        at galera/src/replicator_str.cpp:state_transfer_required():34

        But it's not a backtrace, and if I understand what happens: when mysqld fails and then recovers, depending on the value of wsrep_cluster_address, a new cluster may be created (if wsrep_cluster_address=gcomm://), and when I then want this node to rejoin the original cluster, the seqno has diverged. Am I right?
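
        For context, the grastate.dat file that the error tells me to remove just stores the saved Galera state; it looks roughly like this (the uuid below is a placeholder, and the seqno is the local seqno from the error above):

          # GALERA saved state
          version: 2.1
          uuid:    00000000-0000-0000-0000-000000000000
          seqno:   20769849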

        Can anyone respond? I feel so alone in this thread.



        • #5
          Yup, totally right, you have addressed all the pros and cons while solving this issue. Now it is working perfectly. Thanks



          • #6
            Quote:
            We can see that this parameter is used and the backtrace is almost the same as mine, so I decided to set this parameter with the syntax recommended in the responses on that bug, and now my problem is solved.
            That parameter is indeed useful in many situations when you need to alter Galera behaviour. But in your specific case the more useful variable would be wsrep_node_address, which is a one-stop option for configuring the address used for both SST and IST (and optionally a base listen port).
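
            For example (sketch only; the IP is a placeholder and should be each node's own address):

              [mysqld]
              # One address from which Galera derives both the SST and the
              # IST receive addresses (and, optionally, a base listen port).
              wsrep_node_address=192.168.0.12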

            Quote:
            [ERROR] WSREP: Local state seqno (20769849) is greater than group seqno (20176725): states diverged. Aborting to avoid potential data loss. Remove '/var/lib/mysql//grastate.dat' file and restart if you wish to continue. (FATAL)
            at galera/src/replicator_str.cpp:state_transfer_required():34

            But it's not a backtrace, and if I understand what happens: when mysqld fails and then recovers, depending on the value of wsrep_cluster_address, a new cluster may be created (if wsrep_cluster_address=gcomm://), and when I then want this node to rejoin the original cluster, the seqno has diverged. Am I right?
            You seem to be right. Never leave a bare gcomm:// in my.cnf; it is only for bootstrapping a new cluster. You were lucky this time: if the local seqno had not been greater than the group seqno, the diverged state would not have been detected and you could have ended up with inconsistent nodes.
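
            As a sketch (placeholder IPs): once the cluster exists, every node's my.cnf should list the cluster members instead of a bare gcomm://, which is reserved for bootstrapping the very first node.

              [mysqld]
              # Permanent setting on every node: point at the other members,
              # never leave an empty gcomm:// here after bootstrap.
              wsrep_cluster_address=gcomm://192.168.0.11,192.168.0.12,192.168.0.13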



            • #7
              To answer your suggestions: both parameters, wsrep_node_address and wsrep_cluster_address, were used and correctly configured from the start of our XtraDB Cluster deployment.

              For wsrep_node_address, I used the IP of the local node, but this parameter alone doesn't solve the problem; I still need to configure wsrep_provider_options to allow the node to restart correctly.

              For wsrep_cluster_address, I used gcomm:// for the first node only (bootstrapping). After that, I always set wsrep_cluster_address to the IP address of a working node when launching a new node or restarting one.
              And it is in that context that I hit this issue: [ERROR] WSREP: Local state seqno (20769849) is greater than group seqno (20176725): states diverged ...

              So it's not a misconfiguration of the cluster; setting wsrep_provider_options really is required for me to get a working environment.

              Thanks anyway.
