Announcement

Announcement Module
Collapse
No announcement yet.

Issue with joining to bootstrapped node

Page Title Module
Move Remove Collapse
X
Conversation Detail Module
Collapse
  • Filter
  • Time
  • Show
Clear All
new posts

  • Issue with joining to bootstrapped node

    Hello,

    We have recently tried and tested Percona XtraDB cluster in lab for some time and decided to roll it into production.
    The production environment is consisted of 5 servers.
    Only the main one is read&write node, and others are for backup via HAProxy.

    The initial migration from standalone node (Percona Mysql Server) went fine with no glitches, and soon we had all 5 nodes up&running.

    A week later some serious issues started to occur.
    We have rolled a new web server (pure clone of the old one, just different IP). It worked fine alongside old one (behind haproxy).
    We forgot to change one connection string on a subsite , and as soon as it went online , ALL nodes went offline with following:

    Slave SQL: Error 'You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '? WHERE (codeId= 4967241)' at line 1' on query. Default database: '-----------'. Query: 'UPDATE `--------`.`codes` SET `codeUsed` = ? WHERE (codeId= xxxxxx)', Error_code: 1064

    After that we bootstrapped the main node and joined (full SST) other nodes. All worked until we restarted Apache, upon which same error happened (same query, different id) and all nodes went offline simultaneously.

    After that we are running bootstrapped node just fine, but cannot join other nodes. SST goes on fine, but the same error is produced on subsequential IST (DB is live) whenever we try to join a node.

    As of today, I have a even more confusing error when doing SST:

    print() on closed filehandle XTRABACKUP_PID at /usr/bin/innobackupex line 1084.

    >> log scanned up to (741995496503)
    >> log scanned up to (741995496503)
    >> log scanned up to (741995496503)
    >> log scanned up to (741995496503)
    >> log scanned up to (741995496503)

    ....which goes on forever. If i interrupt this process on joining node, donor crashes.

    my.cnf snip:


    default_storage_engine=InnoDB
    innodb_locks_unsafe_for_binlog=1
    innodb_autoinc_lock_mode=2
    wsrep_replicate_myisam = 1

    ##Per node conf
    wsrep_node_address=10.42.71.68 #IP adresa TOG node-a, ne remote.
    wsrep_sst_method=xtrabackup
    wsrep_cluster_name=Cluster


    Mysqldump works fine and without error on ALL databases at donor. All DB's in question are InnoDB (there's one myisam table for fulltext search but server never complained about it in the logs).

    But SST/IST no longer works, even after multiple restarts. I have cleared data dirs at joining nodes.
    We are running Percona-XtraDB-Cluster-5.5.31-23.7.5.438.Linux.x86_64

    Any ideas why this is happening and how to get cluster up&running again?

    Regards,
    Marko









  • #2
    No suggestions?

    Comment


    • #3
      Marko,

      Can you check exact query which causes error?

      What is the type of codeUsed filed in codes table?
      You can check it by:
      mysql> show create table codes;

      If it's string then value should be quoted:
      SET `codeUsed` = 'value'

      Comment


      • #4
        CodeUsed is smallint(6) primarily used as boolean (1/0) to check if a code has been used or it's free to use.

        I think it's a prepared statement in fact and somehow is passed to secondary node in that format. Secondary node fails because this statement isn't prepared...
        Just my theory (I have nothing better to go with) ,so I was hoping someone can give some more insight on this.

        However, this is concerning IST (after full SST) which I'm willing to work out with developers to see through the PHP code (although it works flawlessly on primary node).
        I'm now worried because even SST fails (on altogether different table) and innobackupex just loops indefinitely on "log scanned up to"...

        Comment


        • #5
          Marko,

          You can also check this bug where where log scanned up to (..) infinitely looping
          Check if you don't have something similar.
          I'm guessing that one night the system crashed while running innobackupex and left the $suspend_file, /tmp/xtrabackup_suspended, on the filesystem. It took a while for me to figure out that this then causes innobackupex to think xtrabackup has suspended itself the moment it starts, and then when xtrabackup actually does suspend itself innobackupex is waiting for it to end and does not re-remove the $suspend_file, leading to a wait deadlock. A cron job runs innobackupex, and I'm wondering if...

          Comment


          • #6
            Thanks! I found the same file on the system yesterday. I deleted it but didn't want to try another round of SST in infinite loop (usually crashes server) again without externally consulting here. Will try on monday, hopefully the same IST error (the query-related one) won't repeat and I'll have a few backups to fall onto if something goes wrong.

            Comment


            • #7
              Today I went through some of the code with the developers (concerning that query in the OP).

              It is parametrized using Zend framework (quoteInto function).

              What is very strange is that the query error reported is this:
              'UPDATE `somedatabase`.`table` SET `codeUsed` = ? WHERE (codeId= 4908555)

              Now, codeUsed is always flagged 1 (to denote it's been used and is not free) via quoteInto function and codeID is id of the record/code.

              What's peculiar is that it all goes smoothly on donor node (all transactions, they are commited nicely), but breaks on IST part of state transfer on the joining node.

              Anyone knows of any issues with zend framework v1?

              BTW, the SST part was due to the Bug posted above and is resolved, thanks.

              Comment


              • #8
                We have found this by turning on general query log:

                Prepare UPDATE `somedatabase`.`table` SET `codeUsed` = ? WHERE (codeId= 111)
                Execute UPDATE `somedatabase`.`table` SET `codeUsed` = '1' WHERE (codeId= 111)

                This is generated by Zend framework functions. We have managed to force static code but I'm uncertain why the IST should break on prepared statements. I didn't find any mention of it in the documentation or elsewhere on the web.

                Comment


                • #9
                  Originally posted by marko_s View Post
                  SET `codeUsed` = '1'
                  marko_s, again - quotes. You have there quoted smallint.

                  Can you please send us again err log of donor and joiner while trying IST?

                  The first thing what comes to my mind is donor's GCache (in case if SST works and IST not).

                  Please show the output of (from donor):
                  mysql> show variables like 'wsrep_provider_options'\G

                  Can you please increase it in case if it's default (like 128M)?
                  You can change GCache size in my.cnf by using the following syntax:
                  wsrep_provider_options='gcache.size=1G'
                  Last edited by mixa; 09-01-2013, 06:33 PM.

                  Comment


                  • #10
                    Yes, it's 128M.

                    The code was generated via Zend framework, as I stated. As soon as we switched to static UPDATE/INSERT's the problem went away and we were able to IST correctly.
                    Please note the queries in question updated rows on single node correctly and were commited to DB. Only on IST did it prove troublesome.

                    Comment

                    Working...
                    X