PXDBC Strange stall of nodes

    Hi,

    We suffered strange behavior this afternoon on our production 3-node Percona XtraDB Cluster 5.5:
    - two of the three nodes quickly had their processlists fill up (max_connections is set to 500). A quick look at the top of the third node's processlist, which was growing but not yet full, showed these queries:
    | 2222493 | xxxxxx | anubis-priv:42860 | doukbev2 | Query | 650 | wsrep in pre-commit stage | UPDATE `bedoukv2`.`devis` SET `parent_id` ="63553" WHERE `devis`.`devis_id` =63558 | 0 | 1 | 1 |
    | 2226375 | root | 192.168.%.%:33099 | doukbev2 | Query | 650 | wsrep in pre-commit stage | UPDATE temp_source_children SET used = 1 | 0 | 6949 | 6949 |
    | 2226411 | root | 192.168.%.%:33253 | doukbev2 | Query | 648 | Preparing for TO isolation | TRUNCATE TABLE temp_source_children | 0 | 7 | 7 |
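
    For anyone investigating a similar stall, the statements we used to inspect the nodes were along these lines (a minimal sketch; these are standard MySQL/Galera status variables, but which ones matter in your case may differ):

    ```sql
    -- How full is the processlist relative to max_connections?
    SHOW GLOBAL STATUS LIKE 'Threads_connected';
    SHOW GLOBAL VARIABLES LIKE 'max_connections';

    -- Look for queries stuck in wsrep stages such as
    -- 'wsrep in pre-commit stage' or 'Preparing for TO isolation'
    SHOW FULL PROCESSLIST;

    -- Galera health on each node: a healthy node reports
    -- wsrep_local_state_comment = 'Synced' and a
    -- wsrep_flow_control_paused value close to 0
    SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';
    SHOW GLOBAL STATUS LIKE 'wsrep_flow_control_paused';
    ```
    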

    To resolve the situation, we tried to kill some of the queries, but the many queries waiting on all nodes could not be killed: each time we killed a query its state changed to "Killed", yet it stayed in the processlist.
    We then shut down one of the nodes to see whether that would clear the killed queries or free some slots on the other nodes, but nothing happened, so we were finally forced to shut down all the other nodes one by one,

    bringing the whole cluster down and causing a total outage of the web services in front ...
    After bootstrapping the cluster and restarting the nodes one by one so that each received an SST from the first node, we finally got the cluster back up and running ... with a total downtime of almost 30 minutes.
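
    The recovery sequence was essentially the following (a sketch assuming the PXC init script; depending on the setup the seed node may instead need to be started with `mysqld --wsrep-new-cluster`):

    ```shell
    # On the node chosen as the new cluster seed: start it as a
    # one-node primary component
    service mysql bootstrap-pxc

    # Then, on each remaining node in turn, a plain start lets the
    # node join the running cluster and request an SST
    service mysql start
    ```
    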

    In the logs of the second node, on which the queries above were found, I found these entries, which appeared just after the shutdown was initiated:

    140318 17:39:28 [Note] /usr/sbin/mysqld: Normal shutdown

    140318 17:39:28 [Note] WSREP: Stop replication
    140318 17:39:28 [Note] WSREP: Closing send monitor...
    140318 17:39:28 [Warning] WSREP: TO isolation failed for: 3, sql: TRUNCATE TABLE temp_source_children. Check wsrep connection state and retry the query.
    140318 17:39:28 [Note] WSREP: Closed send monitor.
    140318 17:39:28 [Note] WSREP: gcomm: terminating thread
    140318 17:39:28 [Note] WSREP: gcomm: joining thread
    140318 17:39:28 [Note] WSREP: gcomm: closing backend
    140318 17:39:28 [Warning] WSREP: TO isolation failed for: 3, sql: TRUNCATE TABLE temp_source_children. Check wsrep connection state and retry the query.


    For more information: this TRUNCATE statement is run from within a stored procedure.
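
    For context, the procedure does something along these lines (a simplified sketch; the procedure name here is hypothetical, not our real one). Since TRUNCATE is DDL, Galera replicates it under total-order isolation (TOI), which is what the "Preparing for TO isolation" state above refers to; a DELETE, by contrast, replicates as an ordinary transactional write set:

    ```sql
    DELIMITER //
    CREATE PROCEDURE reset_source_children()  -- hypothetical name
    BEGIN
      -- DDL: replicated under total-order isolation (TOI)
      TRUNCATE TABLE temp_source_children;

      -- Possible alternative (DML, normal write-set replication):
      -- DELETE FROM temp_source_children;
    END //
    DELIMITER ;
    ```
    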
    So my questions are:
    - Do you think this TRUNCATE could be the cause of all the problems we had on the cluster?
    - Do you think it is proper and common to run a TRUNCATE within a stored procedure? (I have seen that the TRUNCATE command is not even popular/recommended among many MySQL users ...)

    - Do you have any recommendations on how to handle such cases with Percona XtraDB Cluster? Maybe there are better actions to take to prevent this problem from happening again, or to get the cluster back up and running without having to shut down all the nodes?
    - Is any more information needed to investigate this problem? I did not find any bug entry on Launchpad or any forum cases that seem closely related so far. I also have the error logs at my disposal, but they are pretty large; I can provide them to anyone who wants to investigate in more depth.

    Thanks in advance for reading, and for any suggestions or solutions!

    Regards


    Laurent MINOST