One node gets all queries stuck and collapses due to too many connections

  • One node gets all queries stuck and collapses due to too many connections

    Environment:
    3-node Percona XtraDB Cluster 5.7.19-29.22-3
    Ubuntu 16.04 LTS

    Every so often, one of the nodes gets all of its queries stuck and collapses due to too many connections. It is not always the same node, and it is not always the same queries.

    The time between these events varies, but we have seen intervals of between 6 and 10 hours.

    The only solution for us is to kill the mysqld process and start it again because shutting down the server gracefully is not possible.

    I have attached relevant graphs taken during the event.

    Has anyone had the same problem, or does anyone have an idea of how it could be solved?

    Kind Regards,
    Jose

    Attachments: commands on the affected node; commands on one of the nodes NOT affected; traffic on one of the nodes NOT affected

  • #2
    You are not alone in this. There is another post (https://www.percona.com/forums/quest...-cluster/page2) that may relate to your issue. We have ended up having to set up monitoring that alerts as soon as connections go over 125% of normal, so that we can restart the process or kill the processes and resync; a rough sketch of such a check is below.
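
    As an illustration only (not something the poster shared), here is a minimal Python sketch of that kind of check. It assumes pymysql is installed; the baseline of 100 connections, the 125% threshold, and the "monitor" credentials are all hypothetical values you would replace with your own:

      import pymysql

      # Hypothetical values -- replace with your own baseline and credentials.
      BASELINE_CONNECTIONS = 100   # "normal" connection count for this node
      ALERT_FACTOR = 1.25          # alert once we exceed 125% of normal

      def current_connections(host="127.0.0.1", user="monitor", password="secret"):
          """Read Threads_connected from the local node."""
          conn = pymysql.connect(host=host, user=user, password=password)
          try:
              with conn.cursor() as cur:
                  cur.execute("SHOW GLOBAL STATUS LIKE 'Threads_connected'")
                  _, value = cur.fetchone()
                  return int(value)
          finally:
              conn.close()

      if __name__ == "__main__":
          count = current_connections()
          if count > BASELINE_CONNECTIONS * ALERT_FACTOR:
              # Hook this into whatever alerting you use so an operator can
              # kill/restart mysqld and let the node resync via SST/IST.
              print("ALERT: %d connections exceeds 125%% of baseline" % count)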


    • #3
      At first we thought of a solution like yours, but in the end we have had to replace PXC because of this problem, as it is not acceptable in a production environment. We'll keep an eye on these threads for an eventual solution to this bug.

      Thank you for your answer, beautivile.


      • #4
        I presume you are hitting the following issue, as the symptoms are the same:

        commit 35cee763032246f809a84b92cbe014dbfa081972
        Author: Krunal Bauskar <krunal.bauskar@percona.com>
        Date: Fri Oct 13 21:42:08 2017 +0530

        - PXC#877: PXC hangs on what looks like an internal deadlock

        - PXC protects wsrep_xxxx state variables through a native mutex
        named LOCK_wsrep_thd.

        - This mutex should be held only while checking the state
        and should be released immediately.

        - In the said bug, this mutex was being held during the complete
        cleanup. Another thread with a conflicting lock started up at the
        same time and invoked a kill action on the said thread. To invoke
        the kill action it acquired the transaction mutex and
        LOCK_wsrep_thd, in that order. The said thread already held
        LOCK_wsrep_thd and tried to acquire the transaction mutex,
        leading to a deadlock.


        By the way, the issue was introduced in 5.7.17 and is present in 5.7.19 (it will be fixed in 5.7.20).
        If you are experiencing it on a release before 5.7.17, or on 5.6.x, then this is not the issue.
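
        For readers less familiar with this kind of bug, the following is a small, purely illustrative Python sketch of the AB-BA lock-ordering pattern the commit message describes; the lock names are stand-ins, not actual PXC code:

          import threading
          import time

          # Stand-ins for the two mutexes named in the commit message.
          lock_wsrep_thd = threading.Lock()   # guards wsrep_xxxx state variables
          trx_mutex = threading.Lock()        # the transaction mutex

          def cleanup_thread():
              # Buggy behaviour: LOCK_wsrep_thd is held for the whole cleanup,
              # during which the transaction mutex is also needed.
              with lock_wsrep_thd:
                  time.sleep(0.1)          # cleanup work, lock still held
                  with trx_mutex:          # blocks: the killer already owns it
                      pass

          def killer_thread():
              # The conflicting thread takes the same locks in the opposite
              # order: transaction mutex first, then LOCK_wsrep_thd.
              with trx_mutex:
                  time.sleep(0.1)
                  with lock_wsrep_thd:     # blocks: cleanup still owns it
                      pass

          t1 = threading.Thread(target=cleanup_thread)
          t2 = threading.Thread(target=killer_thread)
          t1.start(); t2.start()
          # Both threads now wait on each other forever -- the deadlock that
          # PXC#877 describes. The fix is to release LOCK_wsrep_thd right
          # after the state check instead of holding it through the cleanup.
          t1.join(); t2.join()             # never returns; the process hangs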


        • #5
          jroctanio,

          PXC will miss you. Maybe you can give it another try with 5.7.20.

          ----------------------

          Also, we would love to hear about any other issues you may face in the future. We will try our best to resolve them.
