Too many connections brings down cluster


  • Too many connections brings down cluster

    Recently converted our IPB forum from a single server to a 3-server PXC cluster balanced through HAProxy (leastconn). All servers are VMs in the same DC, so latency between them is minimal. The forum is configured to write to all 3 servers.

    I constantly see "Too many connections" warnings in the logs, and several times per week the cluster becomes completely unresponsive. Usually one or two servers have exhausted max_connections. The only solution is to shut the cluster down completely and bootstrap it from scratch.

    This doesn't seem related to the load on the forum, and I never saw this error when using a single server.

    I already increased max_connections from 500 to 1000 and the issue still occurs.

    I have currently set HAProxy to send all SQL traffic to a single server in the cluster, and so far there have been no crashes.

    Is a multi-master setup just not reliable enough to use in production?
    Are there any settings to improve stability even at the cost of replication speed?
    How would I troubleshoot this other than monitoring number of active connections?


  • #2
    I think the first thing you should do is check what all those connections are doing. Is it only an issue with connections not being closed, or were you experiencing query pile-ups, with too many threads running or blocked on something?
    The output from
    SHOW STATUS LIKE 'Threads%';
    and some simple stats from information_schema.processlist, like:
    SELECT substring_index(host, ':', 1) AS host_name, state, count(*) FROM information_schema.processlist GROUP BY state, host_name;
    could be a good starting point for the investigation.
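    If a node gets so overloaded that you can no longer run SQL on it, the same per-state counts can be pulled from a previously saved processlist dump with a short awk pass. A sketch, assuming the dump was written with `mysql -e "SHOW PROCESSLIST" > processlist.txt`, whose batch output is tab-separated with State as the 7th column:

```shell
# Aggregate a saved tab-separated SHOW PROCESSLIST dump by the State
# column (field 7), most frequent state first.
awk -F'\t' 'NR > 1 { state = ($7 == "") ? "(none)" : $7; n[state]++ }
            END { for (s in n) printf "%6d  %s\n", n[s], s }' processlist.txt |
  sort -rn
```

    Saving such a dump from cron every minute gives you something to look at after the fact, when the cluster has already locked up.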


    • #3
      First, I want to mention that the problem goes away completely when all traffic is directed to a single node in HAProxy. It has been running without any problems for a week. Of course, that completely defeats the point of running a cluster, but at least it confirms that the problem is caused by reading and writing on multiple nodes at the same time.

      Here's the output of the commands you mentioned from the server currently handling all traffic:
      mysql> SHOW STATUS LIKE 'Threads%';
      +-------------------+-------+
      | Variable_name     | Value |
      +-------------------+-------+
      | Threads_cached    | 47    |
      | Threads_connected | 5     |
      | Threads_created   | 7466  |
      | Threads_running   | 3     |
      +-------------------+-------+
      4 rows in set (0.00 sec)
      mysql> SELECT substring_index(host, ':', 1) AS host_name, state, count(*) FROM information_schema.processlist GROUP BY state, host_name;
      +-----------+---------------------------+----------+
      | host_name | state                     | count(*) |
      +-----------+---------------------------+----------+
      |           |                           |        2 |
      |           | committed 13012839        |        1 |
      | localhost | executing                 |        1 |
      |           | wsrep aborter idle        |        1 |
      |           | wsrep in pre-commit stage |        1 |
      +-----------+---------------------------+----------+
      5 rows in set (0.01 sec)


      • #4
        After another week, it seems that as long as at least one node is not participating in the LB rotation, the cluster remains stable.
        Maybe Percona engineers can explain this behavior?
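        For reference, keeping one active node with the rest held in reserve, while still getting automatic failover, is commonly expressed in HAProxy with `backup` servers. A minimal sketch; the host addresses and the clustercheck port are placeholders for your own values:

```haproxy
# Single-writer backend: node1 takes all SQL traffic; node2/node3 only
# receive connections if node1 fails its health check.
backend pxc_single_writer
    mode tcp
    option httpchk               # assumes clustercheck listening on 9200
    server node1 10.0.0.1:3306 check port 9200
    server node2 10.0.0.2:3306 check port 9200 backup
    server node3 10.0.0.3:3306 check port 9200 backup
```

        This avoids multi-writer certification conflicts while keeping the other nodes synced and ready to take over.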


        • #5
          daq: We'd have to get a better picture of what the problem looks like. We've set up plenty of HAProxy configurations where all nodes are in the load-balancing pool. It must be something specific to your environment.


          • #6
            The same thing happened to me. Since last month our Percona cluster suddenly started having connection issues, and we tried to fix it by applying the following settings:

            tcp_tw_recycle=0
            tcp_tw_reuse=10

            First we edited both settings, but then we had issues even connecting to the server over SSH, so we disabled tcp_tw_recycle and everything worked until today (12 days).
            Initially we had a 3-node cluster, and we added another 2 nodes about two days ago. This morning the cluster suddenly stopped replying to queries, throwing "Lock wait timeout exceeded; try restarting transaction" from every node. However, we could still log in to the MySQL CLI. The failure started on node3, and I was able to record the TCP connection status:

            EME-MYSQL-N3 #ServerName
            CLOSE_WAIT   256
            ESTABLISHED  73
            TIME_WAIT    1

            The MySQL error log on Node1 and Node3 can be found here
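            For anyone who wants to capture the same per-state breakdown, it can be reproduced on any node with a one-liner over `netstat` output. A sketch, assuming MySQL listens on port 3306 and that the TCP state is the last field of each `netstat -nt` line:

```shell
# Count TCP connections involving port 3306, grouped by state
# (ESTABLISHED, CLOSE_WAIT, TIME_WAIT, ...).
netstat -nt | awk '/:3306/ { n[$NF]++ } END { for (s in n) print s, n[s] }'
```

            A large and growing CLOSE_WAIT count like the one above usually means the application side closed the socket but mysqld never did, which fits threads hanging inside the server.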


            • #7
              Just to chime in - we have a similar issue with our cluster setup (5 nodes + HA). It all seems to be working fine, but the cluster regularly runs into too many connections on every node. It simply stops finishing threads; they hang forever, and connections multiply as users make requests.

              We have since dropped to just one node (running in bootstrapped mode), with the others kept as backups, but these issues still occur, albeit less frequently (approximately once every 3-4 days). The server runs smoothly, never with more than 10 active queries at a time.

              Then, suddenly, it just stops. The queries are never finished (they sit in various states) and multiply as users make new requests. There's nothing at all in the error log (and we log warnings as well). The only solution is a restart; killing processes doesn't make new ones be processed.

              We ran standalone Percona Server for a long time and only recently switched to the Cluster version. We never had these glitches in standalone mode, and the configuration remained basically the same.


              • #8
                I'm having the same problem with Percona XtraDB Cluster 5.6. When the server is running standalone there are no problems, but when another node is active it starts queueing requests indefinitely, mostly in "wsrep in pre-commit stage". It is as if the server were stuck holding a lock, yet replication itself works, because I can still replicate a database, for example; only external access is compromised.


                • #9
                  Hello there! przemek, I'm having this same problem... Is there a solution or a fix?


                  • #10
                    Guys, in each case it may be a different problem.
                    What does "show status like 'wsrep%';" show? What exactly is in the processlist? Anything in the error logs? Do you run any large transactions? What is the cluster status on all nodes? "Too many connections" or "stopped" does not tell us much. Try to investigate the system while the problem is happening. The pt-stalk tool or https://github.com/jayjanssen/myq_gadgets may help with that.


                    • #11

                      What's your ulimit?
                      ulimit -n

                      And for just the mysqld process:
                      cat /proc/<pid>/limits

                      The size of "Max open files" would be interesting!
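                      To pull that value out in one step, the relevant line can be extracted from /proc directly. A sketch, assuming a single mysqld process, with the soft and hard "Max open files" values in fields 4 and 5 of their line in /proc/<pid>/limits:

```shell
# Print the soft and hard open-file limits of the running mysqld.
pid=$(pidof mysqld)
awk '/Max open files/ { print "soft:", $4, "hard:", $5 }' "/proc/$pid/limits"
```

                      Each client connection needs at least one file descriptor, so if the soft limit is lower than what max_connections plus open tables require, mysqld can run out of descriptors well before its configured connection cap.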


                      • #12
                        The same issue happened to me. I deployed a 3-node cluster using Percona XtraDB Cluster 5.6.

                        We ran our cluster for days without any problem, but one day we faced this: we ran out of connections and the cluster went off.
                        I did some troubleshooting, and in the output of SHOW PROCESSLIST I found a lot of connections in the same state: unauthenticated user - trying to connect

                        I don't know what was causing this. I called the network guys, and they said the load balancer had only 10 concurrent sessions at the time we had the problem.

                        I tried increasing the max_connections value from 2048 to 4096, and the issue remained.

                        All 3 nodes are physical machines with 48 CPUs and 128 GB of memory.

                        Could someone give me some clarification about what is happening and how to solve it?


                        • #13
                          fmarinho1980, usually the first thing to investigate is MySQL's error log and the current status variables. What does "cluster went off" mean? Was Threads_connected equal to max_connections? Was the cluster operational, with nodes in Synced and Primary state? This is all too little data to see what was really happening.