Too many connections brings down cluster

  • Too many connections brings down cluster

    We recently converted our IPB forum from a single server to a 3-server PXC cluster balanced through HAProxy (leastconn). All servers are VMs in the same DC, so latency between them is minimal. The forum is configured to write to all 3 servers.

    I constantly see "Too many connections" warnings in the logs, and several times per week the cluster becomes completely unresponsive; usually one or two servers have exhausted max_connections. The only solution is to shut the cluster down completely and bootstrap it from scratch.

    This doesn't seem related to the load on the forum, and I never saw this error when using a single server.

    I already increased max_connections from 500 to 1000 and the issue still occurs.
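
    For reference, that change is just the usual my.cnf setting (the value shown is the one mentioned above):
    Code:
    [mysqld]
    max_connections = 1000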

    I have currently set HAProxy to send all SQL traffic to a single server in the cluster and, so far, no crashes.
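
    In haproxy.cfg terms this is roughly a single-writer backend like the sketch below (node names and IPs are placeholders, not my actual hosts); only node1 takes traffic and the others act as standbys:
    Code:
    backend pxc_single_writer
        mode tcp
        option tcpka
        balance leastconn
        server node1 10.0.0.1:3306 check
        server node2 10.0.0.2:3306 check backup
        server node3 10.0.0.3:3306 check backup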

    Is a multi-master setup just not reliable enough to use in production?
    Are there any settings to improve stability, even at the cost of replication speed?
    How would I troubleshoot this, other than monitoring the number of active connections?

    Thanks.

  • #2
    I think the first thing you should do is check what all those connections are doing. Is it only an issue with connections not being closed, or were you experiencing query pile-ups, with too many threads running or blocked on something?
    The output from
    Code:
    SHOW STATUS LIKE 'Threads%';
    and some simple stats from information_schema.processlist, like:
    Code:
    SELECT substring_index(host, ':',1) AS host_name,state,count(*) FROM information_schema.processlist  GROUP BY state,host_name;
    could be a good start for the investigation.

    • #3
      First I want to mention that the problem goes away completely when all traffic is directed to a single node in HAProxy. It has been running without any problems for a week. Of course, that completely defeats the point of running a cluster, but at least it confirms that the problem is caused by accessing multiple nodes for reads/writes at the same time.

      Here's the output of the commands you mentioned from the server currently handling all traffic:
      Code:
      mysql> SHOW STATUS LIKE 'Threads%';
      +-------------------+-------+
      | Variable_name     | Value |
      +-------------------+-------+
      | Threads_cached    | 47    |
      | Threads_connected | 5     |
      | Threads_created   | 7466  |
      | Threads_running   | 3     |
      +-------------------+-------+
      4 rows in set (0.00 sec)
      Code:
      mysql> SELECT substring_index(host, ':',1) AS host_name,state,count(*) FROM information_schema.processlist  GROUP BY state,host_name;
      +--------------+---------------------------+----------+
      | host_name    | state                     | count(*) |
      +--------------+---------------------------+----------+
      | 10.12.28.248 |                           |        2 |
      |              | committed 13012839        |        1 |
      | localhost    | executing                 |        1 |
      |              | wsrep aborter idle        |        1 |
      | 10.12.28.248 | wsrep in pre-commit stage |        1 |
      +--------------+---------------------------+----------+
      5 rows in set (0.01 sec)

      • #4
        After another week, it seems that as long as at least one node is not participating in the LB rotation, the cluster remains stable.
        Maybe Percona engineers can explain this behavior?

        • #5
          daq: We'd have to get a better picture of what the problem looks like. We've set up plenty of HAProxy configurations where all nodes are in the load balancing pool. It must be something specific to your environment.

          • #6
            The same thing happened to me as well. Since last month our Percona cluster suddenly started having connection issues, and we tried to fix it by applying the following settings.

            Code:
            tcp_tw_recycle=0
            tcp_tw_reuse=10
            First we edited both settings, but then we had issues even connecting to the server over SSH, so we disabled tcp_tw_recycle and everything worked until today (12 days).
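
            (Those correspond to the kernel keys net.ipv4.tcp_tw_recycle and net.ipv4.tcp_tw_reuse; just as a generic check, the current values can be read with the command below.)
            Code:
            sysctl net.ipv4.tcp_tw_recycle net.ipv4.tcp_tw_reuse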

            Initially we had a 3-node cluster, and we added another 2 nodes about two days ago. This morning the cluster suddenly stopped replying to queries, throwing

            "Lock wait timeout exceeded; try restarting transaction"

            from every node. However, we could still log in to the MySQL CLI. The failure started on node3, and I was able to record the TCP connection status:

            Code:
            EME-MYSQL-N3  #ServerName
            CLOSE_WAIT    256
            ESTABLISHED   73
            TIME_WAIT     1

            The MySQL error logs from Node1 and Node3 can be found here.
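
            (Counts like that can be gathered with a one-liner along these lines, shown only as a generic example:)
            Code:
            netstat -ant | awk '{print $6}' | sort | uniq -c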


            • #7
              Just to chime in - we have a similar issue with our cluster setup (5 nodes + HA). It all seems to be working fine, but the cluster regularly runs into too many connections on every node. It simply stops finishing the threads; they hang forever, and connections multiply as users make requests.

              We have since dropped to just one node (running in bootstrapped mode), with the others kept as backups, but these issues still occur, albeit less frequently (approximately once every 3-4 days). The server otherwise runs smoothly, never with more than 10 active queries at a time.

              Then, suddenly, it just stops. The queries never finish (they sit in various states) and multiply as users make new requests. There's nothing at all in the error log (and we log warnings as well). The only solution is to restart; killing processes doesn't make new ones get processed.

              We ran standalone Percona Server for a long time and only recently switched to the Cluster version. We never had these glitches in standalone mode, and the configuration remained basically the same.

              • #8
                Hello,
                I'm having the same problem, using Percona XtraDB Cluster 5.6. When the server runs standalone there are no problems, but as soon as another node is active it starts queuing requests indefinitely, mostly in the "wsrep in pre-commit stage" state. It is as if the server were stuck in a lock, yet replication itself works, because I can replicate a database, for example; only external access is affected.
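
                A quick way to confirm the pile-up is to count those states, along the lines of the processlist query earlier in the thread:
                Code:
                SELECT state, COUNT(*) FROM information_schema.processlist
                WHERE state LIKE 'wsrep%' GROUP BY state;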

                • #9
                  Hello there! So przemek, I'm having this same problem... Is there a solution or a fix yet?

                  • #10
                    Guys, in each case it may be a different problem.
                    What does "show status like 'wsrep%';" show, and what exactly is in the processlist? Anything in the error logs? Do you run any large transactions? What is the cluster status on all nodes? "Too many connections" or "stopped" does not tell us much. Try to investigate the system while the problem is happening; the pt-stalk tool or https://github.com/jayjanssen/myq_gadgets may help with that.
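
                    For example, a minimal set of things to capture on each node while it is stuck (standard Galera status variables, nothing environment-specific):
                    Code:
                    SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';
                    SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';
                    SHOW GLOBAL STATUS LIKE 'wsrep_flow_control_paused';
                    SHOW FULL PROCESSLIST;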

                    • #11
                      Hi,

                      What's your ulimit?
                      Code:
                      ulimit -n

                      And for just the mysqld process:
                      Code:
                      cat /proc/<pid>/limits

                      The size of "Max open files" would be interesting!
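
                      Something like this shows it directly (assuming a single mysqld process on the box):
                      Code:
                      cat /proc/$(pidof mysqld)/limits | grep "Max open files"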
