Emergency

Too many connections brings down cluster


  • Too many connections brings down cluster

    Recently converted our IPB forum from a single server to a 3-server PXC balanced through HAProxy (leastconn.) All servers are VMs in the same DC so latency between them is minimal. The forum is configured to write to all 3 servers.

    I constantly see "Too many connections" warnings in the logs, and several times per week the cluster becomes completely unresponsive. Usually one or two servers have exhausted max_connections. The only fix is to shut the cluster down completely and bootstrap it from scratch.

    This doesn't seem related to the load on the forum and I've never seen this error when using a single server.

    I already increased max_connections from 500 to 1000 and the issue still occurs.

    I have currently set HAProxy to send all SQL traffic to a single server in the cluster and, so far, there have been no crashes.

    Is a multi-master setup just not reliable enough to use in production?
    Are there any settings to improve stability even at the cost of replication speed?
    How would I troubleshoot this other than monitoring number of active connections?

    Thanks.

  • #2
    I think the first thing you should do is check what all those connections are doing. Is the problem only that connections are not being closed, or were you seeing query pile-ups, with too many threads running or blocked on something?
    The output from
    Code:
    SHOW STATUS LIKE 'Threads%';
    and some simple stats from information_schema.processlist, like:
    Code:
    SELECT substring_index(host, ':',1) AS host_name,state,count(*) FROM information_schema.processlist  GROUP BY state,host_name;
    could be a good start for the investigation.



    • #3
      First I want to mention that the problem goes away completely when all traffic is directed to a single node in HAProxy. It has been running without any problems for a week. Of course, that completely defeats the point of running a cluster, but at least it confirms that the problem is caused by accessing multiple nodes for r/w at the same time.

      Here's the output of the commands you mentioned from the server currently handling all traffic:
      Code:
      mysql> SHOW STATUS LIKE 'Threads%';
      +-------------------+-------+
      | Variable_name     | Value |
      +-------------------+-------+
      | Threads_cached    | 47    |
      | Threads_connected | 5     |
      | Threads_created   | 7466  |
      | Threads_running   | 3     |
      +-------------------+-------+
      4 rows in set (0.00 sec)
      Code:
      mysql> SELECT substring_index(host, ':',1) AS host_name,state,count(*) FROM information_schema.processlist  GROUP BY state,host_name;
      +--------------+---------------------------+----------+
      | host_name    | state                     | count(*) |
      +--------------+---------------------------+----------+
      | 10.12.28.248 |                           |        2 |
      |              | committed 13012839        |        1 |
      | localhost    | executing                 |        1 |
      |              | wsrep aborter idle        |        1 |
      | 10.12.28.248 | wsrep in pre-commit stage |        1 |
      +--------------+---------------------------+----------+
      5 rows in set (0.01 sec)



      • #4
        After another week, it seems that as long as at least one node is not participating in the LB rotation, the cluster remains stable.
        Maybe Percona engineers can explain this behavior?



        • #5
          daq: We'd have to get a better picture of what the problem looks like. We've set up plenty of HAProxy configurations where all nodes are in the load balancing pool. It must be something specific to your environment.



          • #6
            The same thing happened to me. Last month our Percona cluster suddenly started having connection issues, and we tried to fix them by applying the following settings:

            tcp_tw_recycle=0
            tcp_tw_reuse=10

            At first we changed both settings, but then we had trouble even connecting to the server over SSH. So we disabled tcp_tw_recycle, and everything worked until today (12 days).
            Initially we had a 3-node cluster, and we added another 2 nodes about two days ago. This morning the cluster suddenly stopped replying to queries, throwing

            Lock wait timeout exceeded; try restarting transaction

            from every node. However, we could still log in to the MySQL CLI. The failure started on node3, and I was able to record the TCP connection status:

            EME-MYSQL-N3 #ServerName
            CLOSE_WAIT 256
            ESTABLISHED 73
            TIME_WAIT 1

            The MySQL error log on Node1 and Node3 can be found here
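
Per-state socket counts like the ones quoted in this post can be gathered with a small helper. This is a minimal sketch, assuming `netstat -tan` style output (lines starting with `tcp`, state in the sixth column). The function name `count_tcp_states` is hypothetical, and port 3306 in the usage comment is the MySQL default, not something stated in the post.

```shell
# count_tcp_states: read `netstat -tan`-style output on stdin and print
# "<STATE> <count>" for each TCP state seen, sorted by state name.
# Header lines are skipped because they do not start with "tcp".
count_tcp_states() {
  awk '$1 ~ /^tcp/ { n[$6]++ } END { for (s in n) print s, n[s] }' | sort
}

# Usage on a live node (3306 is an assumed port):
#   netstat -tan | grep ':3306 ' | count_tcp_states
```

A large and growing CLOSE_WAIT count generally means the remote side has closed its end while the local process has not yet closed the socket, so recording these counts over time can show on which node the pile-up starts.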




            • #7
              Just to chime in: we have a similar issue with our cluster setup (5 nodes + HA). It all seems to be working fine, but the cluster regularly hits too many connections on every node. It just stops finishing the threads; they hang forever, and connections multiply as users make requests.

              We have since dropped to a single active node (running in bootstrapped mode), with the others kept for backup, but these issues still occur, albeit less frequently (approximately once every 3-4 days). The server runs smoothly, never with more than 10 active queries at a time.

              Then, suddenly, it just stops. The queries never finish (they sit in various states), and they multiply as users make new requests. There's nothing at all in the error log (and we log warnings as well). The only solution is a restart (killing processes doesn't get new ones processed).

              We ran standalone Percona Server for a long time and only recently switched to the Cluster version. We never had these glitches in standalone mode. The configuration remained basically the same.



              • #8
                Hello,
                I'm having the same problem, using Percona XtraDB Cluster 5.6. When the server runs standalone there are no problems, but when another node is active it starts queuing requests indefinitely, mostly in the state "wsrep in pre-commit stage". It is as if the server were stuck holding a lock, yet replication itself works, because I can still replicate a database, for example. Only external access is affected.



                • #9
                  Hello there! So przemek, I'm having this same problem... Is there no solution or fix?



                  • #10
                    Guys, in each case it may be a different problem.
                    What does "show status like 'wsrep%';" show, and what exactly is in the processlist? Anything in the error logs? Do you run any large transactions? What is the cluster status on all nodes? "Too many connections" or "stopped" does not tell us much. Try to investigate the system while the problem is happening. The pt-stalk tool or https://github.com/jayjanssen/myq_gadgets may help with that.
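
To make the wsrep status easier to compare across nodes, the output can be filtered down to the values that look unhealthy. This is a rough sketch, assuming tab-separated name/value pairs as produced by `mysql -N -B -e "SHOW GLOBAL STATUS LIKE 'wsrep%'"`. The function name `wsrep_problems` and the 0.1 threshold for `wsrep_flow_control_paused` are assumptions, not recommendations from this thread.

```shell
# wsrep_problems: read "name<TAB>value" lines on stdin and print one line
# for each check that looks unhealthy. Prints nothing if every check passes.
wsrep_problems() {
  awk -F'\t' '
    $1 == "wsrep_cluster_status"      && $2 != "Primary" { print "cluster_status=" $2 }
    $1 == "wsrep_local_state_comment" && $2 != "Synced"  { print "local_state=" $2 }
    $1 == "wsrep_ready"               && $2 != "ON"      { print "ready=" $2 }
    # 0.1 is an assumed threshold for the fraction of time replication was paused
    $1 == "wsrep_flow_control_paused" && $2 + 0 > 0.1    { print "flow_control_paused=" $2 }
  '
}

# Usage on each node:
#   mysql -N -B -e "SHOW GLOBAL STATUS LIKE 'wsrep%'" | wsrep_problems
```

A node can legitimately report a state other than Synced for a while (for example while acting as a donor), so any output is a prompt to look closer rather than proof of a fault.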



                    • #11
                      Hi,

                      what's your ulimit?
                      ulimit -n

                      for just the mysqld process:
                      cat /proc/<pid>/limits

                      The size of "Max open files" would be interesting!
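
Each client connection uses a file descriptor, so the open-files limit of the running mysqld is worth recording alongside max_connections. A small sketch for pulling just that line out of the limits file; the function name `max_open_files` is hypothetical, and the usage line assumes `pidof mysqld` returns a single PID.

```shell
# max_open_files: read /proc/<pid>/limits-style text on stdin and print
# the soft and hard "Max open files" limits.
# In that file the fields are: Max open files <soft> <hard> files
max_open_files() {
  awk '/^Max open files/ { print "soft=" $4, "hard=" $5 }'
}

# Usage (assumes exactly one mysqld process):
#   max_open_files < "/proc/$(pidof mysqld)/limits"
```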



                      • #12
                        The same issue happened to me. I've deployed a 3-node cluster using Percona XtraDB 5.6.

                        We ran our cluster for days without any problem, but then one day we ran out of connections and the cluster went off.
                        I did some troubleshooting and, when I ran SHOW PROCESSLIST, found a lot of connections in the same state: unauthorized user - trying to connect

                        I don't know what was causing this issue. I called the network team, and they said that at the time of the problem the load balancer had only 10 concurrent sessions.

                        I tried increasing max_connections from 2048 to 4096, and the issue remained.

                        All these 3 nodes are physical machines with 48cpu and 128GB of memory.

                        Could someone clarify what is happening or how to solve this issue?



                        • #13
                          fmarinho1980, usually the first thing to do is look at MySQL's error log and the current status variables. What does "cluster went off" mean? Was Threads_connected equal to max_connections? Was the cluster operational, with nodes in the SYNCED and Primary state? This is too little data to see what was really happening.
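
The Threads_connected versus max_connections question can be answered with a single number per node. A minimal sketch, assuming both values have already been read from the server; the function name `connection_usage` is hypothetical.

```shell
# connection_usage CONNECTED MAX: print Threads_connected as an integer
# percentage of max_connections (rounded down).
connection_usage() {
  local connected="$1" max="$2"
  # Refuse a zero, negative, or non-numeric limit instead of dividing by it.
  [ "$max" -gt 0 ] 2>/dev/null || return 1
  echo $(( connected * 100 / max ))
}

# Usage on a live node:
#   connected=$(mysql -N -B -e "SHOW GLOBAL STATUS LIKE 'Threads_connected'" | cut -f2)
#   max=$(mysql -N -B -e "SELECT @@max_connections")
#   connection_usage "$connected" "$max"
```

A result of 100 means the node really had exhausted its connections at that moment. A much lower figure alongside "Too many connections" errors would point elsewhere.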



                          • #14
                            So, this thread has existed for over three years! In that time, the Percona Engineers have basically responded: "We don't have enough information to help you." But the subject of this thread is a KNOWN problem, many people have encountered the problem, and there appears to be NO solution.

                            For our part, we have started using ProxySQL in an attempt to GUARANTEE that all transactions take place on the same cluster node, and that "helped," but the problem still occurs now and then. Just last night, in the wee hours, our production three-node cluster locked up, would not handle any more queries, and was completely unresponsive to our app. The fact that this can suddenly happen AT ALL is in principle a deal-killer. We've invested far too much time already (years!) into trying to figure out HOW the cluster can possibly get itself into this state. And I fail to believe that the Percona Engineers have NEVER encountered this issue in their own testing.

                            For them to have not discovered this issue themselves (and FIXED it years ago), one of two things is the case, and I don't know which. Either they are testing on a simple, non-prod, "insignificant" environment which does not model real-world usage AT ALL, or they are not motivated to fix this problem because the only way people pay for "support" is if they are desperate and need on-the-phone, help-me-right-now sorts of communications with the Percona team.

                            If this seems provocative, I intend it to be. It's OUTRAGEOUS that a problem of this magnitude can still exist after this many years, and the Percona team apparently doesn't take seriously how devastating it is to have a production environment suddenly lock-up and affect customers, while we are frantically "bootstrapping" the cluster back into existence and thumping our feet wildly in frustration, while hours of resyncing takes place. There is NO excuse for the fact that this is a KNOWN problem and that the Percona team has not DEVOTED itself to replicating the issue and then fixing it! I am POSITIVE that this issue can be replicated, and Percona should be devoting themselves to doing that very thing! Yet, YEARS go by, and Percona seems to not seriously acknowledge that this even IS a deal-breaking issue!

                            The fact is that Percona cluster is NOT "ready for prime time." The master/master approach (which is why you'd really bother with the hassles of a cluster in the first place) is simply NOT reliable, and we've devoted ourselves for years to "patching it up" with the likes of ProxySQL and our own custom scripts. ALL we've been able to accomplish is "put off" the time in which the cluster WILL crash.

                            The nature of the "crash" itself seems to be that quite suddenly the nodes cannot sync, as they don't even have the needed available connections to make connections among themselves! So, this "out of connections" error is not just a "symptom." It is indicative of a fundamental "flow" taking place between the nodes, such that (and always very suddenly) the nodes cannot communicate among themselves. And, in this state, you cannot simply restart MySQL on one node at a time to "clear" the connections. Once in this state, even a MySQL restart only results in the restarted node hanging there, unable to resync with the other nodes. The needed connections to do so are GONE.

                            What is needed (apart from Percona's team tracking down how the problem can occur in the first place) is some setting to ensure that some proportion of available connections are ALWAYS dedicated to inter-node connections, so that syncing can ALWAYS occur, no matter what. No matter how "badly" an application might be written, there is NO EXCUSE that the cluster can get ITSELF into such a state that it cannot even communicate among its own nodes!

                            In addition, Percona should be logging its "core-state," including any internal variables that could indicate that it is "in trouble," whatever that might mean. That way, at least with Nagios/Icinga or some other monitoring service, you could detect that you had better intervene and restart the nodes BEFORE it's a full-bootstrap event to regain control of the cluster!
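
Until something like that exists, a monitoring check can be built on values the server already exposes. The sketch below follows the usual Nagios plugin exit-code convention (0 OK, 1 WARNING, 2 CRITICAL). The function name `check_threshold`, the choice of metric, and the thresholds in the usage comment are all assumptions.

```shell
# check_threshold VALUE WARN CRIT: Nagios-style status for an integer
# metric where higher is worse (for example, connection usage in percent).
# Prints a one-line status and returns 0 (OK), 1 (WARNING) or 2 (CRITICAL).
check_threshold() {
  local value="$1" warn="$2" crit="$3"
  if [ "$value" -ge "$crit" ]; then
    echo "CRITICAL - value=$value"
    return 2
  elif [ "$value" -ge "$warn" ]; then
    echo "WARNING - value=$value"
    return 1
  fi
  echo "OK - value=$value"
  return 0
}

# Usage with assumed thresholds of 80% and 90% of max_connections:
#   check_threshold 85 80 90   # prints "WARNING - value=85", returns 1
```

Whether an alert at 80% leaves enough time to intervene before a full bootstrap is needed would have to be established per cluster.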

                            What does "in trouble" mean? I don't know, but Percona's engineers SHOULD! Again, this thread is three years old, and there is still no taking the issue seriously from Percona's perspective. If MY application had this severe of a problem, MY team would be working night and day until we had tracked down how it could EVER happen, and it would have been fixed long before THREE YEARS had passed!

                            So, for anybody encountering this problem (and this thread in the hopes of finding a solution), I'll tell you what our "solution" is about to be: We give up on Percona. We're angry, and this very thread made us much angrier! We'll be moving to PostgreSQL and meanwhile investigating MySQL's own cluster (which a couple of years ago had not seemed ready for prime time); perhaps it's better now. But, seriously, Percona is NOT a production-ready "MySQL cluster," and the Percona team obviously cannot be bothered to track this FUNDAMENTAL problem down on their own and fix it. The response we see repeated here: "We don't have enough information" is unbelievably LAME!

                            To the Percona team, most of the people on this thread have simply moved on (as we're about to do). You're not even asking the right questions on this thread. It falls to YOU to replicate this problem and FIX it. Meanwhile, users like us that have been pleased enough with the promise of Percona cluster have devoted COUNTLESS man-hours to trying to "patch up" the underlying issue, and we're done with it. Your dismissive attitude on this thread is quite maddening, and we will no longer try to "patch up" the fact that you don't acknowledge the problem, track it down, and FIX it after three (and more) years. If you were serious about having a production-ready cluster, you'd be contacting people like us, hat in hand (rather than wanting to CHARGE for the information YOU need), hoping to get us into a screen-sharing session, so that you could review our setup in GREAT detail, so that you could HAVE the information you say on this thread you need. We'd be happy to walk you through our setup, and I believe that you'd be impressed. But in this thread you clearly indicate that you can't be bothered, and that is, flatly, ridiculous!

