Configuring MySQL For High Number of Connections per Second

PREVIOUS POST
NEXT POST

One thing I noticed during the observation was that there were roughly 2,000 new connections to MySQL per second during peak times. This is a high number by any account.

When a new connection to MySQL is made, it can go into the back_log, which effectively serves as a queue for new connections on operating system size to allow MySQL to handle spikes. Although MySQL connections are quite fast compared to many other databases it can become the bottleneck. For a more in depth discussion of what goes on during a connection and ideas for improvement, see this post by Domas]),

With MySQL 5.5 default back_log of 50 and 2000 connections created per second it will take just 0.025 seconds to fill the queue completely if requests are not served, which means even very short stall in main thread which is accepting connections will cause for some connection attempts to be refused.

What back_log variable specifies is second parameter to listen() operating system call. The maximum value on Linux is directed by tcp_max_syn_backlog sysctl parameter unless syncookies are enabled.

The simple solution in this case was to increase the value of back_log to one that could handle longer bursts Increasing back_log to 1000 in this case would give us enough of a queue to handle up to 500ms stall which is good enough for most cases, so it’s important to understand your application workload and tune specifically to your needs.

This got me thinking of what the disadvantages of setting this value equal to the OS limit (/proc/sys/net/ipv4/tcp_max_syn_backlog) and the only limitation would be that you could potentially have a large number of connections waiting on connection instead of failing quickly and potentially connecting to different server, but that can be fixed by setting the client connect timeout setting (which not a lot of people do).

Another important setting if you’re working with many connections per second is thread_cache which impacts cost of connections dramatically. You want it to be set to the value so no more than couple of threads are created every second during normal operation.

I would note if you’re having more than 1000 connections/sec you’re getting pretty close to what could be the limit and you should consider techniques to reduce number of connections. Namely persistent connections and connection pools might be good solution for many applications.

PREVIOUS POST
NEXT POST

Comments

  1. Olivier Doucet says

    Hi,

    I encountered the same problem a few weeks ago. Increasing back_log solved the problem. Unfortunately back_log variable is not dynamic and requires to stop mysqld.
    Also take a look at the following sysctl variables : net.core.netdev_max_backlog ; net.core.somaxconn

  2. Vojtech Kurka says

    Ryan, have you ever witnessed a case, when all connections to mysqld are made from 1 IP address?
    I mean the case, when you have a powerful webserver instance, making more than 1000 connections/sec, and it runs out of emphemeral ports (due to tcp WAIT_TIMEOUT handling). I know setting net.ipv4.tcp_tw_recycle = 1 solves the problem, but can cause other problems for other TCP connections on the same machine and is far less tested than the default setting. When using net.ipv4.tcp_tw_reuse = 1, we get a lot of TCP SYN retransmissions (with painful 3 sec timeout), the kernel probably cannot handle efficiently a large table of tcp connections.

    Thank you, Vojtech

  3. says

    Vojtech,

    It is indeed common problem. Though when you have say just small number of web servers it makes sense to stick to persistent connections with MySQL as you normally can keep all of them open. It is when you get to hundreds of web servers you might not be able to have max_connections high enough.

    Other trick we have been using is to map more than one IP address to your MySQL server. if you connect by server name you can even make it without changing server configuration.

    Another important thing is to ensure ip_local_port_range is large enough. Something like net.ipv4.ip_local_port_range = 1024 65535 can be good idea.

    Other non standard and unsafe options which can work is to set tcp_fin_timeout to some small value, something like 10 seconds.

    Really if someone wants to put the good Linux tcp stack configuration parameters which worked well for them for large number of connections per second and caused no other problems it would be great to see.

  4. Vojtech Kurka says

    Peter, thank you for the reply!

    We have already set the emphemeral port range to the max setting. So I assume we definitely need to start using persistent connections, I hope the php/mysqli implementation is stable enough for production.

  5. says

    mysqlnd (which can be used in mysqli) have been around for a while. Many people using persistent connections quite well especially in simple applications where each user interaction will be only one/few queries.

    Now on back_log not being dynamic – as I mentioned it is parameter for listen call for main socket which server creates during start. I wonder if there is any way it can be adjusted after socket is created as re-creating socket would be disruptive

  6. Olivier Doucet says

    Peter,
    There is a difference between recreating socket (a few milliseconds of downtime), and having to restart MySQL (can take several minutes to stop depending on configuration, and some time after restart to get it working smoothly).

  7. says

    It looks as if a paragraph is missing from the start of this post.

    This change actually made a huge difference for us.

  8. says

    It looks like a paragraph is missing from the start of this.

    Changing this setting to 500 greatly dropped connection errors we were having. Would increasing it further help, or is there really any negatives to increasing it?

  9. says

    Will,

    You should be able to go higher to 2000 or so. The main negative with long queue is the request can spend long time in the queue and at certain point it just becomes better to time out.

    As Ryan mentions connection timeout makes it safe to go even with very high queue times. It is also a good practice to measure connection times.

    BTW what error message did you get for connection errors ? I’m interested in Error code.

  10. says

    Peter,

    Looks like my comment got posted twice.

    This post actually appears to be written about an issue we were having. The error our app was receiving was simply “Can’t connect to mysql server”. We recently decreased our app’s mysql timeout, which is when this started to become more visible. Prior to that, we would have connections sit waiting for a response from the server until PHP execution timed out. There was no error or response sent by the server, only the PHP timeout.

  11. says

    Will, Yes, most likely.
    When you have Can’t connect error message there should be some error code:

    [root@localhost ~]# mysql -h 10.1.1.1
    ERROR 2003 (HY000): Can’t connect to MySQL server on ‘10.1.1.1’ (113)
    [root@localhost ~]# perror 113
    OS error code 113: No route to host

    In this case we have No route to host error. In case listen socket queue overflow you should be getting different error code instead of 113

  12. says

    Thanks,

    This is interesting. I thought Interrupted System Call would happen on connect timeout (which is often triggered with sig_alarm) and failing to connect due to full backlog happening with other error code.

  13. says

    Thanks for the clear explanation Ryan!

    After encountering connection issues near Christmas I came across a really old post from 2008 describing exactly our problem and pointing out the back_log setting needs to be increased (after they consulted Percona). After that I’ve been searching for a clear explanation what the back_log setting exactly does but I always ended up with the two lines supplied by the MySQL documentation.

    Obviously the real cause behind our problem is that we do get short bursts of many processes spawning on the webservers and not using persistent connections. However given our webserver pool size, the short ttl of the processes it would be very dangerous to actually go for persistent connections. We’ll just have to solve this issue in our new architecture and got to hang on till then. ;)

    It is also good to know that we can go as far as 2000. On one cluster we already set it to 200 and we still see issues popping up from time to time.

  14. says

    First, very interesting post.
    I have a question about this: is the server (mysqld or os) aware in anyway of the client connection attempt?
    The error is reported by the mysql client library and this error #4 as well as others are reported even if you try to connect to a non-existing host, so completely independent from the server.
    So on the client side you can’t see the real reason of the error, (4) could be non existing host, timeout or connections burst.
    Will you find any trace of the failed attempt on the server in case it is the problem explained here (backlog)?

    Moreover I have a question, why the error code changes if I set a timeout on the client >=63 seconds?
    I tested also from php-mysql so it must be hardcoded somewhere, may be in the libmysqlclient code.

    [claudio@Device-674DE2 ~]$ mysql -h 172.0.0.1 –connect-timeout=1
    ERROR 2003 (HY000): Can’t connect to MySQL server on ‘172.0.0.1’ (4)
    [claudio@Device-674DE2 ~]$ mysql -h 172.0.0.1 –connect-timeout=62
    ERROR 2003 (HY000): Can’t connect to MySQL server on ‘172.0.0.1’ (4)
    [claudio@Device-674DE2 ~]$ mysql -h 172.0.0.1 –connect-timeout=63
    ERROR 2003 (HY000): Can’t connect to MySQL server on ‘172.0.0.1’ (110)

    [claudio@Device-674DE2 ~]$ perror 4
    OS error code 4: Interrupted system call
    [claudio@Device-674DE2 ~]$ perror 110
    OS error code 110: Connection timed out

    Thanks a lot

    Claudio

  15. says

    Hi Ryan/Peter and all,
    We have recently come up against the MySql (4) connection issue, and am trying to track down a rough guideline as to settings to stop it. We are running 6 slaves in one scenario, and a couple in another. It seems to be happening ‘randomly’ and only about 2-3 times a week.

    I have been playing with the back-lof / max-connections settings, and running multiple versions of a simple test script against my local DB. The test script loops round 1 million times, connecting, querying processlist, and disconnecting, on non-persistent connections.

    So far I have found that :
    1. If back_log is a lot higher than max-connections (bl= 200, mc = 10), I get a ‘Too many Connections error’.
    2. If back_log is a lower than max-connections (bl=5, mc = 10), I get a ‘Cant connect to Mysql (4)’ error.
    However, I have just tried setting back-log to 2 above max-connections (bl=50, mc=48) , and after 800k connections I got a:
    Can’t connect to MySQL server on ‘nn.nn.nn.nn (99)
    When I dropped the back-log to 46, leaving max-connections at 48, I am able to run in excess of 3 million connections without error, at time of writing…(still running)

    Just for completeness my ipv4 settings are as follows:
    1. tcp_max_syn_backlog 2048
    2. tcp_fin_timeout 60
    3. ip_local_port_range 32768 61000
    4. tcp_tw_recyle 0
    5. tcp_tw_reuse 0
    6. somaxconn 128

    My question is… is 2 a ‘magic number’. Does setting the back-log to 2 below the max-connections have some effect on the way MySql handles connections?

  16. says

    Peter,

    Good you have found settings which are working for you ! I would not set back_log to less than max_connections as this indeed can increase the chance of max connection error. I also wonder whenever the numbers are using are practical.

    Typically I see back_log set 50 to 2000 range while max_connections 200 to 5000 range (though note I 5000 connections assumes only few will be active at the time)

  17. says

    Hi All,

    Just got this back from Oracle Support. The Mysql Connection (4) error is actually a BUG. It should be a 110 (Timedout), as per:
    http://bugs.mysql.com/bug.php?id=64333

    This is fixed in 5.6, but will appear as an (4) error in all versions prior to 5.6. It should return a ETIMEDOUT not the EINTR that it does.

    Wish I had checked that out first, would have saved a couple of weeks of tracking down why our code wasn’t working… when it was :-)

  18. says

    Not surprised especially because the error magically changes after 63 seconds.
    I would be curious to know there is that ’63’ value hardcoded (see my previous comment).
    Claudio

  19. says

    Thank you for article, I was getting frequent error of too many conections, solved it by persistent connection trick..

  20. says

    Hi there,
    Just to update we have been getting more of these as we grow, so was looking even deeper into it. We seemed to be getting the magical (4) error at a time when we had not many connections, but were doing some very intensive DB work, passing large chunks of data around (please don’t ask… its old code that we are dealing with). However, as it always happened when these chunks of data were being passed around I got ‘interested’.

    Our standard connect_timeout is set to 1s, as opposed to the default 60s (Claudio… this may be of interest). By using tc/netem we are able to mimic a high network usage. So, using the command below:
    sudo tc qdisc add dev eth0 root netem delay 1000ms
    this delays all traffic by 1s, which causes the (4) error every time.
    Downing that value to 800ms let one or two connections through.
    Downing to 500ms let quite a few more through.
    Upping the timeout to 2s, allowed them all through.

    Essentially we could do the same by filling the network up with large (8mb+) strings of data. But this helped to prove the point. Basically, the network was not allowing the connection handshake through in the correct time, so therefore getting a timeout error, during the handshake process.

    So it could also be that people are experiencing this during an abnormal transaction, or high usage. By increasing the timeout to 2s, you get over these odd periods, without affecting the rest of the system.

    Hope this helps some people

  21. zrlw says

    “When using net.ipv4.tcp_tw_reuse = 1, we get a lot of TCP SYN retransmissions (with painful 3 sec timeout), ”
    please check your kernel version.
    kernels < ~2.6.24.5 have some bug when ipv6 is disabled that results in 3 sec delay for some specific case.

    As "http://www.linuxquestions.org/questions/linux-networking-3/3000ms-delay-on-tcp-connections-613670/&quot; says:
    – "Sometimes" the client try to establish TCP connections to the server (I had the problem with MySQL as well). To do so, it send a TCP SYN packet.
    – The server replies by a "SYN,ACK" packet
    – At this point, the client must have received the "SYN,ACK" packet, reply with a "ACK" packet and set the connection to "ESTABLISHED". Unfortunately, it does not.
    What the client does instead, is for an unknown reason (yet, I hope), to ignore the "SYN,ACK" packet the server sent. This triggers a timeout whose duration is exactly… 3 seconds (hardcoded in the kernel).
    I have for now absolutely no idea why the ipv6 stack solves this problem, but I seems to be a good idea to enable it for now.

  22. Joey Aldrin Cruz says

    Thanks a lot for this. And by the way all of the blogs that the Percona Team posted helps us a lot. Not only ME but also for all those DBA’s out there that seeking for the right answer, when there are new issues / problems occurred and we are not familiar with.

  23. Roy says

    Hi Ryan,

    Thanks for the insight. i have been trying to get around a situation which has plagued my application for more than a week now. i am not sure if this situation is what you might have faced and if back_log would solve. we recently had to shift our database from the network to another cloud network and that is when the hell broke lose. the DB performs well when the number of users are average, but the moment users increase the DB just stops accepting more connections. some occasions i could read but not write. and on some i cant do either. i am accessing the DB over public IP instead of local network IP where it was working fine. i do not think accessing over public IP can be an issue. but the latency in network might be.

    any words for such situation as to how to tackle the issue. i am completely clueless about whats stopping the server to create more connections since the config stats max connection can be allowed upto 1000.

    Roy

Leave a Reply

Your email address will not be published. Required fields are marked *