Recently I had a case with a web server farm where a random node went down every few minutes. I don’t mean the nodes rebooted (that happened only once or twice); rather, they slowed down so much that they practically stopped serving requests and were pulled out of the LVS cluster. The traffic was no different than usual, all other elements of the system (e.g. databases, storage) worked perfectly fine, and no one had started a backup in the middle of the day, as sometimes happens… so what was happening?
First, I will describe the setup a little. As I already mentioned, these were web servers. Each of them ran Lighttpd, which handled the requests coming from the internet. However, it was configured to serve only static content, such as images. Requests for PHP files were passed through its proxy module to Apache listening on another TCP port.
And so I started investigating the problem. As it turned out, the systems were slowing down because the Lighttpd process grew to a few gigabytes, eating all the memory, which caused the system to start swapping heavily. This usually means death for a busy online system. Initially I suspected a Lighttpd bug, as nothing else seemed wrong, but after a short while I remembered one thing that can cause such behavior. If you use Lighttpd as a proxy, it needs to buffer the entire response from the backend server before sending it to the client. And indeed, when I started browsing the Apache access log, I found entries similar to this one appearing every few minutes:
127.0.0.1 - - [15/Nov/2007:08:10:56 -0500] "GET /php/call.php?page=somearg HTTP/1.0" 200 5105062572
Apparently some bug in the PHP code caused a loop to run far too many iterations. Apache and PHP handled it without much hassle, but Lighttpd kept allocating memory to fit the entire response (about 5.1 GB, going by the size field of the log entry above) into its buffer, and that caused the whole mess.
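If you want to check your own logs for this kind of entry, something along these lines could work. It is only a rough sketch: it assumes the log uses the Common Log Format seen in the entry above (the line ends with the status code and the response size), and the function name, the 100 MiB threshold and the second sample line are made up for illustration.

```python
import re

# Tail of a Common Log Format line: "request" status size
# (assumption: nothing follows the size field, as in the entry above)
CLF_TAIL = re.compile(r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)\s*$')


def find_large_responses(lines, threshold_bytes=100 * 1024 * 1024):
    """Yield (request, size) for log lines whose response size exceeds the threshold."""
    for line in lines:
        match = CLF_TAIL.search(line)
        if not match or match.group("size") == "-":
            continue  # not a CLF line, or no body was sent
        size = int(match.group("size"))
        if size > threshold_bytes:
            yield match.group("request"), size


sample = [
    # the entry from the Apache access log
    '127.0.0.1 - - [15/Nov/2007:08:10:56 -0500] "GET /php/call.php?page=somearg HTTP/1.0" 200 5105062572',
    # hypothetical ordinary request, added only for contrast
    '127.0.0.1 - - [15/Nov/2007:08:10:57 -0500] "GET /php/index.php HTTP/1.0" 200 5120',
]

for request, size in find_large_responses(sample):
    print(f"{size / 2**30:.2f} GiB  {request}")
```

On a real server you would pass an open file object instead of the sample list, since the function only iterates over lines.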
Although this was an extreme case, it is easy to imagine problems appearing with much smaller responses being sent through the Lighttpd proxy, for example with a PHP script that handles larger file downloads and gets many concurrent requests.
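To put a rough number on that scenario, here is a back-of-the-envelope sketch. It assumes the proxy holds every in-flight response in memory in full, as described above; the function name and the figures (50 simultaneous downloads of a 100 MiB file) are hypothetical and only there to show the arithmetic.

```python
def buffered_memory_bytes(concurrent_requests, response_bytes):
    """Worst-case memory held if every in-flight response is buffered in full."""
    return concurrent_requests * response_bytes


# Hypothetical figures: 50 simultaneous downloads of a 100 MiB file.
needed = buffered_memory_bytes(50, 100 * 2**20)
print(f"{needed / 2**30:.1f} GiB")  # prints "4.9 GiB"
```

So a file that is perfectly reasonable on its own can add up to about the same memory footprint as the runaway response, once enough clients request it at the same time.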