Today I worked on a rather interesting customer problem. The site was subjected to what was considered a DDoS attack, and a solution was implemented to protect against it. However, in addition to banning the intruders' IPs, it also banned the IPs of web services that the application used very actively. This caused even worse problems, because the stalled calls consumed all available Apache slots. Here are a couple of interesting lessons one can learn from it.
Implement proper error control

In reality, it took some time to find the issue because there was no error reporting for unavailable web services. If the log had been flooded with messages about web services being unavailable, the problem would have been much easier to find.
Use Curl

PHP has a lot of functions that can accept a URL as a parameter and just fetch the data transparently for you. However, they do not give you the error control and timeout management that the curl module does. Use curl when possible; it is easy. You can implement your own class that fetches the required URL with a single call while providing all the timeout handling and reporting your application needs. If you're using the PHP functions, make sure default_socket_timeout has a proper value or set it per session.
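For code that has to stay on the built-in URL functions, the timeout and error handling could look roughly like the sketch below. The helper name fetch_url_with_timeout() and the 5 second value are assumptions made for illustration, not something taken from the customer's application.

```php
<?php
// Sketch only: fetch_url_with_timeout() is a hypothetical helper name.
function fetch_url_with_timeout(string $url, float $timeoutSeconds = 5.0): string
{
    // Per-call read timeout for the http:// and https:// wrappers. Without it,
    // default_socket_timeout applies (60 seconds unless it was changed).
    $context = stream_context_create([
        'http' => ['timeout' => $timeoutSeconds],
    ]);

    // Suppress the PHP warning and turn a failure into a logged exception.
    $body = @file_get_contents($url, false, $context);
    if ($body === false) {
        $error = error_get_last();
        $message = sprintf(
            'Fetching %s failed: %s',
            $url,
            $error['message'] ?? 'unknown error'
        );
        error_log($message);
        throw new RuntimeException($message);
    }

    return $body;
}

// Alternatively, lower the default for every stream this script opens.
ini_set('default_socket_timeout', '5');
```

Throwing an exception forces the caller to decide what an unavailable web service means for the page, and the error_log() call leaves a trace that is easy to spot when the service goes away.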
Set Curl Timeouts

Set both CURLOPT_TIMEOUT and CURLOPT_CONNECTTIMEOUT, as these apply to different connection stages and setting only the overall timeout is not enough.
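A minimal sketch of a curl-based fetch with both timeouts and error reporting might look like this. The function name fetch_with_curl() and the 2 and 10 second defaults are illustrative assumptions; pick values that match how long your application can afford to wait.

```php
<?php
// Sketch only: fetch_with_curl() and its default values are assumptions.
function fetch_with_curl(string $url, int $connectTimeout = 2, int $totalTimeout = 10): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,            // return the body instead of printing it
        CURLOPT_CONNECTTIMEOUT => $connectTimeout, // seconds allowed for establishing the connection
        CURLOPT_TIMEOUT        => $totalTimeout,   // seconds allowed for the whole transfer
    ]);

    $body = curl_exec($ch);
    if ($body === false) {
        // Report which service failed and why, so the log points at the cause.
        $message = sprintf(
            'curl error %d fetching %s: %s',
            curl_errno($ch),
            $url,
            curl_error($ch)
        );
        error_log($message);
        throw new RuntimeException($message);
    }

    return $body;
}
```

With only the overall timeout set, a connection attempt to a host that silently drops packets can use up that whole budget, which is why the connect stage gets its own, shorter limit here.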
Beware of PHP sessions “files” handler

I already wrote about this topic, but when troubleshooting it takes on another angle. The default files handler means the session file stays locked while the PHP request is being served. In this case, because of the network stall, a request could take 100+ seconds. Users are impatient and do not wait that long; they press reload multiple times, which just adds to the list of requests waiting on the session file lock. This not only consumes Apache slots at a much higher pace but also makes it harder to find what exactly is causing the lock, because most of the offending processes you can find in Apache “server-status” will just be waiting for the file to be unlocked.

I used “gdb” to attach to a process showing a high number of seconds since request start and find where it was stuck. If it is somewhere in the curl module (or in mysql, waiting on a long query to come back), this is the process we are looking for. If it is waiting on the session file lock, we can take that file and use fuser to see what other processes are using it. These will be either waiting on the lock or holding it, so one of them is the process we are looking for.

Things are much easier with, say, memcached session storage. It does not cause any locks for parallel session use, so only the process that actually stalls waiting on an external resource will show a high number of seconds since request start.
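To see why reloads pile up behind one stalled request, here is a small sketch of the locking behaviour. It assumes the files handler takes an exclusive flock() on the session file and that PHP runs on a platform such as Linux, where flock() locks belong to the open file handle. The helper is_file_locked() is hypothetical, and a temporary file stands in for a real session file.

```php
<?php
// Hypothetical helper: true when another open handle holds an exclusive lock.
function is_file_locked(string $path): bool
{
    $probe = fopen($path, 'c+');
    if ($probe === false) {
        throw new RuntimeException("Cannot open $path");
    }

    // LOCK_NB makes flock() return immediately instead of waiting the way a
    // second request for the same session would.
    $acquired = flock($probe, LOCK_EX | LOCK_NB);
    if ($acquired) {
        flock($probe, LOCK_UN);
    }
    fclose($probe);

    return !$acquired;
}

// Stand-in for a session file owned by a request stalled on a web service.
$sessionFile = tempnam(sys_get_temp_dir(), 'sess_');
$stalledRequest = fopen($sessionFile, 'c+');
flock($stalledRequest, LOCK_EX);

// A reload for the same session would block at this point.
$lockedDuringRequest = is_file_locked($sessionFile);

// Once the stalled request finishes, the next one can proceed.
flock($stalledRequest, LOCK_UN);
fclose($stalledRequest);
$lockedAfterRequest = is_file_locked($sessionFile);

unlink($sessionFile);
```

In a real request the second flock() call would not use LOCK_NB, so the process would simply sit there holding an Apache slot, which is what shows up in server-status.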