November 23, 2014

Worse than DDOS

Today I worked on a rather interesting customer problem. The site was subject to what was considered a DDOS attack, and a solution was implemented to protect against it. However, in addition to banning the intruders' IPs, it banned the IPs of web services which were very actively used by the application, which caused even worse problems by consuming all of the Apache slots. Here are a couple of interesting lessons one can learn from it.

Implement proper error control

In reality it took some time to find the issue because there was no error reporting for the situation where the web services were unavailable. If the log had been flooded with messages about web services being unavailable, it would have been much easier to find.
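Even the simplest check-and-log pattern would have made this failure obvious. A minimal sketch (fetch_service() and the service name are hypothetical placeholders, not from the actual application):

    <?php
    // fetch_service() is a hypothetical helper which returns false on failure.
    $response = fetch_service('http://rates.example.com/api');
    if ($response === false) {
        // Without this line the failure is silent and the stalled request is all you see.
        error_log('web service rates.example.com unavailable');
    }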

Use Curl

PHP has a lot of functions which can accept a URL as a parameter and fetch the data transparently for you. However, they do not give you good error control and timeout management compared to the curl module, so use curl when possible – it is easy. You can implement your own class to fetch a required URL with a single call while having all the timeout handling and error reporting your application needs. If you're using the plain PHP functions, make sure default_socket_timeout has a proper value or set it per request.
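For the plain PHP functions, the timeout can also be set at runtime for the current request; a minimal sketch (the 5-second value and the URL are just illustrative):

    <?php
    // default_socket_timeout applies to file_get_contents(), fopen() URL wrappers, etc.
    // The PHP default is 60 seconds; 5 here is an arbitrary example value.
    ini_set('default_socket_timeout', '5');

    // A stalled remote service now fails after ~5 seconds instead of hanging.
    $data = file_get_contents('http://service.example.com/api');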

Set Curl Timeouts

Set both CURLOPT_TIMEOUT and CURLOPT_CONNECTTIMEOUT, as these apply to different stages of the connection, and setting just the overall timeout is not enough.
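Here is a sketch of the kind of single-call wrapper mentioned above, combining both timeouts with error reporting (the function name and timeout values are mine, purely illustrative):

    <?php
    // Illustrative fetch helper: both curl timeouts plus loud error reporting.
    function fetch_url($url, $connect_timeout = 2, $total_timeout = 5)
    {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        // CURLOPT_CONNECTTIMEOUT caps the time to establish the TCP connection...
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $connect_timeout);
        // ...while CURLOPT_TIMEOUT caps the entire transfer, response body included.
        curl_setopt($ch, CURLOPT_TIMEOUT, $total_timeout);

        $result = curl_exec($ch);
        if ($result === false) {
            // Report which URL failed and why, so the log tells the story.
            error_log('fetch_url failed for ' . $url . ': ' . curl_error($ch));
        }
        curl_close($ch);
        return $result;
    }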

Beware of the PHP sessions “files” handler

I have already written about this topic, but when troubleshooting, it all takes on another angle. The default “files” handler means the session file stays locked while the PHP request is being served. In this case, because of the network stall, a request could take 100+ seconds. Users are impatient and do not wait that long, pressing reload multiple times… which just adds to the list of processes waiting on the session file lock. This not only consumes Apache slots at a much higher pace but also makes it harder to find what exactly is causing the lock, because most of the offending processes you can find in Apache’s “server-status” will just be waiting for the file to be unlocked.

I used gdb to connect to the process showing a high number of seconds since request start to find where it was stuck. If it is stuck somewhere in the curl module (or in MySQL, waiting on a long query to come back), this is our culprit. If it is waiting on the session file lock, we can take that file name and use fuser to see what other processes are using the file; these will be either waiting on the lock or owning it, so one of them is the process we’re looking for.

Things are much easier with, say, memcached session storage: it does not cause any locks for parallel session use, so only the process which actually stalls waiting on an external resource will show a high number of seconds since request start.
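As a partial mitigation at the code level (session_write_close() also comes up in the comments below), the session lock can be released before the slow external call. A sketch with illustrative names:

    <?php
    session_start();

    // Do all the session reads/writes the page needs up front.
    $_SESSION['last_page'] = 'report';

    // Release the session file lock before the slow external call so parallel
    // requests for the same session are not serialized behind this one.
    session_write_close();

    // A network stall here now blocks only this request, not the whole session.
    $data = file_get_contents('http://service.example.com/slow-api');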

About Peter Zaitsev

Peter managed the High Performance Group within MySQL until 2006, when he founded Percona. Peter has a Master's Degree in Computer Science and is an expert in database kernels, computer hardware, and application scaling.

Comments

  1. bfw says:

    oh, thanks for the tips – some just saved my life

  2. Ferdy says:

    How would memcached session handling not lock access to the data?
    If I am not mistaken, session files lock on purpose to prevent another process from changing values during execution.

    To release the lock in the middle of a script, use session_write_close()

  3. peter says:

    Ferdy,

    Yes, I know about session_write_close(), however it is a code change, which is a different angle of tuning, and in this case the information is stored in the session after the web server call returns.

    Session files are locked on purpose, and I’m not sure how memcache avoids locking; this is just what I’ve seen in practice. I guess with memcache you would just have a race condition: with concurrent requests being executed, it is possible for the second request to overwrite the first request’s data. If this is the case, it is surely something to watch out for when you write apps.

    I honestly would prefer explicit locking to an implicit one – if an application suffers from race conditions, let it synchronize to avoid them. MySQL users often have so many race conditions in their apps that they do not care about anyway :)

  4. Kevin says:

    Check out the Spinn3r client design guidelines we wrote up:

    http://code.google.com/p/spinn3r-client/wiki/ClientDesignGuidelines

    We have clients pounding us with requests (about 10M requests per client per month) and a non-trivial amount of bandwidth… (hundreds of GBs per month).

    We wrote up the common mistakes we found with API implementations.

    … might help, and should be interesting at the least.

  5. peter says:

    Thanks Kevin, good notes. Though I guess not all of them would apply to “real time” web services use. For example, “always retry but wait 30 sec in between” is perfect for batch jobs, but for real-time page views, if a call does not work you may not have time to retry, and surely not to wait 30 sec :)
