Content delivery system design mistakes


This week I helped deal with performance problems (partly MySQL related and partly general LAMP issues) in a system that does quite a bit of content delivery, serving file downloads and images – something a lot of web sites need to do these days. There were quite a few mistakes in the design of this one which I thought were worth noting, along with some issues seen in other systems.

Note this list applies to static content distribution; dynamic content has issues of its own which need different treatment.

DNS TTL Settings The system was using DNS-based load balancing, using something like img23.domain.com to serve some of the images. I’m not a big fan of purely DNS-based load balancing and HA, but it works if configured well. In this case, however, the problem was a zero TTL set in the DNS configuration. This obviously adds latency, especially for “aggregate” pages which may require images to be pulled from 10 different image servers.
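As a sketch, a zone file could give the image hostnames a reasonable TTL instead of zero (hostnames, addresses and TTL values here are illustrative, not from the system described):

```
; BIND zone fragment - TTL is the second field, in seconds
img23.domain.com.  3600  IN  A  192.0.2.23   ; 1 hour: resolvers can cache it
img24.domain.com.  3600  IN  A  192.0.2.24
```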

Keep Alive In my previous post I wrote that you often do not need keep-alive for dynamic pages (there are exceptions), but you really should have keep-alive enabled while serving images. It especially hurts not to have it when 30 thumbnails are loaded per page.
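For Apache, enabling it is a couple of directives (the numbers below are illustrative starting points, not recommendations from the post):

```apache
# httpd.conf fragment - keep-alive for an image server
KeepAlive On
MaxKeepAliveRequests 100   # plenty for a page with dozens of thumbnails
KeepAliveTimeout 5         # seconds; keep it short so workers are freed quickly
```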

Use Proper Web Server This one is pretty interesting. Many have learned that Apache is not good for serving static content, especially with many thousands of keep-alive connections. Lighttpd is often named as a faster alternative. That is surely true if you serve static content from memory or serve large files (in which case read-ahead helps), but if you’re serving many millions of thumbnails it may not work well. Lighttpd 1.4 is single-threaded, and that single thread handles both network and IO, which is not going to scale, especially if you have storage built from a large number of hard drives. This problem is fixed in Lighttpd 1.5, which is however still in pre-release (though we have it running in production). There are other solutions, such as nginx, which do not seem to have this problem. My point with this item is not to advise you which web server to use, but to point out that serving dynamic content, serving in-memory static content, and serving static content from disk are different tasks and should not be mixed.
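As one possible shape of a dedicated static server, an nginx configuration might look like this (paths and worker count are illustrative assumptions; multiple worker processes mean a request blocked on disk IO does not stall everything else):

```nginx
# nginx.conf fragment - dedicated static file server
worker_processes  8;              # several workers so one request waiting on
                                  # a slow disk does not block the others
server {
    listen       80;
    server_name  img23.domain.com;
    root         /var/www/images; # illustrative path
    sendfile     on;              # let the kernel copy file data directly
}
```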

Use noatime mount option When serving images you rarely need the last access time tracked for them, especially as this tracking can be quite expensive when a lot of different files are accessed concurrently. It is especially worth noting that updating the access time needs IO even when the content is in the OS cache. The simple solution is to use the noatime mount option for your static content partition. Some web servers also support the O_NOATIME file open flag on newer Linux versions, which can also be used.
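The mount option goes in /etc/fstab so it survives a reboot (device, mount point and filesystem type below are illustrative):

```
# /etc/fstab line - static content partition mounted without atime updates
/dev/sdb1  /var/www/images  ext3  defaults,noatime  0 0
```

The same effect can be had immediately with `mount -o remount,noatime /var/www/images`.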

Using PHP to serve files This usually comes into play if security controls for files or per-user traffic limits need to be enforced. Serving a file by simply reading it in PHP (or any other heavyweight programming language) and sending it back is the worst thing you can do. The optimal solution would be to use a server module for access control, or to hack one up if you need some special functionality; few people, however, have the skills for this kind of job. The other solution is to use X-Sendfile or similar custom headers, which let the PHP script just check restrictions and have the web server send the file. This technique was actually used in the project we’re speaking about, with the exception of resuming file downloads. In lighttpd 1.4 you could not tell the server to send only part of the file, so that had to be implemented in PHP. However, even a small portion of such downloads caused a lot of trouble, especially as lighttpd tries to buffer all content it gets from the PHP script, which means many megabytes for large files. Partial file sending was happily added in lighttpd 1.5, which was one of the reasons to move to it quickly.
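A minimal sketch of the technique: PHP only checks the restriction and emits the header, and the web server streams the bytes. The `check_access()` helper and the paths are hypothetical placeholders; the header name varies by server (X-LIGHTTPD-send-file in lighttpd 1.4 with `allow-x-send-file` enabled, X-Sendfile with Apache's mod_xsendfile).

```php
<?php
// Sketch: the script only authorizes the download; the server sends the file.
// check_access() and the directory are hypothetical, not from the project.
if (!check_access($_SESSION['user'], $_GET['file'])) {
    header('HTTP/1.0 403 Forbidden');
    exit;
}
// lighttpd 1.4 header name; Apache's mod_xsendfile uses "X-Sendfile"
header('X-LIGHTTPD-send-file: /data/files/' . basename($_GET['file']));
```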

Thumbnail size In many cases you need multiple sizes of thumbnails, such as a very small size for list mode, a standard size for preview mode, and maybe something else. It looks very attractive to keep only one thumbnail and simply set browser image dimensions when you need to fit it into a smaller space. In practice, however, it is not so good. First, browsers may not resize the image with optimal quality, and it may just look ugly compared to properly created thumbnails. It also makes things a lot slower. I have, for example, seen an 8KB thumbnail in a case where the smallest one should have been less than 1KB in size. With 30 images on the page that is 240KB of download versus 30KB, which makes quite a bit of difference, especially on slower connection speeds.
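The page-weight arithmetic above can be sketched directly (the sizes are the illustrative figures from the text, not measurements):

```python
# Page weight for a 30-thumbnail page: one oversized "universal" thumbnail
# versus a thumbnail generated at the size actually displayed.
THUMBS_PER_PAGE = 30
oversized_kb = 8   # single thumbnail reused and downscaled by the browser
proper_kb = 1      # thumbnail pre-generated at display size

oversized_page_kb = THUMBS_PER_PAGE * oversized_kb   # 240 KB per page view
proper_page_kb = THUMBS_PER_PAGE * proper_kb         # 30 KB per page view
print(oversized_page_kb, proper_page_kb)
```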

Forgetting to set expire Static content rarely changes; in many implementations it may never change, as loading a new version of the same object will result in a different URL. This means you really need to have Expires headers set for it; otherwise you will get a lot of cache revalidation requests, which still require stat() calls on the web server side and which can be avoided. How far in the future you want the expiration to be depends on the application, of course. It can range from a few hours for objects whose changes you want to be quickly visible, to infinity for objects which only change together with their URLs.
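With Apache this is typically done with mod_expires; the intervals below are illustrative, matching the "hours to effectively forever" range the text describes:

```apache
# httpd.conf fragment - far-future Expires for content whose URL changes
# with its content, shorter for content updated in place
ExpiresActive On
ExpiresByType image/jpeg "access plus 1 year"
ExpiresByType image/png  "access plus 1 year"
ExpiresByType text/html  "access plus 2 hours"  # changes should show up soon
```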

Server side caching The benefit of server-side caching is not so obvious for serving static content, or rather, it depends on the situation. If you serve a lot of small files which can fit in, for example, squid’s in-memory cache, it may work quite well and in fact will give better memory utilization, as the OS cache has single-page granularity (meaning a 100-byte file will still take 4KB in cache). For a large set of files (which does not fit in memory) performance can go down, as requests will frequently be made to the main server to get the file. It also frequently does not make sense to use a disk cache for static content, as getting it from the server may be close in speed. It also, of course, depends on the server you’re using – Apache in prefork mode (i.e. the same server used for static and dynamic content) would likely benefit a lot from one.
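A squid setup along these lines might keep only small objects in RAM and skip the disk cache entirely; treat this as a sketch with illustrative sizes (directive names as in squid 2.x, where the `null` store type must be compiled in):

```
# squid.conf fragment - memory-only cache for small static objects
cache_mem 512 MB                       # in-memory object cache
maximum_object_size_in_memory 64 KB    # only small thumbnails kept in RAM
cache_dir null /tmp                    # no disk cache at all
```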

Using different servers As I already mentioned, serving different kinds of content requires different web servers or different configuration, so it is a good idea to use a dedicated server name, such as static.domain.com, for serving static content – even if you do not need it now, it gives you flexibility in the future. Even on the same server, it allows you to easily configure a separate virtual host with different settings.
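On a single Apache instance, that separate name is just another virtual host with its own settings (names and paths are illustrative):

```apache
# httpd.conf fragment - dedicated vhost for static content
<VirtualHost *:80>
    ServerName   static.domain.com
    DocumentRoot /var/www/static
    KeepAlive On            # tuned for many small image requests
</VirtualHost>
```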


Comments

  1. Alexey says

    Just a few comments:

TTL – I think most caching DNS servers have a TTL threshold, so even if your zone has a lower TTL, it won’t be checked more often anyway. There are no exact numbers, but if you use TTL values lower than a day, you can’t rely on them.

KA – it’s best to offload idle keep-alive connections to a front-end web server.

Noatime – don’t forget to use the ‘nodiratime’ flag in addition to noatime. Regarding the O_NOATIME flag, you have to have root privileges to use it on Linux. IMO, running a daemon that serves static content as root is a big no-no.

Expires – 1) setting it too high may cause browsers to ignore it by default. Make sure you try different values and see which one produces fewer requests. 2) don’t forget to update the Expires date when answering requests containing an If-Modified-Since header.

  2. peter says

    Thank you Alexey,

I’m not saying you can rely on TTL; my point is that setting it too low causes extra overhead which you would like to avoid.

KA – I’m speaking about a simple configuration right now. Smart front-end servers may change things, but they are not always present, especially for static content.

Thank you for the note about nodiratime, which is also helpful if you have a lot of directories. To be honest, I did not check the root requirement for O_NOATIME, as I go the simple way and just use mount options.

Good note about Expires. What do you mean about answering the If-Modified-Since header? I guess for static content the web server’s expire handling module will handle it for you if something needs to be done.

  3. says

    TTL:
By default Windows XP uses the lesser of 1 day or the value specified in the result. I am not sure what happens at 0. A value of 1 is treated as disabled.
    Ref: http://support.microsoft.com/kb/318803
We handle a lot of server migrations for shared hosting companies and have found that many ISPs have their own thresholds. I’ve seen some smaller ones with values as high as 2 days. Other networking gear may also cache DNS with various thresholds. Though certainly worth considering, I have not seen too many cases where latency caused by DNS lookups is significant compared to other sources.

CDN Compression: We’ve used a couple of CDNs, and some do their own compression. We had an issue with PDFs once due to compression at the CDN level, even though we had explicitly disabled compression on the source servers. If compression is applied by file type with no size limits, this can impact the user experience with some types of files.

  4. mephius says

You could use the X-Accel-Redirect header with nginx, which implements an idea similar to X-Sendfile in Lighttpd.

  5. says

Older browsers had a number of bugs, so should we go back to old HTML 1.0 and avoid JavaScript altogether?

You should choose what you’re going to be compatible with, and if the browsers you want to be compatible with have some issues, you can work around them – for example, disable JavaScript, compression, keep-alive, or ranged requests based on user agent.
