One of the most typical reasons for performance and scalability problems I encounter is simply failing to do the math. And these are typically bad one because it often leads to implementing architectures which are not up for job they are intended to solve.
Let me start with example to make it clear. Lets say you’re doing some reports from your apache log files – how many distinct visitors hit the page and stuff like that. You picked full logs because they are great in flexibility – you can run any adhoc queries and drill down as much as you like. Initially traffic was small and young and with 10000 page views a day you few days of history the queries there instant which gave you a confidence this approach will work.
As the time passes and you get 1.000.000 events per day and looking to do reporting for up to the whole year worth of data you find things not working any more with response times for individual queries taking half an hour or more when it previously took seconds.
So what math would be in the case like this ? Say you have query like “SELECT page,count(*) cnt FROM logs GROUP BY page ORDER BY cnt DESC LIMIT 100″ to find the most visited pages as a simple example.
Before you can do the math (or say apply mathematical model) you really need to understand how things work. I like to call it having X-Ray vision. If you do not understand what is happening and you see MySQL Server (or any system really) as a black box magically providing you with results you can’t model its behavior which means you have little or no ability to predict it without running benchmarks.
There are couple of ways MySQL may execute query above but lest focus on the most typical one – scan the table, when for each row insert (or update) row in the temporary heap table. After temporary table is populated to the sort and return top 100 rows.
Now there are some important cases to consider which affects the numbers for the model significantly. The table may be in memory or on disk which affects scan speed. Another aspect is the temporary table – the number of rows MySQL can insert/update depends on whenever temporary table can be kept in memory or it spills over to the disk as MyISAM table. Finally there is a sort which can happen in memory or require files to be created on disk.
Understanding these conditions is very important as it allows to predict how performance will change with data size, cardinality or other factors and what optimizations can be important – for example increasing maximum allowed temporary table size in the given case.
But let us get back to practice and do some numbers. For MyISAM table and longer rows as you see in Apache logs I’d estimate 500K rows/sec scanned if data is in memory. In this case CPU is normally the limiting factor. In case we hit the disk the disk read speed becomes the limiting factor – for example with 500 byte rows and 100MB/sec scan speed we can read 200K rows/sec. Note this number can be affected a lot by file fragmentation, on other hand there is also caching which may be taking place. Depends on what your goals with modeling is you can use worse case scenario or some average figure – just understand what figure you’re using.
It is interesting to see in terms of scan speed the difference is not so large for data in memory vs data on disk, the access patterns which have more “random” access patterns – some form of index accesses, joins etc may be slowed down 100-1000 then going from CPU bound to IO bound workload.
You can get approximate numbers for other parts of query execution running micro benchmarks as well.
Running some benchmarks you can also see how many rows query can process per second in total. Lets assume it is 100K rows per second when data is on disk but temporary table fits and memory and sort happens in memory too. This is of course some simplification as processing speed may well be non linear depending on the data size but it can do as a ball park figure.
Having this data we can see the single day report with 10000 events per day is expected to take quite nice 100ms while 10M rows a day even for 30 days will take 300 seconds which will not be acceptable for interactive reports.
Finally let me talk about modeling vs benchmarking for capacity planning. I’m sure you need both but on the different stages.
Micro benchmarks are very helpful to get the numbers you can feed into your model. Using the model we can get a quick feel if things are going to work or they will not. Finally when prototype or complete application is built good benchmarks are important to get exact numbers for the application and see if they match your model predictions. As result of comparison you can discover problems with the model (too bad) or problems with implementation when things just do not work as fast as they should and you can often take some steps to fix them.