October 25, 2014

Just do the math!

One of the most typical causes of performance and scalability problems I encounter is simply failing to do the math. And these are typically bad ones, because failing to do the math often leads to implementing architectures which are not up to the job they are intended to solve.

Let me start with an example to make it clear. Let’s say you’re doing some reports from your Apache log files – how many distinct visitors hit the page and things like that. You picked full logs because they give you great flexibility – you can run any ad-hoc queries and drill down as much as you like. Initially traffic was small, and with 10,000 page views a day and only a few days of history the queries were instant, which gave you confidence this approach would work.

As time passes you get 1,000,000 events per day and are looking to report on up to a whole year’s worth of data, and you find things no longer working – response times for individual queries take half an hour or more when they previously took seconds.

So what would the math be in a case like this? Say you have a query like "SELECT page, COUNT(*) cnt FROM logs GROUP BY page ORDER BY cnt DESC LIMIT 100" to find the most visited pages, as a simple example.

Before you can do the math (or, say, apply a mathematical model) you really need to understand how things work. I like to call it having X-ray vision. If you do not understand what is happening and you see MySQL Server (or any system, really) as a black box magically providing you with results, you can’t model its behavior, which means you have little or no ability to predict it without running benchmarks.

There are a couple of ways MySQL may execute the query above, but let’s focus on the most typical one – scan the table and, for each row, insert (or update) a row in a temporary HEAP table. After the temporary table is populated, do the sort and return the top 100 rows.
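Conceptually (this is just a rough Python illustration of those steps, not MySQL internals – the function and field names are made up for the sketch) the plan looks something like this:

    # Rough model of the execution path described above: full scan,
    # insert-or-update into an in-memory aggregate "table", then sort
    # and return the top 100.
    from collections import Counter

    def top_pages(log_rows, limit=100):
        counts = Counter()              # stands in for the temporary HEAP table
        for row in log_rows:            # full table scan
            counts[row["page"]] += 1    # one insert/update per scanned row
        return counts.most_common(limit)  # sort the temporary table, keep top N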

Now there are some important cases to consider which affect the numbers for the model significantly. The table may be in memory or on disk, which affects scan speed. Another aspect is the temporary table – the number of rows MySQL can insert/update per second depends on whether the temporary table can be kept in memory or spills over to disk as a MyISAM table. Finally there is the sort, which can happen in memory or require files to be created on disk.

Understanding these conditions is very important, as it allows you to predict how performance will change with data size, cardinality or other factors, and what optimizations can be important – for example, increasing the maximum allowed temporary table size in this case.
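As a very rough sanity check of that decision (a sketch with made-up numbers – real MySQL memory accounting is more involved), you can compare the estimated temporary table size against the configured in-memory limit:

    # Back-of-envelope check: will the GROUP BY temporary table stay in
    # memory or spill to disk? All figures below are assumptions.
    distinct_pages  = 200_000        # assumed cardinality of the "page" column
    bytes_per_entry = 300            # assumed average footprint per temp-table row
    tmp_table_limit = 16 * 1024**2   # assumed 16MB in-memory temporary table limit

    estimated_size = distinct_pages * bytes_per_entry
    print("estimated temp table size: %.1f MB" % (estimated_size / 1024**2))
    print("fits in memory" if estimated_size <= tmp_table_limit else "spills to disk")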

But let us get back to practice and do some numbers. For a MyISAM table with longer rows, as you see in Apache logs, I’d estimate 500K rows/sec scanned if the data is in memory; in this case CPU is normally the limiting factor. In case we hit the disk, the disk read speed becomes the limiting factor – for example, with 500-byte rows and a 100MB/sec scan speed we can read 200K rows/sec. Note this number can be affected a lot by file fragmentation; on the other hand, there is also caching which may be taking place. Depending on what your goals with modeling are, you can use the worst-case scenario or some average figure – just understand what figure you’re using.
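To make that arithmetic explicit, here it is as a small sketch (the throughput and row-size figures are just the assumptions from above):

    # Disk-bound scan rate estimate from the figures above.
    disk_throughput = 100_000_000   # ~100MB/sec sequential read, assumed
    row_size        = 500           # ~500 bytes per log row, assumed

    rows_per_sec_disk = disk_throughput / row_size
    print("disk-bound scan: ~%dK rows/sec" % (rows_per_sec_disk / 1000))   # ~200K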

It is interesting to see that in terms of scan speed the difference is not so large for data in memory vs. data on disk, while access patterns which are more “random” – some form of index accesses, joins, etc. – may be slowed down 100-1000 times when going from a CPU-bound to an IO-bound workload.

You can get approximate numbers for other parts of query execution by running micro-benchmarks as well.

Running some benchmarks, you can also see how many rows the query can process per second in total. Let’s assume it is 100K rows per second when the data is on disk but the temporary table fits in memory and the sort happens in memory too. This is of course a simplification, as processing speed may well be non-linear depending on the data size, but it will do as a ballpark figure.

Having this data we can see the single-day report with 10,000 events per day is expected to take a quite nice 100ms, while at 1M rows a day even 30 days of data will take 300 seconds, which will not be acceptable for interactive reports.
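The arithmetic behind these figures, again as a quick sketch (the processing rate and data volumes are the assumptions above):

    # Estimated report times from the ballpark processing rate above.
    rows_per_sec = 100_000                 # assumed end-to-end processing rate

    small_report = 10_000                  # 10K events/day, one day of data
    large_report = 1_000_000 * 30          # 1M events/day, 30 days of data

    print("single-day report: %.0f ms" % (small_report / rows_per_sec * 1000))  # ~100 ms
    print("30-day report:     %.0f s"  % (large_report / rows_per_sec))         # ~300 s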

Finally, let me talk about modeling vs. benchmarking for capacity planning. I’m sure you need both, but at different stages.

Micro-benchmarks are very helpful for getting the numbers you can feed into your model. Using the model, we can get a quick feel for whether things are going to work or not. Finally, when a prototype or the complete application is built, good benchmarks are important to get exact numbers for the application and see if they match your model’s predictions. As a result of the comparison you can discover problems with the model (too bad) or problems with the implementation, when things just do not work as fast as they should, and you can often take some steps to fix them.

About Peter Zaitsev

Peter managed the High Performance Group within MySQL until 2006, when he founded Percona. Peter has a Master's Degree in Computer Science and is an expert in database kernels, computer hardware, and application scaling.

Comments

  1. pat says:

    There are some non-linear points to worry about too though, aren’t there? I’d expect to see non-linear timing changes when my data:

    1) grew large enough that it didn’t fit in buffer cache anymore
    2) grew large enough it didn’t fit in sort buffers anymore and had to spill to disk

    For that vast “middle” size of data you can get the more pernicious performance problems of “it was fine last week, and it’s only 10% larger this week, but the queries are all taking 4x as long”.

  2. peter says:

    Pat,

    There are always some surprises. If you got 10% more data and queries started taking 4x longer, there is a good chance something is wrong. Maybe the execution plan has changed?
    It is also however possible to have large drops in performance because of the working set. Consider, for example, a table with uniform random access: it fits in memory (100% hit ratio) vs. a 10% miss ratio if it grows 10%. You can access data in memory at a rate of 500,000 rows/sec, while if it is on disk it is 200 rows/sec – with such numbers a 10% miss rate will cause 500K -> 2K rows/sec, which is a 250 _times_ performance loss.
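
    To spell out that arithmetic (a quick sketch, with the rates above taken as given):

        # Weighted cost per row: most rows are served from memory, a few from disk.
        in_memory_rate = 500_000   # rows/sec when the row is cached in memory
        on_disk_rate   = 200       # rows/sec when the row has to come from disk
        miss_ratio     = 0.10      # 10% of accesses miss the cache

        avg_time_per_row = (1 - miss_ratio) / in_memory_rate + miss_ratio / on_disk_rate
        effective_rate   = 1 / avg_time_per_row
        print("effective rate: ~%.0f rows/sec" % effective_rate)        # ~2K rows/sec
        print("slowdown: ~%.0fx" % (in_memory_rate / effective_rate))   # ~250x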

  3. Brian says:

    You would be surprised how controversial “doing the math” is. To most, the database is supposed to magically compensate for whatever you throw at it, regardless of how it’s configured or how your data model is designed. Saying otherwise is, well, “being mean”.

  4. Apachez says:

    For example, the row count has been optimized in MyISAM to be a single stored value, so a table scan isn’t needed (compared to InnoDB) in order to find out how many rows the table contains.

    Has there been any talk of optimizing similar stuff for MySQL 6.0 or such (like a storage-independent aggregation)?

    If we take the example in this blog entry, it would be nice if MySQL did the aggregation on its own in the index, keeping the count of each unique page entry, so the select in the example would run in msecs instead of minutes or hours. Then I, as a user of a MySQL database, would only need to care about throwing in the data (like from the Apache log in real time). Compared to today, where I most likely would need to take care of the aggregation on my own with 2 inserts per log line instead of just 1 (one to the large log db and one to the aggregated db).

    If I’m not mistaken, MSSQL has done similar stuff for a while now.
