This topic may look boring and obvious but it is extremely important for MySQL Performance Optimization. In fact I probably have to touch it in every second MySQL Consulting work or even more frequently.
IO Bound workload is quite different from CPU bound one, which happens when your working set (normally only fraction of your database) fits in memory. What is fast when data is in memory can be extremely slow if it does not. For example if you have the query which analyzes 10000 rows it often would take fraction of the second with fully in memory workload, however if you would need to go to the disk, lets say even only in 10% of the cases and so perform 1000 possibly random reads you will have query taking at least 5-10 second, or more under the load which is already way more than you should target for web applications.
So designing your application think what kind of application are you designing ? Can you make it CPU bound/In memory ? If yes the whole class of the problems may not exist and you might be able to use solutions which are easier to implement. But beware if you design your application as CPU bound and when it scales so much you can’t afford to get enough memory any more, you might have very sharp performance drop and complex changes may be required to get your application back to speed.
The cases when going from CPU bound to IO bound hits the most is there a lot of rows are analyzed – count queries, group by, order by without indexes, search queries etc. Basically as soon as you have more than 100 rows analyzed by the query and these rows are “random access” to large tables (so chances they would require physical IO are high) I would highlight this query as possibly having performance problems.
Also do not look just at “typical” case – in many cases worst 5% would be responsible for majority of performance problems.
Let me illustrate it on simple case. Assuming you have Web application which has some form of messaging between users. You may want to display to the user number of unread messages as well as use total number of messages in the mailbox at least to draw “pager”. Easy solution is to do select count(*) from messages where user_id=134 or use SQL_CALC_FOUND_ROWS flag for your main select query. If you’re having CPU bound application it is as much as you might need to go especially if you have some form of caching on top of that. For IO bound application you however will run in trouble pretty soon even with 1000 messages in mailbox may start slowing things down.
Now you may get few percent of very active users who will get both extreme number of messages in their mailbox and will spend a lot of time on the site – generating much more load than average user, so their contribution to the load my be significant plus you do not want to piss of your most loyal users with slow page load times.
So for IO bound applications you will need to add counts for all messages, read messages etc, make sure they are updated (ie use triggers) and make sure all selects use index for ORDER BY .. LIMIT.
For IO bound applications Clustering (data locality) also becomes very important – if Innodb tables are used having simply auto_increment id on messages would likely be much slower than (user_id,sub_id) combined primary key as this one will cluster messages for same user_id and normally allow to fetch them all using only few physical IOs.
You may argue you still will have the problems with such design in CPU bound case – yes you will but it will happen with 100.000 messages not 100 messages as in IO bound case which is large enough for many application classes not to think about it.
Resources
RELATED POSTS