
PHP: Large result sets and summary tables


We’re working with a web site preparing for massive growth. To make sure it handles large data sets, part of the process is generating a test database of significant size: testing your application on a table with 1,000 rows may well give you a very dangerous false sense of security.

One of the processes the web site had was creating summary tables, which was done by executing a huge GROUP BY query, doing some work with the results, and then populating the tables. This all worked well for small tables… but not for larger ones.

The first problem was that the PHP script generating the table took 10GB of RAM and was swapping the development server, which had just 4GB of RAM (and plenty of swap space), like crazy. Why? Because by default mysql_query uses the mysql_store_result C library call and buffers the whole result set in the process memory. Not good if there are over 50 million rows. Note this is not controlled by the memory_limit PHP config variable, because that only controls memory which passes through PHP’s memory management, which does not apply to a MySQL result set.
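To illustrate, here is a minimal sketch of the problematic pattern (table and connection details are hypothetical; running it requires a live MySQL server):

```php
<?php
// Hypothetical example: with mysql_query() the entire result set is
// buffered in the client process (via mysql_store_result) before the
// first row is returned. For a 50-million-row table this alone can
// consume many gigabytes of RAM, and memory_limit will not stop it.
$link = mysql_connect('localhost', 'user', 'password');
mysql_select_db('test', $link);

// All rows land in client memory here, before the loop even starts.
$res = mysql_query('SELECT * FROM huge_table', $link);

while ($row = mysql_fetch_row($res)) {
    // process $row
}
mysql_free_result($res);
```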

OK, there is an “easy” fix for this problem: you can use mysql_unbuffered_query instead, and mysqli and PDO have their own ways to reach similar behavior. This call uses the underlying mysql_use_result API call, which does not store the whole result set in memory but instead streams it from the server, fetching in blocks. There are some limits, as you can’t use mysql_num_rows() and mysql_data_seek() with this method, but this is stated in the PHP manual and so easy to catch. There are, however, more differences which may cause breakage:
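The streaming version of the same loop looks almost identical (again a hypothetical table; a live server is needed to run it):

```php
<?php
// Hypothetical sketch: mysql_unbuffered_query() streams rows from the
// server (mysql_use_result underneath), so client memory use stays flat
// no matter how large the result set is.
$link = mysql_connect('localhost', 'user', 'password');
mysql_select_db('test', $link);

$res = mysql_unbuffered_query('SELECT * FROM huge_table', $link);

while ($row = mysql_fetch_row($res)) {
    // process $row; note mysql_num_rows() and mysql_data_seek()
    // are unavailable on an unbuffered result
}
mysql_free_result($res);
```

With mysqli the equivalent is passing MYSQLI_USE_RESULT to mysqli_query(); with PDO, setting the PDO::MYSQL_ATTR_USE_BUFFERED_QUERY attribute to false.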

  1. Table locks – Table locks are not cleared until you fetch the whole result set if you’re reading from tables directly (i.e., if you do not have “Using temporary” in EXPLAIN). This was not an issue in the given case, as the GROUP BY in question required a temporary table, plus it was a test system anyway. The workaround is to use the SQL_BUFFER_RESULT hint if you need to release table locks early. It comes at the cost of creating a temporary table, though, which can be quite high.
  2. Sharing a connection no longer works – With a buffered query you can use the same connection to run other queries, i.e. INSERTs and UPDATEs, while you traverse the data. Not with an unbuffered query, because the connection is still busy.
  3. More error checking is needed – With a buffered query the only call that can really return errors is mysql_query; mysql_fetch_row simply reads data from memory, so most applications do not bother to check for errors while fetching. With mysql_unbuffered_query the data comes in portions, so you may well get an error while fetching rows. If you do not check for it, it can look as if you’re done with the result set when you have only processed a portion of it, which can cause rather hard-to-catch errors.
  4. The connection can time out – If you do not fetch data for long enough, the MySQL server may think the client is dead and close the connection. This may well happen if you need long processing for each row or have long periodic data flushes, i.e. with multiple-value INSERTs etc. This can be fixed by increasing the net_write_timeout variable on the server so it gives you more time.
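Points 3 and 4 can be handled together; here is one hedged sketch (hypothetical table, assumes a live server, timeout value is just an example):

```php
<?php
// Hypothetical sketch for points 3 and 4: give the server more time
// between fetches, and verify the fetch loop ended because the result
// set was exhausted, not because the connection failed mid-stream.
$link = mysql_connect('localhost', 'user', 'password');
mysql_select_db('test', $link);

// Point 4: raise net_write_timeout (in seconds) for this session if
// per-row processing or periodic flushes are slow.
mysql_query('SET SESSION net_write_timeout = 600', $link);

$res = mysql_unbuffered_query('SELECT id, data FROM huge_table', $link);
while ($row = mysql_fetch_row($res)) {
    // slow per-row processing here
}

// Point 3: with an unbuffered query an error can surface during
// fetching; without this check a dropped connection looks exactly
// like a normal end of data.
if (mysql_errno($link)) {
    die('Fetch failed: ' . mysql_error($link));
}
mysql_free_result($res);
```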

But is this the only way?

Of course not. First, you should consider whether you need to do the processing in PHP at all. Many summary tables can be built with INSERT … SELECT or other purely SQL commands, and that can be much more efficient. Another alternative, of course, is to use MySQL stored procedures, which can be a good fit for such a simple job.
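For instance, a summary table can often be rebuilt entirely on the server with a single statement, never shipping the rows to PHP (table and column names here are made up for illustration):

```sql
-- Hypothetical summary table built purely in SQL, no client round-trips:
INSERT INTO daily_sales_summary (day, items, revenue)
SELECT DATE(order_time), COUNT(*), SUM(price)
FROM orders
GROUP BY DATE(order_time);
```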

The downside of using these techniques, of course, is that you have to have the summary tables and the original tables on the same server, which can limit your scalability. Using FEDERATED tables can work in some cases; in others a script can be more efficient, especially when multiple servers are involved and you want to do some parallel processing.

It is also a good question whether you need to query the whole result at once. It is rather efficient but can cause problems with table locks and other issues, plus if the script aborts it may be hard to restart. So it may be better to structure your queries to process data by certain objects (i.e. one city at a time), or to do INSERT … SELECT into a temporary table with an auto_increment column and fetch data from that table using auto_increment column ranges instead. I would especially recommend this last way for very long processes, i.e. if you need to check data against web services and so on – in this case the overhead of creating yet another temporary table is not so large.
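The auto_increment-range approach might be sketched like this (all names are hypothetical, a live server is assumed, and the chunk size is just an example); since each chunk is small, ordinary buffered queries are fine here, and a crashed script can restart from the last id it finished:

```php
<?php
// Hypothetical sketch: stage the heavy GROUP BY once, then process the
// staged rows in restartable chunks of 10,000 by primary key range.
$link = mysql_connect('localhost', 'user', 'password');
mysql_select_db('test', $link);

mysql_query('CREATE TABLE work_queue (
                 id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
                 city VARCHAR(64),
                 total BIGINT
             ) ENGINE=InnoDB', $link);
mysql_query('INSERT INTO work_queue (city, total)
             SELECT city, SUM(amount) FROM orders GROUP BY city', $link);

$chunk = 10000;
for ($start = 1; ; $start += $chunk) {
    $res = mysql_query(sprintf(
        'SELECT id, city, total FROM work_queue
          WHERE id >= %d AND id < %d', $start, $start + $chunk), $link);
    if (mysql_num_rows($res) == 0) {
        break;  // past the last id: a freshly filled table has no gaps
    }
    while ($row = mysql_fetch_row($res)) {
        // slow work per row: call a web service, write a summary row, etc.
    }
    mysql_free_result($res);
}
```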


Peter Zaitsev

Peter managed the High Performance Group within MySQL until 2006, when he founded Percona. Peter has a Master's Degree in Computer Science and is an expert in database kernels, computer hardware, and application scaling.




