We’re working with web site preparing for massive growth. To make sure it handles large data sets as part of the process we work on generation test database of significant size as testing your application on table with 1000 rows may well give you very dangerous false sense of security.
One of the process web site had was creating of summary tables which was done by executing some huge group by query, doing some stuff with results and then populating tables. This all worked well for small tables… but not for larger ones.
First problem was PHP script generating the table took 10GB of RAM and was swapping development server which had just 4GB of Ram (and plenty of swap space) like crazy. Why ? Because by default mysql_query uses mysql_store_result C library call and buffers all result set in the process memory. Not good if there are over 50 millions of rows. Note this limit is not controlled by memory_limit PHP config variable because that only controls memory which passes via PHP memory management which does not apply to MySQL result set.
OK there is “easy” fix for this problem, you can use mysql_unbuffered_query instead and mysqli and PDO have their own way to reach similar behavior. This call users underlying mysql_use_result API call which does not store all result set in memory but instead streams it from the server, fetching in blocks. There are some limits as you can’t use mysql_num_rows() and mysql_data_seek() if you use this method but this is told in PHP manual and so easy to catch. There are however more differences which may cause things breakage
But is this the only way ?
Of course not. First you should consider if you need to do processing in PHP at all. Many summary tables can be built by INSERT … SELECT, or some others purely SQL commands and it can be much more efficient. Another alternative is of course to use MySQL Stored Procedures which can be fit to do this simple job.
The downside of using these techniques if of course you’ve got to have summary tables and original tables on the same server which can limit your scalability. Using FEDERATED Tables can work for some cases in others script can be more efficient especially when multiple servers are involved and you want to do some parallel processing.
It is also good question if you need to query all result at once. It is rather efficient bur can cause problems with table locks and other issues plus if script aborts it may be hard to restart. So it may be better structuring your queries to process data by certain objects (ie City by one City at the time) or do INSERT … SELECT to the temporary table with auto_increment column and fetch data from this table using auto increment column ranges instead. I would especially recommend this last way for very long processes, ie if you need to check data against web services and so on – in this case the overhead of creating yet another temporary table is not so large.
Resources
RELATED POSTS