We’re working with web site preparing for massive growth. To make sure it handles large data sets as part of the process we work on generation test database of significant size as testing your application on table with 1000 rows may well give you very dangerous false sense of security.
One of the process web site had was creating of summary tables which was done by executing some huge group by query, doing some stuff with results and then populating tables. This all worked well for small tables… but not for larger ones.
First problem was PHP script generating the table took 10GB of RAM and was swapping development server which had just 4GB of Ram (and plenty of swap space) like crazy. Why ? Because by default mysql_query uses mysql_store_result C library call and buffers all result set in the process memory. Not good if there are over 50 millions of rows. Note this limit is not controlled by memory_limit PHP config variable because that only controls memory which passes via PHP memory management which does not apply to MySQL result set.
OK there is “easy” fix for this problem, you can use mysql_unbuffered_query instead and mysqli and PDO have their own way to reach similar behavior. This call users underlying mysql_use_result API call which does not store all result set in memory but instead streams it from the server, fetching in blocks. There are some limits as you can’t use mysql_num_rows() and mysql_data_seek() if you use this method but this is told in PHP manual and so easy to catch. There are however more differences which may cause things breakage
- Table Locks – Table locks are not cleared until you fetch whole result set if you’re reading from tables directly (if you do not have “using temporary” in EXPLAIN) this was not issue for given case as GROUP BY in question required temporary table plus it was test system anyway. The workaround for this one is to use SQL_BUFFER_RESULT hint if you need to release table locks early. It comes at cost of creating temporary table though which can be quite high.
- Sharing connection no more works If you use buffered query you can use same connection to run other queries, ie INSERTs and UPDATEs while you traverse with data. Not with unbuffered query because connection is still busy.
- Need more error checking If you use buffered query the only real call you can get errors is when you run mysql_query , mysql_fetch_row simply reads data from memory and so most applications do not care to check if there are any errors while fetching. With mysql_unbuffered_query data comes in portions so you can well get an error while fetching rows. If you do not check for error it can look as you’ve done with result set while you only processed a portion of it, which can cause rather hard to catch errors.
- Connection can timeout If you do not fetch data for long enough MySQL Server may think client is dead and close connection. This well may happen if you need long processing for each row or have long periodic data flushes, ie with multiple value INSERTs etc. This can be fixed by increasing net_write_timeout variable on the server so it gives you more time
But is this the only way ?
Of course not. First you should consider if you need to do processing in PHP at all. Many summary tables can be built by INSERT … SELECT, or some others purely SQL commands and it can be much more efficient. Another alternative is of course to use MySQL Stored Procedures which can be fit to do this simple job.
The downside of using these techniques if of course you’ve got to have summary tables and original tables on the same server which can limit your scalability. Using FEDERATED Tables can work for some cases in others script can be more efficient especially when multiple servers are involved and you want to do some parallel processing.
It is also good question if you need to query all result at once. It is rather efficient bur can cause problems with table locks and other issues plus if script aborts it may be hard to restart. So it may be better structuring your queries to process data by certain objects (ie City by one City at the time) or do INSERT … SELECT to the temporary table with auto_increment column and fetch data from this table using auto increment column ranges instead. I would especially recommend this last way for very long processes, ie if you need to check data against web services and so on – in this case the overhead of creating yet another temporary table is not so large.