September 22, 2014

PHP Large result sets and summary tables.

We’re working with web site preparing for massive growth. To make sure it handles large data sets as part of the process we work on generation test database of significant size as testing your application on table with 1000 rows may well give you very dangerous false sense of security.

One of the process web site had was creating of summary tables which was done by executing some huge group by query, doing some stuff with results and then populating tables. This all worked well for small tables… but not for larger ones.

First problem was PHP script generating the table took 10GB of RAM and was swapping development server which had just 4GB of Ram (and plenty of swap space) like crazy. Why ? Because by default mysql_query uses mysql_store_result C library call and buffers all result set in the process memory. Not good if there are over 50 millions of rows. Note this limit is not controlled by memory_limit PHP config variable because that only controls memory which passes via PHP memory management which does not apply to MySQL result set.

OK there is “easy” fix for this problem, you can use mysql_unbuffered_query instead and mysqli and PDO have their own way to reach similar behavior. This call users underlying mysql_use_result API call which does not store all result set in memory but instead streams it from the server, fetching in blocks. There are some limits as you can’t use mysql_num_rows() and mysql_data_seek() if you use this method but this is told in PHP manual and so easy to catch. There are however more differences which may cause things breakage

  1. Table Locks – Table locks are not cleared until you fetch whole result set if you’re reading from tables directly (if you do not have “using temporary” in EXPLAIN) this was not issue for given case as GROUP BY in question required temporary table plus it was test system anyway. The workaround for this one is to use SQL_BUFFER_RESULT hint if you need to release table locks early. It comes at cost of creating temporary table though which can be quite high.
  2. Sharing connection no more works If you use buffered query you can use same connection to run other queries, ie INSERTs and UPDATEs while you traverse with data. Not with unbuffered query because connection is still busy.
  3. Need more error checking If you use buffered query the only real call you can get errors is when you run mysql_query , mysql_fetch_row simply reads data from memory and so most applications do not care to check if there are any errors while fetching. With mysql_unbuffered_query data comes in portions so you can well get an error while fetching rows. If you do not check for error it can look as you’ve done with result set while you only processed a portion of it, which can cause rather hard to catch errors.
  4. Connection can timeout If you do not fetch data for long enough MySQL Server may think client is dead and close connection. This well may happen if you need long processing for each row or have long periodic data flushes, ie with multiple value INSERTs etc. This can be fixed by increasing net_write_timeout variable on the server so it gives you more time

But is this the only way ?

Of course not. First you should consider if you need to do processing in PHP at all. Many summary tables can be built by INSERT … SELECT, or some others purely SQL commands and it can be much more efficient. Another alternative is of course to use MySQL Stored Procedures which can be fit to do this simple job.

The downside of using these techniques if of course you’ve got to have summary tables and original tables on the same server which can limit your scalability. Using FEDERATED Tables can work for some cases in others script can be more efficient especially when multiple servers are involved and you want to do some parallel processing.

It is also good question if you need to query all result at once. It is rather efficient bur can cause problems with table locks and other issues plus if script aborts it may be hard to restart. So it may be better structuring your queries to process data by certain objects (ie City by one City at the time) or do INSERT … SELECT to the temporary table with auto_increment column and fetch data from this table using auto increment column ranges instead. I would especially recommend this last way for very long processes, ie if you need to check data against web services and so on – in this case the overhead of creating yet another temporary table is not so large.

About Peter Zaitsev

Peter managed the High Performance Group within MySQL until 2006, when he founded Percona. Peter has a Master's Degree in Computer Science and is an expert in database kernels, computer hardware, and application scaling.

Comments

  1. Harrison Fisk says:

    Good blog as always Peter. Just commenting cause I think you have slight error when you said that mysql_query uses the mysql_use_result C API. It actually uses the mysql_store_result, and hence takes a lot of memory.

  2. rockerBOO says:

    It is also good to note that INSERT INTO SELECT x FROM will not work from my experience on 2 servers with different hosts.

  3. Scott says:

    Yes, Harrison is right. That one jumped out at me as well. mysql_store_result() buffers the entire set so that you can use mysql_num_rows() and mysql_data_seek(). The rest of the gotchas mentioned do apply to mysql_use_result().

  4. peter says:

    Harrison, Scott,

    Yes that was a typo. I meant to later write about mysql_use_result gotchas so I mentioned it in the wrong place.
    Now fixed.

  5. peter says:

    rockerBOO,

    Yes sure INSERT SELECT does not work for 2 servers, this is why you can try using FEDERATED or use scripts all together.

  6. v says:

    Great tip. If I could combine that with my other half of the battle, perhaps I could solve the PHP timeout issue when dealing with exporting/importing giant datasets to the browser. There is PHP processing involved.

Speak Your Mind

*