Improve a query to select random rows, but without repeating records

  • Improve a query to select random rows, but without repeating records

    Dear Sirs,

    I have a table with nearly 1,000,000 rows, and I want to select random rows without repeating records.

    If I don't prevent repeated records, MySQL works fine. This is a PHP code fragment:


    $result = mysql_query("SELECT MIN(id) AS min, MAX(id) AS max FROM tabla");
    $rango = mysql_fetch_object($result);
    $rand_id = mt_rand($rango->min, $rango->max);
    $result = mysql_query("SELECT * FROM tabla WHERE id >= $rand_id LIMIT 0,1");


    This is so fast, MySQL rocks.

    To prevent repeating rows, I'm doing this:

    1) There's another column to mark whether a record was selected previously or not

    2) Select rows not marked as selected

    3) Get the record and mark it as selected

    4) Go to 1)

    Step 2) has a WHERE clause that slows the query to more than 10 seconds.
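
    Spelled out in the same style as the snippet above, the loop might look roughly like this (a minimal sketch: the original post only names the steps, so the UPDATE used for marking and the $crawler_id value are assumptions here):


    // Step 2: select one row that is not yet marked as selected (crawler = 0)
    $result = mysql_query("SELECT id, ip FROM ip_assigned WHERE crawler = 0 LIMIT 1");
    $row = mysql_fetch_object($result);

    // Step 3: mark the record as selected so it is never handed out again
    // ($crawler_id is a hypothetical variable holding the requesting crawler's id)
    mysql_query("UPDATE ip_assigned SET crawler = $crawler_id WHERE id = $row->id");

    // Step 4: go back to step 2 for the next row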

    This is the table definition:



    mysql> SHOW CREATE TABLE ip_assigned \G
    *************************** 1. row ***************************
           Table: ip_assigned
    Create Table: CREATE TABLE `ip_assigned` (
      `id` int(11) NOT NULL auto_increment,
      `ip` varchar(256) NOT NULL,
      `assigned_time` timestamp NOT NULL default '0000-00-00 00:00:00',
      `crawler` int(11) NOT NULL,
      PRIMARY KEY USING BTREE (`id`),
      KEY `index_ip` (`ip`(16)),
      KEY `index_crawler` (`crawler`,`ip`)
    ) ENGINE=MyISAM AUTO_INCREMENT=871539 DEFAULT CHARSET=utf8
    1 row in set (0.00 sec)

    mysql>



    This is the query to get records that are not yet marked:


    // Get COUNT(*) FROM cache, cache->total_ips
    $sql = "SELECT val AS total_ips FROM cache WHERE param = 'total_ips'";
    $result = mysql_query($sql);
    $cache = mysql_fetch_object($result);

    // Calculate a random offset for the LIMIT
    $offset = mt_rand(1, $cache->total_ips);

    // Get $total_requested_ips random IPs
    $sql = "SELECT ip FROM ip_assigned WHERE crawler = 0 LIMIT $offset,$total_requested_ips";
    $result = mysql_query($sql);


    This last query is very slow, because MySQL needs to scan a lot of rows. This is the output of the EXPLAIN command:


    mysql> DESCRIBE SELECT ip FROM ip_assigned WHERE crawler = 0 LIMIT 705607,10240;
    +----+-------------+-------------+------+---------------+---------------+---------+-------+--------+-------------+
    | id | select_type | table       | type | possible_keys | key           | key_len | ref   | rows   | Extra       |
    +----+-------------+-------------+------+---------------+---------------+---------+-------+--------+-------------+
    |  1 | SIMPLE      | ip_assigned | ref  | index_crawler | index_crawler | 4       | const | 813308 | Using index |
    +----+-------------+-------------+------+---------------+---------------+---------+-------+--------+-------------+
    1 row in set (0.00 sec)


    What can I do to improve this query? Or perhaps there is another method/idea to prevent repeating random records? Thanks!

    Regards,

  • #2
    Dear Sirs,

    This is the output of mysqldumpslow:


    ~# mysqldumpslow /var/log/mysql/mysql-slow.log
    Reading mysql slow query log from /var/log/mysql/mysql-slow.log
    Count: 3  Time=2.00s (6s)  Lock=0.00s (0s)  Rows=10240.0 (30720), root[root]@localhost
      SELECT ip FROM ip_assigned WHERE crawler = N LIMIT N,N


    Regards,



    • #3
      I was coming here looking for how to do the same thing (efficiently get random rows), and your idea looks very promising. As for avoiding duplicates, how about just keeping track of the 'done' IDs in PHP?



      $result = mysql_query("SELECT MIN(id) AS min, MAX(id) AS max FROM tabla");
      $rango = mysql_fetch_object($result);
      $done = array(0);
      for ($i = 0; $i < 10; $i++) {
          $rand_id = mt_rand($rango->min, $rango->max);
          $result = mysql_query("SELECT * FROM tabla WHERE id >= $rand_id AND id NOT IN (" . implode(',', $done) . ") LIMIT 0,1");
          $resobj = mysql_fetch_object($result);
          // do something with the object...
          $done[] = $resobj->id;
      }


      (Note the 0 in the definition of the array: it is there so the IN clause always has something in it, a value that will never match. You could instead make the IN clause appear only once something has actually been done, amongst other ways (sketched below), but I like this simple way, and it should not add to the query time?)

      -- note this is untested - about to go and try it!

      (edited as already spotted a mistake!)
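
      The "only add the IN clause once something has been done" variant mentioned above might look roughly like this (a minimal sketch, untested just like the original, reusing $rango from the snippet above):


      // Start with an empty array and only build the NOT IN clause
      // once at least one id has already been picked.
      $done = array();
      for ($i = 0; $i < 10; $i++) {
          $rand_id = mt_rand($rango->min, $rango->max);
          $not_in = count($done) ? " AND id NOT IN (" . implode(',', $done) . ")" : "";
          $result = mysql_query("SELECT * FROM tabla WHERE id >= $rand_id" . $not_in . " LIMIT 0,1");
          $resobj = mysql_fetch_object($result);
          $done[] = $resobj->id;
      }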



      • #4
        Hey,

        I too was faced with the same problem, with over 17,000 rows:
        I needed to return random rows without duplicates, but decided to use PHP's rand() once I had the array I needed.

        So I ended up using a LIMIT clause to grab the result set,
        then I used a NOT IN condition to avoid duplicates.


        $feed_array = $dbc->query("select m.market_feed_id from market_feed m WHERE m.market_feed_id not in (select p.processed_id from que_processed p) LIMIT 1000");
        if ($dbc->db_num_rows($feed_array) > 0) {
            $count = 0;
            $max_que = 1000;
            while ($Mvalues = $dbc->db_fetch_array($feed_array)) {
                $count++;
                // Insert into ads processed
                $insert_array = array('processed_id' => $Mvalues['market_feed_id'], 'processed_type' => 'market');
                $dbc->db_insert('que_processed', $insert_array);
                if ($count == $max_que) {
                    break;
                }
            }
        } else {
            echo 'NO MORE IDS TO GRAB!!!!!';
        }


        I have 2 tables:
        1) Where the data exists
        2) A holding table (que_processed)

        As you can see, I have selected only the IDs
        not in que_processed.

        So now I have 1000 fresh IDs to work with in an
        array, to do all types of nasty crap with!
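
        The "use PHP for rand()" part is not shown in the snippet above; a minimal sketch, assuming the fresh IDs are collected into a hypothetical $fresh_ids array while looping:


        $fresh_ids = array();
        // ...inside the while loop above, also collect each fresh ID:
        $fresh_ids[] = $Mvalues['market_feed_id'];

        // Later, do the random picking in PHP rather than in MySQL:
        shuffle($fresh_ids);                           // randomise the order once
        $random_ids = array_slice($fresh_ids, 0, 10);  // take however many you need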

        You should use a db_num_rows() > 0 check to see whether
        the above query still returns IDs or whether all of them
        have been processed already.

        Performance wise? Not sure yet, but the code runs
        very fast, maybe because I'm limiting to 1000 at a time?

        Have Fun
        Rick
        http://www.americashostingnetwork.com

