Announcement

Announcement Module
Collapse
No announcement yet.

Index usage with big IN()

Page Title Module
Move Remove Collapse
X
Conversation Detail Module
Collapse
  • Filter
  • Time
  • Show
Clear All
new posts

  • Index usage with big IN()

    Hi there,

    I have some troubles understanding what happens with one of my queries.

    Here is the table structures.
    There is a tiny dataset within (less than 5000 entries in each).


    CREATE TABLE `tracker_artifact` ( `id` int(11) NOT NULL auto_increment, `tracker_id` int(11) NOT NULL, `last_changeset_id` int(11) NOT NULL, `submitted_by` int(11) NOT NULL, `submitted_on` int(11) NOT NULL, `use_artifact_permissions` tinyint(1) NOT NULL default '0', PRIMARY KEY (`id`), KEY `idx_tracker_id` (`tracker_id`), KEY `idx_my` (`submitted_by`,`tracker_id`,`last_changeset_id`), KEY `idx_last_changeset_id` (`last_changeset_id`,`id`)) ENGINE=MyISAM AUTO_INCREMENT=2842 DEFAULT CHARSET=utf8;CREATE TABLE `tracker_changeset` ( `id` int(11) NOT NULL auto_increment, `artifact_id` int(11) NOT NULL, `submitted_by` int(11) default NULL, `submitted_on` int(11) NOT NULL, `email` varchar(255) default NULL, PRIMARY KEY (`id`), KEY `artifact_id` (`artifact_id`,`id`)) ENGINE=MyISAM AUTO_INCREMENT=3342 DEFAULT CHARSET=utf8;


    Then I have a query that looks like that:


    SELECT c.idFROM tracker_changeset AS c INNER JOIN tracker_artifact AS a ON (a.id = c.artifact_id)WHERE a.id IN (928,933,934,...) AND c.id IN (1398,1432,...);


    But the 2 ranges are quite big (~1900 values in each) and the query is taking a lot of time (8s) but I don't undestand why.
    Even explain is taking ages:

    +----+-------------+-------+-------+---------------------+-------------+---------+--------------+------+--------------------------+| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |+----+-------------+-------+-------+---------------------+-------------+---------+--------------+------+--------------------------+| 1 | SIMPLE | a | range | PRIMARY | PRIMARY | 4 | NULL | 1909 | Using where; Using index | | 1 | SIMPLE | c | ref | PRIMARY,artifact_id | artifact_id | 4 | a.id | 1 | Using where; Using index | +----+-------------+-------+-------+---------------------+-------------+---------+--------------+------+--------------------------+2 rows in set (3.33 sec)


    2 questions arise:
    -> What I'm doing wrong there
    -> What is the impact of huge IN() statements?

  • #2
    Hi,

    Query plan looks perfect but it's kinda estimation so we need to know what exactly the query is doing.
    Please provide output of below commands and some more information like MySQL version, my.cnf parameters etc.

    FLUSH STATUS;
    SELECT c.id FROM tracker_changeset AS c...
    SHOW STATUS;

    Profiling the query can also be helpful.

    SET PROFILING=1;
    SELECT c.id FROM tracker_changeset AS c...
    SHOW PROFILE FOR query 1;

    Comment


    • #3
      Hi Niljoshi, thanks for your answer
      here is the status (profiling not available on my old version of mysql)

      mysql> SHOW STATUS like 'handler_read%';+-----------------------+-------+| Variable_name | Value |+-----------------------+-------+| Handler_read_first | 0 || Handler_read_key | 22435 || Handler_read_next | 14418 || Handler_read_prev | 0 || Handler_read_rnd | 0 || Handler_read_rnd_next | 0 |+-----------------------+-------+6 rows in set (0.00 sec)


      BTW, in the meantime I ran an "optimize" on all tables involved and it basically solved the perf issue.

      Thus, I had to modify one of the index on tracker_changeset table (artifact_idx) that where on (artifact_id, id), id being the PK.
      Now the index is only on 'artifact_id' and the query is fast.

      I'm a bit confused because:
      - I through it was a good practice to add the PK as second index member
      - If I convert my tables to InnoDB, the query is fast.

      Comment


      • #4
        Generally large IN(...)'s aren't really that good since the only real option of execution is a range scan of an index (or the entire table). So if it is very large and will return a lot of records the performance might be less than desirable

        Not that I'm entirely sure it will change anything I would write the query this way to make it more explicit how it should be evaluated:

        SELECT c.idFROM tracker_changeset AS c INNER JOIN tracker_artifact AS a ON (a.id = c.artifact_id) AND a.id IN (928,933,934,...)WHERE c.id IN (1398,1432,...);

        Notice the move of the condition from the WHERE to the join condition, this is possible since there's an AND between the conditions in the WHERE. Makes it easier to read and the optimizer can be sure that this condition is only applicable to the tracker_artifact table.

        vaceletm wrote on Tue, 17 July 2012 16:20
        Thus, I had to modify one of the index on tracker_changeset table (artifact_idx) that where on (artifact_id, id), id being the PK.
        Now the index is only on 'artifact_id' and the query is fast.

        I'm a bit confused because:
        - I through it was a good practice to add the PK as second index member
        - If I convert my tables to InnoDB, the query is fast.
        Having the two columns part of the WHERE and the JOIN condition in the same index should be good practice. But by having that index the optimizer might choose the join order that you have now with a->c but a join order with c->a might be better depending on how many rows the first range scan returns.

        So in your case the index (id, artifact_id) might have been better than (artifact_id, id) which you had.

        The reason why InnoDB behaves differently is that the data in InnoDB is that all data is physically stored in a balanced tree on the primary key. Which means that any query that searching for a record on primary key will be able to return all data for that record immediately since it is readily available. Which means an implicit equivalent to (id, artifact_id).

        But this is just me theorizing so you will have to compare the execution plans and test it out.

        Good luck!

        Comment

        Working...
        X