GET 24/7 LIVE HELP NOW

Announcement

Announcement Module
Collapse
No announcement yet.

join / subquery / derived table performance

Page Title Module
Move Remove Collapse
X
Conversation Detail Module
Collapse
  • Filter
  • Time
  • Show
Clear All
new posts

  • join / subquery / derived table performance

    hi all,

    i noticed some strange behaviour in the use of mysql. currently using 5.1.37-enterprise-gpl-advanced.

    It started with the following query:

    select d.Date, sum(r.FctA) a, sum(r.FctB) b
    from rpo_fct_x x
    join rpo_dim_date d on x.SID_DateId = d.SID_DateId
    where d.Date > adddate(sysdate(), interval -2 day)
    and d.Date < sysdate()
    group by d.Date

    explain:


    +----+-------------+-------+--------+----------------------- --+---------+---------+---------------------------------+--- -------+---------------------------------+
    | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
    +----+-------------+-------+--------+----------------------- --+---------+---------+---------------------------------+--- -------+---------------------------------+
    | 1 | SIMPLE | r | ALL | date_country,SID_DateId | NULL | NULL | NULL | 13860580 | Using temporary; Using filesort |
    | 1 | SIMPLE | d | eq_ref | PRIMARY | PRIMARY | 4 | dwh_rep_olap.r.SID_DateId | 1 | Using where |
    +----+-------------+-------+--------+----------------------- --+---------+---------+---------------------------------+--- -------+---------------------------------+
    2 rows in set (0.00 sec)

    mysql> show keys from rpo_fct_registrations;
    +-----------------------+------------+---------------------- --------------------------+--------------+------------------ -------+-----------+-------------+----------+--------+------ +------------+---------+
    | Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
    +-------+------------+-------------------------------------- ----------+--------------+-------------------------+-------- ---+-------------+----------+--------+------+------------+-- -------+
    | r | 1 | date_country | 1 | SID_DateId | A | 2047 | NULL | NULL | | BTREE | |
    | r | 1 | date_country | 2 | SID_CountryId | A | 28876 | NULL | NULL | | BTREE | |
    | r | 1 | SID_DateId | 1 | SID_DateId | A | 2047 | NULL | NULL | | BTREE | |
    +-----------------------+------------+---------------------- --------------------------+--------------+------------------ -------+-----------+-------------+----------+--------+------ +------------+---------+
    7 rows in set (0.00 sec)

    I tried then to play around with the various parts of the query:

    mysql> select count(*) from r where SID_DateId in (select SID_DateId from d where Date>adddate(sysdate() , interval -2 day) and Date < sysdate());
    +----------+
    | count(*) |
    +----------+
    | 13858 |
    +----------+
    1 row in set (1 min 24.29 sec)

    far too long for this query...

    +----+--------------------+----------+-----------------+---- -----------+------------+---------+------+----------+------- -------------------+
    | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
    +----+--------------------+----------+-----------------+---- -----------+------------+---------+------+----------+------- -------------------+
    | 1 | PRIMARY | r | index | NULL | SID_DateId | 4 | NULL | 13860580 | Using where; Using index |
    | 2 | DEPENDENT SUBQUERY | d | unique_subquery | PRIMARY | PRIMARY | 4 | func | 1 | Using where |
    +----+--------------------+----------+-----------------+---- -----------+------------+---------+------+----------+------- -------------------+


    then i put the two values into the in clause which would be the result of the in:

    mysql> select count(*) from r where SID_DateId in (2091,2090);
    +----------+
    | count(*) |
    +----------+
    | 13858 |
    +----------+
    1 row in set (0.00 sec)

    +----+-------------+--------+-------+----------------------- --+--------------+---------+------+-------+----------------- ---------+
    | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
    +----+-------------+--------+-------+----------------------- --+--------------+---------+------+-------+----------------- ---------+
    | 1 | SIMPLE | r | range | date_country,SID_DateId | date_country | 4 | NULL | 24306 | Using where; Using index |
    +----+-------------+--------+-------+----------------------- --+--------------+---------+------+-------+----------------- ---------+

    Why does the subquery takes that much time against the two values in the in clause?

    i've rewritten then the query to use a derived table:

    select count(*) from r
    join (select SID_DateId from d where Date>adddate(sysdate(), interval -2 day ) and Date < sysdate()) d on d.SID_DateId = r.SID_DateId

    +----+-------------+--------------+------+------------------ -------+------------+---------+--------------+-------+------ -------+
    | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
    +----+-------------+--------------+------+------------------ -------+------------+---------+--------------+-------+------ -------+
    | 1 | PRIMARY | | ALL | NULL | NULL | NULL | NULL | 2 | |
    | 1 | PRIMARY | r | ref | date_country,SID_DateId | SID_DateId | 4 | d.SID_DateId | 6768 | Using index |
    | 2 | DERIVED | d | ALL | NULL | NULL | NULL | NULL | 13150 | Using where |
    +----+-------------+--------------+------+------------------ -------+------------+---------+--------------+-------+------ -------+


    +----------+
    | count(*) |
    +----------+
    | 13858 |
    +----------+
    1 row in set (0.04 sec)


    Again an acceptable performance.

    Does anyone has a tip what went wrong on the initial query and how to speed it up? Problem is I can't work with much query changes, as the queries are build automatically by an reporting tool where the influence possibilities on the queries are very less.. we're using mysql in a data warehouse environement. Storage engine is MyISAM


    Thanks
    Torsten

  • #2
    I don't see any evidence that you have an index on the Date column in your date dimensions table. If you had this index, and you had more than a few days in that table, your original query would have started by applying the restriction to the dimension table, and then joining into your fact table.

    When using a subquery, you usually want to make sure explain regards this as a derived table, and not a dependent subquery. A dependent subquery evaluates the inner query for every row returned by the outer query. Since your outer query returns 13M rows, MySQL is executing your inner query 13M times. That takes some time. Instead of using IN, try re-writing that query as a JOIN and make sure if you use the same tables or field names that they are aliased properly so that there can be no confusion about which query (inner or outer) a column or table belongs to. So you shouldn't use just SID_DateID, instead you should be using d.SID_DateID or r.SID_DateID where appropriate.

    Comment


    • #3
      there is an index on Date:


      mysql> show keys from rpo_dim_date;
      +--------------+------------+----------+--------------+----- --------+-----------+-------------+----------+--------+----- -+------------+---------+
      | Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
      +--------------+------------+----------+--------------+----- --------+-----------+-------------+----------+--------+----- -+------------+---------+
      | rpo_dim_date | 0 | PRIMARY | 1 | SID_DateId | A | 13150 | NULL | NULL | | BTREE | |
      | rpo_dim_date | 0 | Date | 1 | Date | A | 13150 | NULL | NULL | | BTREE | |
      | rpo_dim_date | 1 | DateId | 1 | DateId | A | 13150 | NULL | NULL | | BTREE | |
      | rpo_dim_date | 1 | YYYYMMDD | 1 | YYYYMMDD | A | 13150 | NULL | NULL | | BTREE | |
      +--------------+------------+----------+--------------+----- --------+-----------+-------------+----------+--------+----- -+------------+---------+

      but it won't help - and full join doesn't work either... and in the initial query do you think there is something missing? for me logical the query looks fine...!?

      Comment


      • #4
        What does this say:

        EXPLAIN SELECT COUNT(*) FROM rpo_dim_date WHERE Date > ADDDATE(SYSDATE(), INTERVAL -2 DAY) and Date < SYSDATE();

        Comment


        • #5
          That plan was bad..... Problem was - too obvious - that Date is a Date field and Sysdate is a Datetime value...
          So no index was used..

          mysql> EXPLAIN SELECT COUNT(*) FROM rpo_dim_date WHERE Date > ADDDATE(SYSDATE(), INTERVAL -2 DAY) and Date < CURDATE();
          +----+-------------+--------------+-------+---------------+- -----+---------+------+------+--------------------------+
          | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
          +----+-------------+--------------+-------+---------------+- -----+---------+------+------+--------------------------+| 1 | SIMPLE | rpo_dim_date | range | Date | Date | 3 | NULL | 2214 | Using where; Using index |
          +----+-------------+--------------+-------+---------------+- -----+---------+------+------+--------------------------+


          After changing sysdate into curdate it runs perfectly..

          thanks a lot
          Torsten

          Comment


          • #6
            In my use cases curdate() (and really any non-deterministic functions around date & time) were slower. This was very noticeable in tables that had a number of rows (row cnt > 1M). We put today's date value into a variable that gets substituted & run with a date.

            select count(*) from r
            join (select SID_DateId from d where Date>adddate(sysdate(), interval -2 day ) and Date < sysdate()) d on d.SID_DateId = r.SID_DateId

            INTO
            SELECT COUNT(*) FROM r
            JOIN (SELECT SID_DateId FROM d WHERE DATE>addate("2009-11-03", interval -2 day ) and Date < "2009-11-03") d ON d.SID_DateId = r.SID_DateId;

            In the app (perl):
            SELECT COUNT(*) FROM r
            JOIN (SELECT SID_DateId FROM d WHERE DATE>addate(\"$TODAY\",interval -2 day ) and Date < \"$TODAY\") ON d.SID_DateId = r.SID_DateId;

            (I also make the interval a variable ....)

            The field named `Date` would cause a red flag for me, too close to a reserved word. What sort of date data is it? But that's another issue.

            Hope this helps!
            erin

            Comment

            Working...
            X