GET 24/7 LIVE HELP NOW

Announcement

Announcement Module
Collapse
No announcement yet.

Very poor performance with select count(*)

Page Title Module
Move Remove Collapse
X
Conversation Detail Module
Collapse
  • Filter
  • Time
  • Show
Clear All
new posts

  • Very poor performance with select count(*)

    Hello, this is my first post )


    I'm developing a log analysis application and I'm having a performance problem that I was hoping that someome could help (thanks in advance)

    I have table QUERY that has almost 3 million rows. I also have a table Location containing rows for most cities of the world. The QUERY table has a LID (location id) that matches a specific location).

    Although the complete query is not this simple, I've discovered where the performance problem arises.

    The query is as simple as this:

    SELECT QUERY, COUNT(*) AS FREQUENCY
    FROM QUERY
    where lid=2703
    GROUP BY QUERY ORDER BY FREQUENCY DESC LIMIT 10


    When I run this simple query with other lid values it is quite fast, but I discovered that a very large percentage of the queries has lid=2703 (over 1.8 million in almost 3 million). When I run the query, the results take over 200sec to be processed.


    This is the table:

    CREATE TABLE QUERY (
    QID INT UNSIGNED NOT NULL, (query id)
    SID INT UNSIGNED NOT NULL, (session id)
    QUERY VARCHAR(255) NOT NULL,
    LID INT UNSIGNED, (location id)
    PRIMARY KEY(QID)
    )ENGINE = MYISAM;


    SHOW INDEXES FROM QUERY

    table |Non_unique |key_name |seq_in_index |Column_name |Collation |Cardinality |Sub_part |Packed |Null |Index_type

    'query', 0, 'PRIMARY', 1, 'QID', 'A', 2387075, , '', '', 'BTREE', ''
    'query', 1, 'QUERYSID_NDX', 1, 'SID', 'A', 596768, , '', '', 'BTREE', ''
    'query', 1, 'QUERYLID_NDX', 1, 'LID', 'A', 132615, , '', 'YES', 'BTREE', ''
    'query', 1, 'QUERYQ_NDX', 1, 'QUERY', 'A', 341010, , '', '', 'BTREE', ''
    'query', 1, 'QUERY_NDX_FULL', 1, 'QUERY', '', 140416, , '', '', 'FULLTEXT', ''

    (At first I had only the full index for column query, then I added the "normal" B-tree index)


    -----EXPLAIN

    1, 'SIMPLE', 'QUERY', 'index', 'QUERYLID_NDX', 'QUERYQ_NDX', '767', '', 89, 'Using where; Using temporary; Using filesort'

    How can I display the top 10 queries for some location OR many locations, with good performance?


    Could you please help?
    I know the word "urgent" is very used by this is the case...
    Thx

  • #2
    have you tried using a multiple-column index on lid,query?

    try:

    alter table frequency add index(lid,query);

    then:

    select query,count(query) as frequency
    from query
    where lid=2703 group by query order by frequency desc limit 10

    Comment


    • #3
      Thank you very much for both your quick response and proposed solution!


      It actually improved the performance!

      I've used the code you provided:

      alter table query add index(lid,query);

      NOTE: for understanding the EXPLAIN that follows, the name of the new multiple index was created with the dedault name LID.


      SELECT QUERY, COUNT(*) AS FREQUENCY
      FROM LOCATION
      JOIN QUERY ON QUERY.LID=LOCATION.LID
      WHERE city='Mountain View'
      GROUP BY QUERY ORDER BY FREQUENCY DESC LIMIT 10

      takes 85.5sec

      When I run the this query I realize, using Explain, that the optimizer doesn't choose the new LID index.

      --Explain
      1, 'SIMPLE', 'LOCATION', 'ref', 'PRIMARY,LOCATION_CITY_NDX', 'LOCATION_CITY_NDX', '153', 'const', 1, 'Using where; Using temporary; Using filesort'

      1, 'SIMPLE', 'QUERY', 'ref', 'QUERYLID_NDX,LID', 'QUERYLID_NDX', '5', 'logsbnfull.LOCATION.LID', 18, 'Using where'

      NOTE: The value referred in the previous post (2703) is from city Mountain View.


      So I hinted the optimizer to choose the LID index


      SELECT QUERY, COUNT(*) AS FREQUENCY
      FROM LOCATION
      JOIN QUERY USE INDEX (LID) ON QUERY.LID=LOCATION.LID
      WHERE city='Mountain View'
      GROUP BY QUERY ORDER BY FREQUENCY DESC LIMIT 10

      takes 50.28sec

      -- EXPLAIN
      1, 'SIMPLE', 'LOCATION', 'ref', 'PRIMARY,LOCATION_CITY_NDX', 'LOCATION_CITY_NDX', '153', 'const', 1, 'Using where; Using temporary; Using filesort'

      1, 'SIMPLE', 'QUERY', 'ref', 'LID', 'LID', '5', 'logsbnfull.LOCATION.LID', 18, 'Using where; Using index'



      Can I do anything more to improve performance (since queries like these are only a small part of a results page and summing many times like these will be slow).

      Thanks again for your great reply! )

      Nelson

      Comment


      • #4
        try doing a count(query) instead of a count(*)

        if your table doesn't change often, have you considered using a query_cache?

        try:
        show variables like 'query_cache%'

        if you're feeling ambitious, maybe a redesign or additional table could be in order; if you only care about aggregation why not store the count as a column that gets incremented when you execute a query? This would give you instant aggregation results, and if this table was merely added to your current schema, it shouldn't affect performance.

        CREATE TABLE QUERY_count (
        QUERY VARCHAR(255) NOT NULL primary key unique,
        query_count int unsigned not null default '0',
        LID INT UNSIGNED,
        index(query_count),
        index(LID)
        )ENGINE = MYISAM;

        insert ignore into query_count (query) vlaues('select');
        update QUERY_count set query_count = query_count + 1 where query='select';

        Comment


        • #5
          YOU REALLY LIKE CAPS LOCK )

          >> try doing a count(query) instead of a count(*)
          bollocks, since query is NOT NULL, it is exactly the same query.

          Try this:

          SELECT QUERY, COUNT(*) AS FREQUENCY
          FROM QUERY
          WHERE QUERY.lid=(SELECT lid FROM LOCATION WHERE city='Mountain View')
          GROUP BY QUERY ORDER BY FREQUENCY DESC LIMIT 10

          And include HAVING COUNT(*) > 1 if there are many COUNT(*)'s equal to 1 that are guaranteed to never be in the output.

          Do you really need VARCHAR(255)? Shorter length will speed up your query.

          Comment


          • #6
            gmouse > why would a shorter varchar speed up the query? I always thought that the length was just a semantic limit on length?

            Comment


            • #7
              c113345@tyldd.com wrote on Mon, 24 August 2009 17:49

              gmouse > why would a shorter varchar speed up the query? I always thought that the length was just a semantic limit on length?

              It will not speed up my query if it is properly indexed, but it will speed up yours. Google for Generosity Can Be Unwise

              Comment


              • #8
                Thank you for your help! I'm still having performance issues...

                I have:
                -a table with user queries (QUERY)
                -a table with locations (LOCATIONS)
                -a table that relates locations and time with queries (LOC_TIME_QUERIES)

                Among many other things, I need to create a frequency chart based on the three levels (query, time, location). For every point of the chart I'm using the following query to get the frequency:

                SELECT COUNT(QUERY) AS FREQUENCY
                FROM LOC_TIME_QUERY
                JOIN LOCATION ON LOCATION.LID=LOC_TIME_QUERY.LID
                JOIN QUERY ON QUERY.QID=LOC_TIME_QUERY.QID
                WHERE MATCH (QUERY) AGAINST ('music' IN BOOLEAN MODE)
                AND TIME>= '20081201000000' AND TIME < ' 20081203000000'


                It is slow. So I decided to check a lighter version considering only the query and the time:

                SELECT COUNT(*) AS FREQUENCY
                FROM LOC_TIME_QUERY
                JOIN QUERY ON QUERY.QID=LOC_TIME_QUERY.QID
                WHERE MATCH (QUERY) AGAINST ('portugal' IN BOOLEAN MODE)
                and time >= '20081201000000' and time < ' 20081203000000'

                it takes several seconds (10sec or so) to retrieve the results.
                Considering that this is to be repeated for all the point of the chart...it takes quite a while.

                NOTE: When separated, they are fast. That is, if I only ask for the query results that match it is fast....
                if I only ask for the time results it is fast....
                but not when joined!!!
                I know that the query part is performed first... but I see that afterwards the time is not using the index


                ---EXPLAIN

                ID|SELECT TYPE|TABLE|TYPE|POSSIBLE_KEYS|KEY|KEY LEN|REF|ROWS|EXTRA
                1, 'SIMPLE', 'QUERY', 'fulltext', 'PRIMARY,QUERY_NDX_FULL', 'QUERY_NDX_FULL', '0', '', 1, 'Using where'

                1, 'SIMPLE', 'LOC_TIME_QUERY', 'ref', 'PRIMARY,LTQ_TIME_NDX,LTQ_QID_NDX,LTQ_TIMELID_NDX, LTQ_QIDTIM E_NDX', 'LTQ_QID_NDX', '4', 'logsbnfull.QUERY.QID', 1, 'Using where'

                --------

                CREATE TABLE LOC_TIME_QUERY (
                QID INT UNSIGNED NOT NULL,
                TIME DATETIME,
                LID INT UNSIGNED,
                PRIMARY KEY (QID, TIME)
                )ENGINE = MYISAM;


                CREATE INDEX LTQ_TIME_NDX USING BTREE ON LOC_TIME_QUERY(TIME);
                CREATE INDEX LTQ_LID_NDX USING BTREE ON LOC_TIME_QUERY(LID);
                CREATE INDEX LTQ_QID_NDX USING BTREE ON LOC_TIME_QUERY(QID);


                I've tried other combinations of indexes...

                Can I have your hints about improving the performance when considering the relation of query, time and location?

                Thx

                Nelson

                Comment


                • #9
                  Try a multi-column index on (QID,TIME,LID) in the table LOC_TIME_QUERIES.

                  And make sure LOCATIONS.LID is indexed (it is probably a primary key, which is fine).

                  Comment


                  • #10
                    I've tried it but the optimizer chooses the index for the QID alone. Even in the case of the second query I presented (the one with only query and time) it doesn't choose the primary key, which is (qid,time).

                    I've tried to hint the optimizer to use other keys but doesn't get any better.

                    While running, I can't find a way to take advantage of the time indexation....

                    Thanks

                    Nelson

                    Comment


                    • #11
                      Full explain output?

                      And try IGNORE INDEX / FORCE INDEX.

                      Comment


                      • #12
                        As I stated earlier, to simplify, right now I will only consider query and time.


                        SELECT COUNT(*) AS FREQUENCY
                        FROM LOC_TIME_QUERY2
                        JOIN QUERY ON QUERY.QID=LOC_TIME_QUERY2.QID
                        WHERE MATCH (QUERY) AGAINST ('portugal' IN BOOLEAN MODE)
                        and time >= '20081202000000' and time < ' 20081202210000'

                        ----> 1891
                        4,8sec

                        ID|SELECT_TYPE|TABLE|TYPE|POSSIBLE_KEYS|KEY|KEY_LE N|REF|ROWS |EXTRTA
                        1, 'SIMPLE', 'QUERY', 'fulltext', 'PRIMARY,QUERY_NDX_FULL', 'QUERY_NDX_FULL', '0', '', 1, 'Using where'
                        1, 'SIMPLE', 'LOC_TIME_QUERY2', 'ref', 'PRIMARY,LTQ2_TIME_NDX,LTQ2_QID_NDX,QIDTIMELID_NDX ', 'LTQ2_QID_NDX', '4', 'logsbnfull.QUERY.QID', 1, 'Using where'



                        The primary key is (qid,time). But I also indexed the following:


                        SHOW INDEXES FROM LOC_TIME_QUERY2


                        TABLE|NON_UNIQUE|KEY_NAME|SEQ_IN_INDEX|COLUMN_NAME |COLLATION |CARINALITY|SUB_PART|PACKED
                        'loc_time_query2', 0, 'PRIMARY', 1, 'QID', 'A', , , '', '', 'BTREE', ''
                        'loc_time_query2', 0, 'PRIMARY', 2, 'TIME', 'A', 2387075, , '', '', 'BTREE', ''
                        'loc_time_query2', 1, 'LTQ2_TIME_NDX', 1, 'TIME', 'A', 2387075, , '', '', 'BTREE', ''
                        'loc_time_query2', 1, 'LTQ2_LID_NDX', 1, 'LID', 'A', 132615, , '', 'YES', 'BTREE', ''
                        'loc_time_query2', 1, 'LTQ2_QID_NDX', 1, 'QID', 'A', 2387075, , '', '', 'BTREE', ''
                        'loc_time_query2', 1, 'QIDTIMELID_NDX', 1, 'QID', 'A', 2387075, , '', '', 'BTREE', ''
                        'loc_time_query2', 1, 'QIDTIMELID_NDX', 2, 'TIME', 'A', 2387075, , '', '', 'BTREE', ''
                        'loc_time_query2', 1, 'QIDTIMELID_NDX', 3, 'LID', 'A', 2387075, , '', 'YES', 'BTREE', ''




                        If I use

                        ...
                        FROM LOC_TIME_QUERY2 use index (primary)
                        ...

                        or if

                        ...
                        FROM LOC_TIME_QUERY2 force index (primary)
                        ...

                        I doesn't get better...or significantly better. 4.5sec


                        Explain (with force index):

                        1, 'SIMPLE', 'QUERY', 'fulltext', 'PRIMARY,QUERY_NDX_FULL', 'QUERY_NDX_FULL', '0', '', 1, 'Using where'
                        1, 'SIMPLE', 'LOC_TIME_QUERY2', 'ref', 'PRIMARY', 'PRIMARY', '4', 'logsbnfull.QUERY.QID', 23870, 'Using where; Using index'


                        -----------

                        When I try:
                        ...
                        FROM LOC_TIME_QUERY2 ignore index (LTQ2_QID_NDX)
                        ...


                        the index the optimizer chooses the (qid,time,lid) index...which looks appropriate, but it gets worse: (5.1s)

                        EXPLAIN (with ignore index):

                        1, 'SIMPLE', 'QUERY', 'fulltext', 'PRIMARY,QUERY_NDX_FULL', 'QUERY_NDX_FULL', '0', '', 1, 'Using where'
                        1, 'SIMPLE', 'LOC_TIME_QUERY2', 'ref', 'PRIMARY,LTQ2_TIME_NDX,QIDTIMELID_NDX', 'QIDTIMELID_NDX', '4', 'logsbnfull.QUERY.QID', 1, 'Using where; Using index'




                        Your atention to this problem of mine is very appreciated. Thanks.
                        Nelson

                        Comment


                        • #13
                          The (qid,time,lid) will probably be better for the full query. Your PK suffices for the light query. Check your full query after all optimization is done; if (qid,time,lid) turns out to be not so much faster (or even slower) than the PK-thing, just drop (qid,time,lid).


                          I doubt the join is your problem. How fast is this query?

                          SELECT COUNT(*) AS FREQUENCY
                          FROM LOC_TIME_QUERY2
                          WHERE MATCH (QUERY) AGAINST ('portugal' IN BOOLEAN MODE)

                          And this one

                          SELECT QID
                          FROM LOC_TIME_QUERY2
                          WHERE MATCH (QUERY) AGAINST ('portugal' IN BOOLEAN MODE)

                          Comment

                          Working...
                          X