GET 24/7 LIVE HELP NOW

Announcement

Announcement Module
Collapse
No announcement yet.

$100 Paypal if you can optimize this query

Page Title Module
Move Remove Collapse
X
Conversation Detail Module
Collapse
  • Filter
  • Time
  • Show
Clear All
new posts

  • $100 Paypal if you can optimize this query

    I'm trying to optimize a query for my forums. The query is related to "my posts" functionality where a user can see a list of all the threads they've posted in, ordered by most recent activity. Here is a stripped down version of the query:


    mysql> EXPLAIN SELECT SQL_NO_CACHE SQL_CALC_FOUND_ROWS DISTINCT Main.B_Number -> FROM BB_Posts AS Main, BB_Posts AS Reply -> WHERE Main.B_Number = Reply.B_Main -> AND Reply.B_PosterId = 654 -> AND Reply.B_Board IN ('1', '2', '3') -> ORDER BY Main.B_Last_Post DESC;+----+-------------+-------+--------+---------------+---------+---------+-------------------------+------+----------------------------------------------+| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |+----+-------------+-------+--------+---------------+---------+---------+-------------------------+------+----------------------------------------------+| 1 | SIMPLE | Reply | ref | ID_ndx | ID_ndx | 4 | const | 7872 | Using where; Using temporary; Using filesort | | 1 | SIMPLE | Main | eq_ref | PRIMARY | PRIMARY | 4 | WWWThreads.Reply.B_Main | 1 | | +----+-------------+-------+--------+---------------+---------+---------+-------------------------+------+----------------------------------------------+


    Here's a dump of a small subset of the table: example.sql
    (P.S. I know the structure is kind of ugly eg. all the enums and some of the indexes, it's due to attempts at backwards-compatible optimization.)

    I'll pay $100 via Paypal if someone can suggest an alternate query or an index which gets rid of the temporary + filesort without affecting the results.

    Thanks in advance for any help, this is driving me crazy. confused:

    Regards,

    -Y

  • #2
    You might try something along these lines:

    SELECT Main.B_NumberFROM BB_Posts AS MainWHERE Main.B_Number IN ( SELECT Reply.Number FROM BB_Posts AS Reply WHERE Reply.B_PosterId = 654 AND Reply.B_Board IN ('1', '2', '3'))ORDER BY Main.B_Last_Post DESC;

    Comment


    • #3
      Hi James,

      Thank you so much for your prompt response! I really appreciate your taking the time to look at this for me. ) I actually did try a similar approach using subqueries... although it eliminates the temporary table, it must perform a filesort on the entire posts table which unfortunately is significantly slower. For reference, the original query usually takes between 2 - 4 seconds. The new query hangs on "preparing" for 30+ seconds and I have to kill it before it completes.


      +----+--------------------+-------+-----------------+---------------+---------+---------+------+---------+-----------------------------+| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |+----+--------------------+-------+-----------------+---------------+---------+---------+------+---------+-----------------------------+| 1 | PRIMARY | Main | ALL | NULL | NULL | NULL | NULL | 5079927 | Using where; Using filesort | | 2 | DEPENDENT SUBQUERY | Reply | unique_subquery | PRIMARY | PRIMARY | 4 | func | 1 | Using where | +----+--------------------+-------+-----------------+---------------+---------+---------+------+---------+-----------------------------+

      I messed around with different index setups, but even using FORCE INDEX the main table always examines all 5 million rows. I may be missing something, since I don't have a lot of experience with subqueries (we used mySQL 3 until recently). However, at least as written, this approach seems too slow to be practical.

      Thanks again for your reply though!

      Regards,

      -Y

      Comment


      • #4
        Well, my query was wrong anyway, it should have been:


        SELECT Main.B_NumberFROM BB_Posts AS MainWHERE Main.B_Number IN ( SELECT DISTINCT Reply.B_Main FROM BB_Posts AS Reply WHERE Reply.B_PosterId = 654 AND Reply.B_Board IN ('1', '2', '3'))ORDER BY Main.B_Last_Post DESC;


        I don't think that will help with your filesort, though.

        I'll think about it, though.

        Comment


        • #5
          Here is a solution:

          Create a new key:

          alter table BB_Posts add index new_index3 (B_Last_Post, B_PosterId, B_Board);


          Now slightly modify your query:

          SELECT BB_Posts.B_Number FROM BB_Posts FORCE INDEX (new_index3) WHERE BB_Posts.B_Last_Post = 1 and BB_Posts.B_PosterId = 654 and BB_Posts.B_Board in (1,2,3) ORDER BY BB_Posts.B_Last_Post desc;


          Note the 'FORCE INDEX'

          Now explain:

          +----+-------------+----------+-------+---------------+------------+---------+------+------+-------------+| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |+----+-------------+----------+-------+---------------+------------+---------+------+------+-------------+| 1 | SIMPLE | BB_Posts | range | new_index3 | new_index3 | 9 | NULL | 3 | Using where |+----+-------------+----------+-------+---------------+------------+---------+------+------+-------------+


          Do I win? : )

          Comment


          • #6
            Thanks so much guys for your continued assistance. Jon unfortunately your example isn't exactly what I'm trying to do, but it's my fault for not explaining it better. I've been working with this stupid BB for so long that I forgot people aren't born with an inherent knowledge of its data structure.

            The complication arises from the fact that the BB supports threaded discussion, and each branch will have different last post info. Please consider these two threads:


            + Test Post| B_Number = 2, B_PosterId = 666, B_Posted = 1200000000, B_LastPost = 1200000999|+-- Re: Test Post| B_Number = 3, B_PosterId = 420, B_Posted = 1200000001, B_LastPost = 1200000002| || +-- Re: Re: Test Post| B_Number = 4, B_PosterId = 123, B_Posted = 1200000002, B_LastPost = 1200000002|+-- Re: Test Post BUMP B_Number = 6, B_PosterId = 987, B_Posted = 1200000999, B_LastPost = 1200000999


            + Another Thread| B_Number = 1, B_PosterId = 808, B_Posted = 1100000000, B_LastPost = 1200000888|+-- Re: Another Thread BUMP B_Number = 5, B_PosterId = 123, B_Posted = 1200000888, B_LastPost = 1200000888


            If I'm user #123 and I want to retrieve the main post # for threads I've posted in ordered by most recent activity, the correct order is:
            B_Main = 2 (Last post = 1200000999)
            B_Main = 1 (Last post = 1200000888)

            However if I do something like "SELECT B_Main FROM Posts WHERE B_PosterId = 123 ORDER BY B_LastPost DESC", I'll end up with this:
            B_Main = 1 (Last post = 1200000888)
            B_Main = 2 (Last post = 1200000002)

            (Note that in this example the main post ID order corresponds with the last post stamp order, but that can't be assumed.)

            Anyway that's why my query joins the posts table on itself, so I can get the true last post stamp from the thread's parent, not the last post stamp for whatever particular branch of the thread the user posted in. The alternative is to add a field like B_ThreadLastPost or something, which would contain the same value for each post in a thread: the stamp of the very last post made. However, unless I have no choice I hate to throw out a working query which is almost good enough, duplicate the same data unnecessarily, and deal with populating this field + modifying the forum to use it.

            I hope maybe this is a little more clear?

            Anyway I get the sinking feeling this may not be possible with the existing structure and I'll have to add an additional field. But I hope I'm wrong and I'm still interested in any fresh insight into the problem. I'll still pay $25 to you guys who are making a good faith effort to help, even if the solutions aren't exactly what I need.

            Thanks again!!

            -Y

            Comment


            • #7
              Okay, here is another solution, sort of (of course). It doesn't use filesort, but it uses temporary table:

              Add this index:

              alter table BB_Posts add index index7(B_Last_Post, B_Number);


              And then use the subquery proposed above:

              SELECT Main.B_Number FROM BB_Posts AS Main WHERE Main.B_Number IN ( SELECT DISTINCT Reply.B_Main FROM BB_Posts AS Reply WHERE Reply.B_PosterId = 654 AND Reply.B_Board IN ('1', '2', '3')) ORDER BY Main.B_Last_Post DESC;


              Finally, the explain:

              explain SELECT Main.B_Number FROM BB_Posts AS Main WHERE Main.B_Number IN ( SELECT DISTINCT Reply.B_Main FROM BB_Posts AS Reply WHERE Reply.B_PosterId = 654 AND Reply.B_Board IN ('1', '2', '3')) ORDER BY Main.B_Last_Post DESC;+----+--------------------+-------+-------+--------------------------------------------------------------------------+--------+---------+-------+------+------------------------------+| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |+----+--------------------+-------+-------+--------------------------------------------------------------------------+--------+---------+-------+------+------------------------------+| 1 | PRIMARY | Main | index | NULL | index7 | 8 | NULL | 100 | Using where; Using index || 2 | DEPENDENT SUBQUERY | Reply | ref | board_topic_index,B_Board,w3t_Postsindex3,ID_ndx,n ew_index,index5,index8 | ID_ndx | 4 | const | 1 | Using where; Using temporary |+----+--------------------+-------+-------+--------------------------------------------------------------------------+--------+---------+-------+------+------------------------------+


              What do you think?

              Comment


              • #8
                Oh man... looking at the EXPLAIN I thought you had it for sure... yeah it says it examines all the rows for 'Main', but it's using an index and I'm limiting the results to 25 at a time so no big deal, right? With high hopes I added an index on B_Last_Post, B_Number and tried the query on the full table, and...


                +-----+-----------+--------------------+------------+-------------+------+------------+--------------------------------------------------------------------------------------+| Id | User | Host | db | Command | Time | State | Info |+-----+-----------+--------------------+------------+-------------+------+------------+--------------------------------------------------------------------------------------+| 34 | root | 192.168.0.30:50525 | WWWThreads | Query | 120 | preparing | SELECT SQL_CALC_FOUND_ROWS Main.B_Number FROM posts AS Main WHERE Main.B_Number IN ( |+-----+-----------+--------------------+------------+-------------+------+------------+--------------------------------------------------------------------------------------+


                I had to kill it after two minutes of "preparing".

                It seems hard to evaluate the query performance without the full database of 5 million records, since it really looked like it would work based on the EXPLAIN output.

                I was worried I was missing something obvious but maybe that's not the case... I'm gradually making peace with the idea that I may just need a table structure better tailored to this particular functionality. Any time you want to cut your losses and take the 25 bucks, I'll gladly pay up for the thought-provoking suggestions. I'm sure this technique will come in handy in the future, even if it doesn't work in this specific case. Just send me your Paypal address, my e-mail is ythan at shroomery.org.

                Of course if you're having fun and want to keep trying that's fine too.

                Thanks again guys.

                -Y

                Comment


                • #9
                  Ythan -

                  Another way to approach it might be to split up the single query into two queries.

                  The first might get the last x number of replies from the particular user.

                  select B_Main from BB_Posts where B_PosterId = 654 and B_Board in (1,2,3) order by B_Last_Post desc limit 25;

                  put that into an array, then join it on commas and query for the main posts:

                  select B_Main from BB_Posts where B_Main in ([list]) order by B_Last_Post desc;

                  I know that it is not *quite* what you're looking for, but it might be a good approximation that performs significantly better than the single query.

                  Just some ideas -
                  Good luck!

                  Jon

                  Comment


                  • #10
                    Hierarchical data is a bitch to query efficiently.

                    The problem is that you can only use an index to select the subset of records that you want to sort OR to perform the sort, not both. If there are a large number of records returned, you're stuck with having to sort them all sans index.

                    I can't accept payment for the little assistance I might have provided. I've been helped several times on these forums myself, so I guess I'm just trying to give back.

                    Comment

                    Working...
                    X