November 26, 2014

Four ways to optimize paginated displays

A paginated display is one of the top optimization scenarios we see in the real world. Search results pages, leaderboards, and most-popular lists are good examples. You know the design pattern: display 20 results in some most-relevant order. Show a “next” and “previous” link. And usually, show how many items are in the whole list and how many pages of results there are.

Rendering such a display can consume more resources than the entire rest of the site!

As an example, I’m looking at slow log analysis results (with our microslow patches, set to log all queries) for one client; the slow log contains 6300 seconds’ worth of queries, and the two main queries for the paginated display consumed 2850 and 380 seconds, respectively.

Why is it so expensive? I typically see queries like this:
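
(A representative shape; the table and column names here are placeholders.)

SELECT col1, col2, col3
FROM items
WHERE category = 'widgets'
ORDER BY relevance DESC
LIMIT 500000, 20;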

If the ORDER BY can't use an index (commonly the case), it uses a filesort. Suppose a million rows match the WHERE conditions. That means a million rows are retrieved, stored, and filesorted; then most of them are discarded and only 20 are returned. If the user clicks the "next" button, the same process happens again, except a different 20 rows are kept. And to show the list of pages and the total count, you either a) use SQL_CALC_FOUND_ROWS (see our post on this) or b) execute a separate SELECT to count the rows.

There are ways to optimize so you don’t have to do quite so much offsetting and limiting. I wrote about this on O’Reilly’s website in an article on optimizing ranked data. But frankly it’s not that easy to do in the real world; you can usually optimize for one access method to the data at some significant cost in complexity and maintenance (which might be worth it) but not for many different ways of accessing the same data, which is more typical in websites we work on.

Beyond indexing, reorganizing the data, and optimizing the queries, there are two big things you can do. One is caching aggressively to prevent these queries from running. The other is to rethink the paradigm. Just because everyone lays out such pages in the same way doesn't mean you need to. Think about how you use such pages. Do you really go clicking directly to the Nth page of results, or the last page? "Hmm, it found 13928 results. Let me look at the least relevant search results for my query." Generally not — you usually look at the most helpful stuff, which is supposed to be first in the list.

With that in mind, here are four suggestions for optimizing paginated displays that can give significantly better performance.

  1. On the first query, fetch and cache all the results. Now it’s easy to know how many results there are, and fetching subsequent pages is no extra work for the database. In this model, you get to keep your “found X rows, showing page N of M” display that many people cherish.
  2. Don’t show all results. Not even Google lets you see the millionth result. You get to see N results and after that, you’re done. Limit the results to 100 or 500 or something. Remember, the further you go into the list, the more rows you are scanning and discarding with that LIMIT. This technique works great in conjunction with the first one. If you want to show 500 results, maybe you can fetch 501 and if the 501st row exists, display “more than 500 results found.”
  3. Don't show the total count or the intermediate links to other pages. Show only the "next" link. (If people want to see the "previous" results, they can use their browser's back button.) You can do this by fetching one more result than you want to display — for example, fetch 21 rows and display only 20. If there's a 21st row, render the "next" link; if not, you're at the end of the list. This way you don't have to calculate how many results there are, and if caching is difficult then this is a simple way to avoid some of the costs. (There's a sketch of this after the list.)
  4. Estimate how many results there are. Again, Google does this and nobody complains. Use EXPLAIN and look at the “rows” column — that’s a fine estimate for some scenarios. (This tip doesn’t work in as many scenarios as others, but it’s still acceptable in many.)
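
A minimal sketch of the third suggestion (placeholder table and column names): fetch one row more than the page size, and let the presence of that extra row decide whether to render the "next" link.

# 20 results per page: fetch page 3 (rows 41-60) plus one probe row
SELECT id, title
FROM items
WHERE category = 'widgets'
ORDER BY created DESC
LIMIT 40, 21;

Here the offset is (page number - 1) x (page size). Only 21 rows come back, though the server may still scan and discard the 40 rows before them; if all 21 rows exist, display the first 20 and render the "next" link.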

These suggestions can take a lot of work off the database server without impacting the user’s experience at all.

About Baron Schwartz

Baron is the lead author of High Performance MySQL.
He is a former Percona employee.

Comments

  1. Psih says:

    Good tips, I think I'll borrow them for my next project :)

  2. Andi says:

    Hi, I ran into such a scenario with one of my pages already. Sadly I couldn't show only the first N pages of results; I had to show all pages.
    So I wrote a daily cron job which updates the position of each row in the table, and then I can simply use something like
    SELECT a, b, …, n FROM table WHERE category = X AND sortPos BETWEEN Y AND Z
    to fetch the rows without wasting CPU time on a large LIMIT offset.

    With that method I significantly reduced the CPU load on my database server and no longer get slow queries from that page.

  3. On option three, is there any actual benefit to not displaying the previous button? Not showing links to specific page numbers I can understand, since that requires a knowledge of the total number of pages. But simply by virtue of being on a page that is not the first one, you should know that there is a previous page. And since you should already have information on how far into the list you are, and how many records are on a given page, there shouldn’t be any additional database load needed to create the link to jump backward, just as you have created the link to jump forward.

  4. Anonymous says:

    Or just use PostgreSQL:
    # EXPLAIN ANALYZE select * from table order by a offset 100 limit 20;
                                    QUERY PLAN
    --------------------------------------------------------------------------------
     Limit  (cost=4.15..4.98 rows=20 width=8) (actual time=0.409..0.483 rows=20 loops=1)
       ->  Index Scan using table_idx on table  (cost=0.00..15287.35 rows=368254 width=8) (actual time=0.038..0.407 rows=120 loops=1)
     Total runtime: 0.531 ms
    (3 rows)

  5. Chances are you are better off fetching just the column values that comprise the primary key (usually a single column, the 'id').
    Once you do (for example, SELECT id FROM sometable ORDER BY someConditionThatCanBeSatisfiedByTheIndexData LIMIT offset, size), you collect all the IDs, keep track of the order they came out of the table, run SELECT fields FROM sometable WHERE id IN (idsFetchedEarlier), and then simply re-arrange the fetched rows to match.

  6. Mark, I don’t see any benefit to not showing the previous button, now that I think about it, but if I’m going to rip out unused functionality, sometimes I get carried away :)

    About PostgreSQL, that’s fine if you have an index to order by and it’s the same index that you’re using to find rows. If the index that’s useful to find rows isn’t also useful to order by, which is usually the case, then what?

  7. chad walker says:

    There's an easy way to get results ordered by time if you don't have to have page links, like #3. We use something like this on our internal site to show recent screenshots/videos. We have a table with an auto-increment id as the primary key, then on the first page you do:
    SELECT id FROM foo ORDER BY id DESC LIMIT 21;
    If there's a 21st item you put a next link which contains the id of the 20th item. On the following pages it's:
    SELECT id FROM foo WHERE id < xxxx ORDER BY id DESC LIMIT 21;

  8. Baron,

    Let me elaborate:)

    What we (at work) wind up doing is:
    If we need to know how many pages the result set will span:
    {
    1. SELECT SQL_CALC_FOUND_ROWS id FROM table ORDER BY foo LIMIT offset, size
    2. Collect the IDs (i.e., the columns that comprise the primary key value)
    3. SELECT fields FROM table WHERE id IN (collectedIds)
    4. Re-arrange the fetched rows to match the order of the IDs
    5. SELECT FOUND_ROWS();

    1. This usually takes as little time as possible, because MySQL will usually consult the index to fetch the primary key constituents and, most of the time, won't have to sort anything; it simply walks the tree backwards or forwards directly
    2. This takes negligible time
    3. This takes as little time as it can; MySQL fetches the record data for the required rows as efficiently as possible (i.e., by using the primary key as a reference into the tablespace)
    4. This also takes practically zero time to process
    5. Once we have the total number of matching rows, we can figure out how many pages are available, etc.
    }
    else, if we only care whether a 'next' page is available or not:
    {
    1. SELECT id FROM table ORDER BY foo LIMIT offset, size + 1
    2. Collect the IDs (i.e., the columns that comprise the primary key value), discarding the last one if size + 1 rows came back
    3. SELECT fields FROM table WHERE id IN (collectedIds)
    4. Re-arrange the fetched rows to match the order of the IDs
    5. If we originally got size + 1 rows back from the SELECT in (1), we know there is a next page available
    }

  9. Is this really a "performance" blog when the discussion is about limiting or eliminating functionality to meet the shortcomings of the chosen technology? That's not our job as technologists, nor is it realistic for real-world products. How about choosing a real RDBMS that solved these issues years ago: Oracle? The uneducated answer to this is "the prohibitive cost of it." Let's discuss the real costs of limiting functionality, coding around MySQL's inexcusable inabilities, or deploying memcached servers to effectively cache primary key lookups.

    Make the switch. The learning curve is steep but you won’t regret it.

  10. The downside here is that if you’re at all interested in your content being indexed by search engines, you really can’t use the next/previous links or limit the results to the first few pages.

  11. Pat says:

    Three points, since I looked at optimizing a similar problem not too long ago.

    1) SQL_CALC_FOUND_ROWS can make your query much, much more expensive.

    Consider:

    select * from log order by timestamp limit 1,10;
    create index test on log(timestamp);

    If the optimizer is relatively bright, he won’t table scan, but will instead index scan “test”, find the first 10 rows, and exit immediately. You can get the first 10 rows in milliseconds.

    If you add SQL_CALC_FOUND_ROWS to your query though, the optimizer will have to exhaust the query in order to determine how many rows it COULD have returned, had you asked for all of them.

    2) Consider using "force index" to force retrieval via an index on your ORDER BY column. Even if that's a lousy index from a selectivity standpoint, if your LIMIT clause is small relative to the size of your result set, you'll get out much faster this way. From what I can tell, the MySQL optimizer doesn't take the LIMIT clause into account (or if it does, its strategy is flawed).

    3) On MyISAM, high-offset LIMIT clauses are horribly slow, e.g. select * from log order by timestamp limit 100000, 10; is going to be very, very slow (InnoDB is much faster at these high-offset LIMIT clauses, but still slower than at a low offset). If the data you want is near the end of the result set, it's faster to sort backwards and reverse your limit.

    e.g. if a query returns 100,000 rows

    order by timestamp limit 99990, 10 is MUCH slower than
    order by timestamp desc limit 0, 10

  12. Colnector says:

    Some nice tips in this post, but they are more relevant to showing search results than to other paginated display settings. Google may seem like a magic keyword, but "if Google jumps off the roof, would you?" :)

    IMHO good schema design and proper caching are essential in situations where you actually need your site properly indexed and users need to be able to browse all pages.

  13. Anupam says:

    Hello Peter,
    I am trying to follow this advice:
    “# Estimate how many results there are. Again, Google does this and nobody complains. Use EXPLAIN and look at the “rows” column — that’s a fine estimate for some scenarios. (This tip doesn’t work in as many scenarios as others, but it’s still acceptable in many.)”

    But how do I return the EXPLAIN result set to Java or PHP?

    Regards,
    Anupam

  14. chad walker says:

    Anupam,
    The same way you do any other query, eg:

  15. chad walker says:

    Doh! *mental note: don’t include php tags*

    mysql_connect( "host", "user", "passwd" );
    mysql_select_db( "db" );
    $res = mysql_query( "EXPLAIN SELECT * FROM users" );
    if( $res )
    {
        $estimate = 1;
        while( $row = mysql_fetch_array( $res, MYSQL_ASSOC ) )
        {
            $estimate *= $row['rows'];
        }
        echo "Estimate rows: $estimate.\n";
    }
    else
    {
        echo "SQL error.\n";
    }

  16. chad walker says:

    Oh well, it used to be formatted nicely… The point is, you can run EXPLAINs just like any other query. To get an estimate of the number of rows, multiply all of the 'rows' columns together (if there is more than one table involved).

  17. David BL says:

    I have recently added paging to my website. My first attempt used SQL_CALC_FOUND_ROWS everywhere. This, unfortunately, slowed things down considerably.

    My solution, which works very well and has good performance, does NOT remove functionality as this article suggests.

    I’ll explain how I do it for the forum:
    ———————————————–
    1) I have a statistics table that is updated by triggers on inserts (there's a sketch of such a trigger at the end of this comment):
    statistics_table(serviceid, count, lastinsert):
    +---------+----------+----------------+
    | SERVICE | COUNTER  | LASTINSERT     |
    +---------+----------+----------------+
    | forum   | 10020202 | 2008-10-03...  |
    | users   |    10002 | 2008-09-13...  |
    +---------+----------+----------------+

    (I use InnoDB so getting count(*) is too slow)

    2) Per forum category, I added the same statistics per item in the forum-category table. Again, this table is updated by the same trigger that updates the statistics_table:

    +----------+---------+----------------+
    | CATEGORY | COUNTER | LASTINSERT     |
    +----------+---------+----------------+
    | GENERAL  |   10001 | 2008-10-03...  |
    | TOPIC1   |    1002 | 2008-09-13...  |
    | TOPIC2   |    4002 | 2008-09-13...  |
    +----------+---------+----------------+

    Then, based on the query's WHERE clause, I get the total rows available: the overall counter if the WHERE clause is empty, or the counter from the forum-category table if it filters on a category. If it's anything else, like a free-text search or a filter on a user, then I fall back to the SQL_CALC_FOUND_ROWS method.
    The tables are so small that they are always cached. I wanted to use the MEMORY engine, but for some reason I got empty results when querying the forum-category table when it was in memory.

    Making paging links is then easy using (total rows)/(rows per page)
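
    For reference, the insert trigger can be a single-statement trigger. This is only a sketch; the table and column names (forum_posts, statistics_table, service, counter, lastinsert) are illustrative:

    # bump the forum counter whenever a new post is inserted (sketch)
    CREATE TRIGGER forum_stats_after_insert
    AFTER INSERT ON forum_posts
    FOR EACH ROW
      UPDATE statistics_table
         SET counter = counter + 1, lastinsert = NOW()
       WHERE service = 'forum';

    The per-category counter can be updated the same way, or both updates can go into one trigger body wrapped in BEGIN ... END.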

    David

  18. David, thanks for sharing your approach. That looks like a great way to do it. I think it would be hard to apply to search results (as on a shopping site) where searching is the norm and showing by category is less used, but in this case it looks like a perfect match. Knowing your data and access patterns lets you do all kinds of good things!

  19. rar says:

    Ditto David. Maintain count tables. The End.

  20. rar, for SOME applications. It is absolutely not The End for all. For many apps I work on, that method would a) not work because there are too many permutations of criteria to count and b) if you could do it, it’d load the site much more heavily than not having counter tables.

  21. Sam says:

    Baron,

    I had a slow query that involved pagination, so I investigated after reading your post on the matter. The result of my investigation was very interesting. Basically, I had a query that was taking about 3 seconds to execute. The SELECT portion of the query was specifying which columns to fetch (i.e. SELECT table.col1, table.col2, … table.col9). When I simply changed this to "SELECT table.*", the query time dropped to 0.01 sec! So, in the end, it did not appear that the slowness was a result of the pagination at all (the query uses ORDER BY, GROUP BY, WHERE and LIMIT, and orders on a non-indexed field). Using EXPLAIN did not show any difference in the query optimization path for the two forms of the query either. MySQL bug?

    Sam

  22. Bill says:

    I am with Baron on my app. There is no way to maintain count tables when users could be searching on any combination of 30 different fields. I also tried Chad Walker's method without success. I have many joins; using COUNT() I got 107 total rows, but multiplying the EXPLAIN row estimates gave 423654. Not sure if my query has other optimization problems, but that estimate is far from usable.

  23. trr says:

    @Sam,

    Could that difference have been caused by a large part of that query being cached the second time?

  24. Using the estimate from EXPLAIN is a really cool idea. For most of my paginations I created a different approach (and even use different approaches within one application; this one is also included in my recently published SQL class), which costs a single O(n) step, but the rest is really fast. If you copy only the id's this way, you can run the query on a result set of, let's say, 10 million rows in under 1 s (on the right hardware). Okay, you should cache the result, because it is not fair on concurrency, but here are the queries I run when I want to sort by the just-calculated column "rank" and narrow down the result with $offset and $limit:

    # create a temporary table with all elements and a key on rank
    CREATE TEMPORARY TABLE _tmp (KEY SORT(rank DESC))
    SELECT id, rank FROM table;

    # add a primary key for a fast lookup
    ALTER TABLE _tmp ADD OFFSET INT UNSIGNED PRIMARY KEY AUTO_INCREMENT, DROP INDEX SORT, ORDER BY rank;

    # scan the primary key and get the result, maybe add a join to get real data
    SELECT id FROM _tmp WHERE OFFSET >= $offset ORDER BY OFFSET LIMIT $limit;

    Robert

  25. I forgot to mention that using MyISAM for _tmp is also great for performing statistical queries very fast. To be able to generate the links without an estimate, use:

    SELECT COUNT(*) FROM _tmp;

  26. @Andi – your solution is not for everyone, but it helped me a lot. Thanks.

  27. Anton Kanevsky says:

    “Don’t show all results. Not even Google lets you see the millionth result.”

    True, Google only allows you to see the first 1000 search results. However, Amazon's catalog has some categories with 10 million items, and they show the count. Furthermore, in the MP3 section, you can go to any page in a section by modifying the URL parameters. And when you go to the 5000th page, the results still load without any apparent delay. Since Amazon implemented such a catalog, it must be technically possible. However, I don't believe that they cache all of their possible results because, as Baron pointed out:

    “For many apps I work on, that method would a) not work because there are too many permutations of criteria to count and b) if you could do it, it’d load the site much more heavily than not having counter tables.”

    What can you say about this?
