EmergencyEMERGENCY? Get 24/7 Help Now!

Getting around optimizer limitations with an IN() list

 | January 9, 2010 |  Posted In: Insight for Developers, MySQL

PREVIOUS POST
NEXT POST

There was a discussion on LinkedIn one month ago that caught my eye:

Database search by “within x number of miles” radius?

Anyone out there created a zipcode database and created a “search within x numer of miles” function ?
Thankful for any tips you can throw my way..

J

A few people commented that some solutions wouldn’t scale. To understand why these sorts of geographic search queries are problematic in MySQL, it’s best to show some execution plans on dummy data:

Did you notice that we estimate just as many rows on the first EXPLAIN as the second one? That doesn’t make any sense! The index covers x,y and col_a and should be eliminating a lot of searching, since there is only one row which meets this condition!

The reason for this is simply a missing feature of the MySQL optimizer – and it has to do with using x BETWEEN 30 and 40 (and it’s also true with x >= 30 AND x <= 40). Using a range like this prevents us from using the rest of the index. There is a workaround, but it’s not pretty:

The ugliest thing about this, is that in real life you wouldn’t know all your possible values of X or Y, and so you may end up with a very big IN list. The workaround to this, is to create steppings of the value X and Y that we can use for indexes:

Fantastic! The only remaining problem with this query is that it’s not quite identical to our original. In this query 60.79 will be floored to 60 (and erroneously included in our results):

However, this is a quick fix by re-including the original WHERE conditions (we are just no longer using an index on them):

Conclusion:
My examples were only on a small amount of data (16 000 rows) that fitted in memory, but the original query would have full table scanned if I didn’t use a FORCE INDEX hint. Add more data, and if X can’t filter out enough rows by itself this can create a real problem.

Workarounds are all very good, but they also make applications more difficult to maintain. If you really want to do these types of queries, you should give Sphinx a try.

PREVIOUS POST
NEXT POST
Morgan Tocker

Morgan is a former Percona employee. He was the Director of Training at Percona. He was formerly a Technical Instructor for MySQL and Sun Microsystems. He has also previously worked in the MySQL Support Team, and provided DRBD support.

9 Comments

  • Mark – The queries above work with InnoDB tables. It seems a shame to loose crash recovery just to be able to use the GIS features of MyISAM.

    [Edit: Replied a bit too quick. While both InnoDB and MyISAM support spatial data types, only MyISAM supports indexing of them. You could index an individual point, but you wouldn’t be able to have an index covering the point and the col_a, as I needed in this example (results in an error: ERROR 1210 (HY000): Incorrect arguments to SPATIAL INDEX). GIS functions like DISTANCE, WITHIN are not supported by MySQL yet. See here ]

  • Hi Morgan,

    The results of this are actually a bit confusing to me. Shouldn’t the BETWEEN queries be optimized by the Range Access Method?

    i.e., with x_y_col_a (x, y, col_a): (30, 50) <= (x, y) <= (40, 60)

    Only thing I'm not sure about is the "col_a" at the end of the query which could be messing with that specific optimization.

    Furthermore, aren't IN () statements technically considered to be range (because of the sorting and binary tree search)? They aren't noted as being able to be optimized by the "Range Access Method for Multi-Part Keys" either, so I don't see how they would perform like they are.

    Darius

  • Morgan,

    One thing I would notice the combinatorial nature of nested IN’s so you can have steps small enough so it can be looked up efficiently.

    If you have 100 elements in IN list for both X and Y MySQL will end up evaluating 10000 (X,Y) combinations doing pretty much this number of point lookups

    1000 or 10000 are reasonable values for most cases if your data fits in memory but be careful if you get into the millions.

    Also in the newer MySQL versions there is a certain cut-off – if you get way too large IN list MySQL may start using index only for certain prefix. This is a good change as previous versions could be caused OOM by creating too large cascaded IN lists 🙂

  • Darius,

    The confusion is justified 😉 It makes perfect sense that you should be able to do what I described with BETWEEN, but past the first column (x), MySQL won’t use the index any more. Hence the workaround.

    “Furthermore, aren’t IN () statements technically considered to be range”

    Range is a very generic term. Think about the differences between these two queries:

    mysql> explain select * from coordinates where id between 10 and 30;
    +—-+————-+————-+——-+—————+———+———+——+——+————-+
    | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
    +—-+————-+————-+——-+—————+———+———+——+——+————-+
    | 1 | SIMPLE | coordinates | range | PRIMARY | PRIMARY | 4 | NULL | 24 | Using where |
    +—-+————-+————-+——-+—————+———+———+——+——+————-+
    1 row in set (0.00 sec)

    mysql> explain select * from coordinates where id between 10 and 20 or id between 90 and 100;
    +—-+————-+————-+——-+—————+———+———+——+——+————-+
    | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
    +—-+————-+————-+——-+—————+———+———+——+——+————-+
    | 1 | SIMPLE | coordinates | range | PRIMARY | PRIMARY | 4 | NULL | 25 | Using where |
    +—-+————-+————-+——-+—————+———+———+——+——+————-+
    1 row in set (0.00 sec)

    Clearly they are not the same, but EXPLAIN is unable to really distinguish. The second one is really two smaller ranges.

  • Morgan,
    you said there must be a solution to cover not just a single point but rather a whole circle around a point. I developed something exactly covering the problem a few month ago: http://www.xarg.org/2009/12/people-near-you-with-mysql/

    Robert

  • Morgan,

    you said there must be a solution to cover not just a single point but rather a whole circle around a point. I’ve developed something exactly covering the problem a few month ago: http://www.xarg.org/2009/12/people-near-you-with-mysql/

    Sure, it is also bound to the r-tree indexes and spatial extension but maybe it will be full supported by innodb later. In the meantime you can use a denormalized myisam table for that job.

    Robert

  • Spatial search = Postgres.

    SELECT count(*) FROM archive_data;
    count
    ———
    3063044

    EXPLAIN ANALYZE SELECT count(*) FROM archive_data WHERE coords && ‘(45.64870078164,4.7909580578643),(45.68463321836,4.8423759421357)’::BOX;
    QUERY PLAN
    ——————————————————————————————————————————————–
    Aggregate (cost=36651.82..36651.83 rows=1 width=0) (actual time=7.104..7.104 rows=1 loops=1)
    -> Bitmap Heap Scan on archive_data (cost=756.21..36613.53 rows=15315 width=0) (actual time=3.197..5.828 rows=11686 loops=1)
    Recheck Cond: (coords && ‘(45.68463321836,4.8423759421357),(45.64870078164,4.7909580578643)’::box)
    -> Bitmap Index Scan on archive_data_coords (cost=0.00..752.38 rows=15315 width=0) (actual time=3.102..3.102 rows=11686 loops=1)
    Index Cond: (coords && ‘(45.68463321836,4.8423759421357),(45.64870078164,4.7909580578643)’::box)
    Total runtime: 6.009 ms
    (6 lignes)

Leave a Reply

 
 

Percona’s widely read Percona Data Performance blog highlights our expertise in enterprise-class software, support, consulting and managed services solutions for both MySQL® and MongoDB® across traditional and cloud-based platforms. The decades of experience represented by our consultants is found daily in numerous and relevant blog posts.

Besides specific database help, the blog also provides notices on upcoming events and webinars.
Want to get weekly updates listing the latest blog posts? Subscribe to our blog now! Submit your email address below and we’ll send you an update every Friday at 1pm ET.

No, thank you. Please do not ask me again.