I am working on a tough mySQL problem, and need some help optimizing an ORDER BY query. Here is a table description:
CREATE TABLE `urls` (
`urls_id` bigint(32) NOT NULL auto_increment,
`urls_url` varchar(96) collate utf8_unicode_ci NOT NULL default '',
`urls_last_updated` datetime NOT NULL default '0000-00-00 00:00:00',
`urls_added` datetime NOT NULL default '0000-00-00 00:00:00',
`urls_active` tinyint(1) NOT NULL default '0',
`urls_type` enum('MAIN','CATEGORY','LISTING') collate utf8_unicode_ci NOT NULL default 'LISTING',
`urls_name` varchar(255) collate utf8_unicode_ci NOT NULL default '',
`urls_update_count` int(16) NOT NULL default '0',
`urls_hold_until` datetime NOT NULL default '0000-00-00 00:00:00',
`urls_last_discovered` datetime NOT NULL default '0000-00-00 00:00:00',
`urls_dropdown_name` varchar(64) collate utf8_unicode_ci NOT NULL default '',
`urls_flag` enum('NONE','EXPIRED','FLAGGED','REMOVED') collate utf8_unicode_ci NOT NULL default 'NONE',
PRIMARY KEY (`urls_id`),
UNIQUE KEY `urls_unique` (`urls_url`),
KEY `get_urls` (`urls_active`,`urls_hold_until`,`urls_flag`,`urls _type`),
KEY `last_updated` (`urls_last_updated`),
KEY `get_urls2` (`urls_active`,`urls_hold_until`,`urls_flag`,`urls _type`,`ur ls_last_updated`)
) ENGINE=InnoDB AUTO_INCREMENT=7621125 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
select count(*) from urls;
+----------+
| count(*) |
+----------+
| 7620415 |
+----------+
1 row in set (28.24 sec)
The indexes are experimental to do the query I am trying to optimize below, and are not permanent to the schema.
here is the query I am trying to optimize:
SELECT urls_id, urls_url, urls_type, urls_last_updated FROM urls WHERE urls_active = '1' and urls_flag = 'NONE' and urls_hold_until < now() and urls_type = 'CATEGORY' ORDER BY urls_last_updated LIMIT 100;
Explain on this query:
+----+-------------+-------+------+--------------------+---- ------+---------+-------+--------+-------------------------- ---+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+--------------------+---- ------+---------+-------+--------+-------------------------- ---+
| 1 | SIMPLE | urls | ref | get_urls,get_urls2 | get_urls | 1 | const | 418179 | Using where; Using filesort |
+----+-------------+-------+------+--------------------+---- ------+---------+-------+--------+-------------------------- ---+
Filesort is a big problem, so if I remove ORDER BY the "Using filesort" goes away. Here is the explain on this query (dropping the ORDER BY)
SELECT urls_id, urls_url, urls_type, urls_last_updated FROM urls WHERE urls_active = '1' and urls_flag = 'NONE' and urls_hold_until < now() and urls_type = 'CATEGORY' LIMIT 100;
+----+-------------+-------+-------+---------------+-------- ---+---------+------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+-------- ---+---------+------+---------+-------------+
| 1 | SIMPLE | urls | range | get_urls2 | get_urls2 | 9 | NULL | 3815726 | Using where |
+----+-------------+-------+-------+---------------+-------- ---+---------+------+---------+-------------+
1 row in set (0.00 sec)
The first query with the ORDER BY takes 1 minute 38 seconds over 7 million rows. The second query takes 0.01 seconds (with query cache turned off)
How do I get this query optimized, understanding that I MUST have the order by in there, or the equivalent.
Building another table, running periodic summary queries into another table are not acceptable solutions, unfortunately (
CREATE TABLE `urls` (
`urls_id` bigint(32) NOT NULL auto_increment,
`urls_url` varchar(96) collate utf8_unicode_ci NOT NULL default '',
`urls_last_updated` datetime NOT NULL default '0000-00-00 00:00:00',
`urls_added` datetime NOT NULL default '0000-00-00 00:00:00',
`urls_active` tinyint(1) NOT NULL default '0',
`urls_type` enum('MAIN','CATEGORY','LISTING') collate utf8_unicode_ci NOT NULL default 'LISTING',
`urls_name` varchar(255) collate utf8_unicode_ci NOT NULL default '',
`urls_update_count` int(16) NOT NULL default '0',
`urls_hold_until` datetime NOT NULL default '0000-00-00 00:00:00',
`urls_last_discovered` datetime NOT NULL default '0000-00-00 00:00:00',
`urls_dropdown_name` varchar(64) collate utf8_unicode_ci NOT NULL default '',
`urls_flag` enum('NONE','EXPIRED','FLAGGED','REMOVED') collate utf8_unicode_ci NOT NULL default 'NONE',
PRIMARY KEY (`urls_id`),
UNIQUE KEY `urls_unique` (`urls_url`),
KEY `get_urls` (`urls_active`,`urls_hold_until`,`urls_flag`,`urls _type`),
KEY `last_updated` (`urls_last_updated`),
KEY `get_urls2` (`urls_active`,`urls_hold_until`,`urls_flag`,`urls _type`,`ur ls_last_updated`)
) ENGINE=InnoDB AUTO_INCREMENT=7621125 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
select count(*) from urls;
+----------+
| count(*) |
+----------+
| 7620415 |
+----------+
1 row in set (28.24 sec)
The indexes are experimental to do the query I am trying to optimize below, and are not permanent to the schema.
here is the query I am trying to optimize:
SELECT urls_id, urls_url, urls_type, urls_last_updated FROM urls WHERE urls_active = '1' and urls_flag = 'NONE' and urls_hold_until < now() and urls_type = 'CATEGORY' ORDER BY urls_last_updated LIMIT 100;
Explain on this query:
+----+-------------+-------+------+--------------------+---- ------+---------+-------+--------+-------------------------- ---+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+--------------------+---- ------+---------+-------+--------+-------------------------- ---+
| 1 | SIMPLE | urls | ref | get_urls,get_urls2 | get_urls | 1 | const | 418179 | Using where; Using filesort |
+----+-------------+-------+------+--------------------+---- ------+---------+-------+--------+-------------------------- ---+
Filesort is a big problem, so if I remove ORDER BY the "Using filesort" goes away. Here is the explain on this query (dropping the ORDER BY)
SELECT urls_id, urls_url, urls_type, urls_last_updated FROM urls WHERE urls_active = '1' and urls_flag = 'NONE' and urls_hold_until < now() and urls_type = 'CATEGORY' LIMIT 100;
+----+-------------+-------+-------+---------------+-------- ---+---------+------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+-------- ---+---------+------+---------+-------------+
| 1 | SIMPLE | urls | range | get_urls2 | get_urls2 | 9 | NULL | 3815726 | Using where |
+----+-------------+-------+-------+---------------+-------- ---+---------+------+---------+-------------+
1 row in set (0.00 sec)
The first query with the ORDER BY takes 1 minute 38 seconds over 7 million rows. The second query takes 0.01 seconds (with query cache turned off)
How do I get this query optimized, understanding that I MUST have the order by in there, or the equivalent.
Building another table, running periodic summary queries into another table are not acceptable solutions, unfortunately (
Comment