GET 24/7 LIVE HELP NOW

Announcement

Announcement Module
Collapse
No announcement yet.

hash index lookup

Page Title Module
Move Remove Collapse
X
Conversation Detail Module
Collapse
  • Filter
  • Time
  • Show
Clear All
new posts

  • hash index lookup

    Hi,
    After asking this question to the publishers, they suggested I post it here, so here it goes:

    I tried the hash indexing techniques mentioned in Chapter 5 (Indexing for high performance) of "High Performance MySQL" by Schwartz, Zaitsev and Tkachenko.
    Below is a description of what I tried, my setup and my results. I hope you can comment on that and let me know if I missed anything, misunderstood anything or did anything wrong.

    I have a table of (several millions of) urls which is growing a bit large (a few GiB) and I figured using a hash index on the urls might decrease this size.
    This is the original create code of my table (without hash index):
    CREATE TABLE `urls` (
    `Id` INT(10) UNSIGNED NOT NULL AUTO_INCREMENT, `URL` TEXT NOT NULL, PRIMARY KEY (`Id`), UNIQUE INDEX `Url` (`URL`(255)) USING BTREE
    )
    COLLATE='utf8_general_ci'
    ENGINE=InnoDB

    For a table with 1 million urls in it (I experimented with just a subset of my complete urls table), this results in a table of about 240 MiB.

    Using a hash index on the url column, as all suggested and explained in Chapter 5:

    CREATE TABLE `pseudohash` (

    `id` INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,

    `url` VARCHAR(255) NOT NULL,

    `url_crc` INT(10) UNSIGNED NOT NULL DEFAULT '0',

    PRIMARY KEY (`id`),

    INDEX `UrlHash` (`url_crc`)

    )

    COLLATE='utf8_general_ci'

    ENGINE=InnoDB

    I managed to basically half the size of the table. The same 1 million urls in this table result in a table of about 120 MiB.
    For completeness' sake, I copy the trigger I used before inserting into this table (but it's basically directly taken from Chapter 5 as well):
    SET @OLDTMP_SQL_MODE=@@SQL_MODE, SQL_MODE='STRICT_TRANS_TABLES,NO_AUTO_CREATE_USER, NO_ENGINE_SUBSTITUTION';
    DELIMITER //
    CREATE TRIGGER `pseudohash_crc_ins` BEFORE INSERT ON `pseudohash` FOR EACH ROW BEGIN SET NEW.url_crc=crc32(NEW.url); END// DELIMITER ; SET SQL_MODE=@OLDTMP_SQL_MODE;

    So I was very happy about significantly reducing storage space for this table. But then I decided to compare lookup times for both tables.
    I took the first 10.000 rows from both tables (the content of both tables are identical, apart from the url_crc column) and measured the time it took to look them up one by one, using a simple 'SELECT id FROM urls WHERE url='$url''. Obviously, because in the second case, there is no index on url, I needed an extra clause, adding up to: 'SELECT id FROM pseudohash WHERE url_crc=CRC32(($url)) AND url=($url)'.

    Doing this in a script on my system (not really relevant, since it's the relative difference I'm interested in, but in any case: Windows 8.1 64-bit, 3.4 GhZ processor, 12 GB memory) resulted in the following:
    Looking up these 10k urls in the table with normal/b-tree index: roughly 17 seconds.
    Looking up the same 10k urls in the table with hash index: roughly 1 minute and 18 seconds.

    Though significantly reducing storage space, this performance issue renders this implementation of hash index pretty much useless in my case. So my question is:
    Is this extra time just due to the fact that this extra operation has to be done ('WHERE url_crc=CRC32($url)...') and is this a known drawback of this hash index approach? Or am I missing something or did I do anything stupid to unnecessarily slow it down?

    Hope you can answer my question. If there is any additional information you need in order to be able to answer my question, I would be happy to provide you with that. Please let me know!

    cheers,
    peter
Working...
X