Downloads

Blog

How to Speed Up Pattern Matching Queries

March 19, 2018

Author

Tibor Korocz

Insight for DBAs

MySQL

Share this Post:

From time to time I see pattern matching queries with conditions that look like this: “where fieldname like ‘%something%’ “. MySQL cannot use indexes for these kinds of queries, which means it has to do a table scan every single time.

(That’s really only half true — there are the FullText indexes. In another blog post I will cover FullText indexes as well.)

I recently was trying to find a solution, and my friend Charles Nagy reminded me of Trigrams. Let me show you the Trigram of the name Daniel:

daniel:
dan
ani
nie
iel

daniel:

dan

ani

nie

iel

But how is this useful?

Let me show you an example. You have the following email schema:

CREATE TABLE `email` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `email` varchar(120) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',
  PRIMARY KEY (`id`),
  KEY `idx_email` (`email`)
) ENGINE=InnoDB AUTO_INCREMENT=318459 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

CREATE TABLE `email` (

`id` int(10) unsigned NOT NULL AUTO_INCREMENT,

`email` varchar(120) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',

PRIMARY KEY (`id`),

KEY `idx_email` (`email`)

) ENGINE=InnoDB AUTO_INCREMENT=318459 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

With data like these:

+--------+-------------------------------+
| id     | email                         |
+--------+-------------------------------+
|   8392 | ehintz@example.com            |
| 238396 | williamson.pierre@example.net |
| 286517 | godfrey79@example.org         |
| 137291 | welch.durward@example.org     |
| 291984 | curtis67@example.net          |
...
+--------+-------------------------------+

+--------+-------------------------------+

| id | email |

+--------+-------------------------------+

| 8392 | ehintz@example.com |

| 238396 | williamson.pierre@example.net |

| 286517 | godfrey79@example.org |

| 137291 | welch.durward@example.org |

| 291984 | curtis67@example.net |

...

+--------+-------------------------------+

And we are looking for email addresses like ‘%n.pierre%’:

mysql> select * from email where email like '%n.pierre%';
+--------+-------------------------------+
| id     | email                         |
+--------+-------------------------------+
|  90587 | anderson.pierre@example.org   |
| 133352 | friesen.pierre@example.net    |
| 118937 | green.pierre@example.net      |
| 118579 | hudson.pierre@example.com     |
| 237928 | johnson.pierre@example.org    |
|  59526 | larkin.pierre@example.net     |
| 278384 | monahan.pierre@example.net    |
|  58625 | olson.pierre@example.org      |
| 306718 | rohan.pierre@example.com      |
| 200608 | wilkinson.pierre@example.org  |
| 238396 | williamson.pierre@example.net |
+--------+-------------------------------+
11 rows in set (0.12 sec)

mysql> explain select * from email where email like '%n.pierre%'G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: email
   partitions: NULL
         type: index
possible_keys: NULL
          key: idx_email
      key_len: 362
          ref: NULL
         rows: 332475
     filtered: 11.11
        Extra: Using where; Using index
1 row in set, 1 warning (0.00 sec)

mysql> show session status like '%handler%';
+----------------------------+--------+
| Variable_name              | Value  |
+----------------------------+--------+
| Handler_commit             | 1      |
| Handler_delete             | 0      |
| Handler_discover           | 0      |
| Handler_external_lock      | 2      |
| Handler_mrr_init           | 0      |
| Handler_prepare            | 0      |
| Handler_read_first         | 1      |
| Handler_read_key           | 1      |
| Handler_read_last          | 0      |
| Handler_read_next          | 318458 |
| Handler_read_prev          | 0      |
| Handler_read_rnd           | 0      |
| Handler_read_rnd_next      | 0      |
| Handler_rollback           | 0      |
| Handler_savepoint          | 0      |
| Handler_savepoint_rollback | 0      |
| Handler_update             | 0      |
| Handler_write              | 0      |
+----------------------------+--------+
18 rows in set (0.00 sec)

mysql> select * from email where email like '%n.pierre%';

+--------+-------------------------------+

| id | email |

+--------+-------------------------------+

| 90587 | anderson.pierre@example.org |

| 133352 | friesen.pierre@example.net |

| 118937 | green.pierre@example.net |

| 118579 | hudson.pierre@example.com |

| 237928 | johnson.pierre@example.org |

| 59526 | larkin.pierre@example.net |

| 278384 | monahan.pierre@example.net |

| 58625 | olson.pierre@example.org |

| 306718 | rohan.pierre@example.com |

| 200608 | wilkinson.pierre@example.org |

| 238396 | williamson.pierre@example.net |

+--------+-------------------------------+

11 rows in set (0.12 sec)

mysql> explain select * from email where email like '%n.pierre%'G

*************************** 1. row ***************************

id: 1

select_type: SIMPLE

table: email

partitions: NULL

type: index

possible_keys: NULL

key: idx_email

key_len: 362

ref: NULL

rows: 332475

filtered: 11.11

Extra: Using where; Using index

1 row in set, 1 warning (0.00 sec)

mysql> show session status like '%handler%';

+----------------------------+--------+

| Variable_name | Value |

+----------------------------+--------+

| Handler_commit | 1 |

| Handler_delete | 0 |

| Handler_discover | 0 |

| Handler_external_lock | 2 |

| Handler_mrr_init | 0 |

| Handler_prepare | 0 |

| Handler_read_first | 1 |

| Handler_read_key | 1 |

| Handler_read_last | 0 |

| Handler_read_next | 318458 |

| Handler_read_prev | 0 |

| Handler_read_rnd | 0 |

| Handler_read_rnd_next | 0 |

| Handler_rollback | 0 |

| Handler_savepoint | 0 |

| Handler_savepoint_rollback | 0 |

| Handler_update | 0 |

| Handler_write | 0 |

+----------------------------+--------+

18 rows in set (0.00 sec)

There are 11 email addresses, but it has to scan the whole index (318458 rows). That’s not good! Let’s try and make it better.

Trigram table

I created a table like this:

CREATE TABLE `email_trigram` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `email_id` int(10) unsigned NOT NULL,
  `trigram` char(3) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',
  PRIMARY KEY (`id`),
  KEY `idx_email_id` (`email_id`),
  KEY `idx_trigram` (`trigram`)
) ENGINE=InnoDB AUTO_INCREMENT=2749001 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

CREATE TABLE `email_trigram` (

`id` int(10) unsigned NOT NULL AUTO_INCREMENT,

`email_id` int(10) unsigned NOT NULL,

`trigram` char(3) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',

PRIMARY KEY (`id`),

KEY `idx_email_id` (`email_id`),

KEY `idx_trigram` (`trigram`)

) ENGINE=InnoDB AUTO_INCREMENT=2749001 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

As we can see, there is an index called “trigram“.

The plan is to create a trigram for every single email addresses. I wrote the following trigger:

create trigger insert_trigram after insert ON test.email FOR EACH ROW 
BEGIN

DECLARE email_length int;
declare x int ;
declare i int ;
SELECT CHAR_LENGTH(SUBSTRING_INDEX(email,'@',1)) into email_length from test.email where email=NEW.email and id=NEW.id;

set i=1;
set x=3;
	while email_length >= 3 do

	INSERT INTO email_trigram (email_id,trigram) values (NEW.id,substring(NEW.email from i for x));
	set email_length=email_length-1;
	set i=i+1;
	end while ;

END

create trigger insert_trigram after insert ON test.email FOR EACH ROW

BEGIN

DECLARE email_length int;

declare x int ;

declare i int ;

SELECT CHAR_LENGTH(SUBSTRING_INDEX(email,'@',1)) into email_length from test.email where email=NEW.email and id=NEW.id;

set i=1;

set x=3;

while email_length >= 3 do

INSERT INTO email_trigram (email_id,trigram) values (NEW.id,substring(NEW.email from i for x));

set email_length=email_length-1;

set i=i+1;

end while ;

END

When there is an insert, it creates and inserts the trigrams into the email_trigram table. Trigrams for anderson.pierre:

mysql> select trigram from email_trigram where email_id=90587;
+---------+
| trigram |
+---------+
| and     |
| nde     |
| der     |
| ers     |
| rso     |
| son     |
| on.     |
| n.p     |
| .pi     |
| pie     |
| ier     |
| err     |
| rre     |
+---------+

mysql> select trigram from email_trigram where email_id=90587;

+---------+

| trigram |

+---------+

| and |

| nde |

| der |

| ers |

| rso |

| son |

| on. |

| n.p |

| .pi |

| pie |

| ier |

| err |

| rre |

+---------+

With the following query, we can find all the email addresses with n.pierre:

SELECT e.email
FROM email AS e
INNER JOIN
( SELECT tr.email_id
FROM email_trigram AS tr
WHERE tr.trigram IN ("n.p",".pi","pie","ier","err","rre")
GROUP BY tr.email_id HAVING count(*) =6) AS ti ON ti.email_id=e.id;

*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: <derived2>
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 8214
     filtered: 100.00
        Extra: NULL
*************************** 2. row ***************************
           id: 1
  select_type: PRIMARY
        table: e
   partitions: NULL
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: ti.email_id
         rows: 1
     filtered: 100.00
        Extra: NULL
*************************** 3. row ***************************
           id: 2
  select_type: DERIVED
        table: tr
   partitions: NULL
         type: range
possible_keys: idx_email_id,idx_trigram
          key: idx_trigram
      key_len: 9
          ref: NULL
         rows: 8214
     filtered: 100.00
        Extra: Using index condition; Using temporary; Using filesort
3 rows in set, 1 warning (0.00 sec)

SELECT e.email

FROM email AS e

INNER JOIN

( SELECT tr.email_id

FROM email_trigram AS tr

WHERE tr.trigram IN ("n.p",".pi","pie","ier","err","rre")

GROUP BY tr.email_id HAVING count(*) =6) AS ti ON ti.email_id=e.id;

*************************** 1. row ***************************

id: 1

select_type: PRIMARY

table: <derived2>

partitions: NULL

type: ALL

possible_keys: NULL

key: NULL

key_len: NULL

ref: NULL

rows: 8214

filtered: 100.00

Extra: NULL

*************************** 2. row ***************************

id: 1

select_type: PRIMARY

table: e

partitions: NULL

type: eq_ref

possible_keys: PRIMARY

key: PRIMARY

key_len: 4

ref: ti.email_id

rows: 1

filtered: 100.00

Extra: NULL

*************************** 3. row ***************************

id: 2

select_type: DERIVED

table: tr

partitions: NULL

type: range

possible_keys: idx_email_id,idx_trigram

key: idx_trigram

key_len: 9

ref: NULL

rows: 8214

filtered: 100.00

Extra: Using index condition; Using temporary; Using filesort

3 rows in set, 1 warning (0.00 sec)

It does not have to read the whole table, but still needs to read a lot of rows and even using filesort. I did not want to create trigrams manually, so I wrote the following procedure:

create procedure select_email(IN email_address varchar(255))
begin

DECLARE email_length int;
declare x int ;
declare i int ;
declare trigram varchar(255) ;
declare trigram_list varchar(255) ;

SELECT CHAR_LENGTH(SUBSTRING_INDEX(email_address,'@',1)) into email_length;

set i=1;
set x=3;
	while email_length >= 3 do

	select substring(SUBSTRING_INDEX(email_address,'@',1) from i for x) into trigram;
	set trigram_list=concat_ws('","',trigram_list,trigram);

	set email_length=email_length-1;
	set i=i+1;
	end while ;

set trigram_list=concat('"',trigram_list,'"');

set @final= concat( " SELECT e.email
FROM email AS e
INNER JOIN
( SELECT tr.email_id
	FROM email_trigram AS tr
	WHERE tr.trigram IN (" , trigram_list , ")
	GROUP BY tr.email_id HAVING count(*) =" , i-1 , ") AS ti ON ti.email_id=e.id" );

PREPARE stmt FROM @final;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;

end

create procedure select_email(IN email_address varchar(255))

begin

DECLARE email_length int;

declare x int ;

declare i int ;

declare trigram varchar(255) ;

declare trigram_list varchar(255) ;

SELECT CHAR_LENGTH(SUBSTRING_INDEX(email_address,'@',1)) into email_length;

set i=1;

set x=3;

while email_length >= 3 do

select substring(SUBSTRING_INDEX(email_address,'@',1) from i for x) into trigram;

set trigram_list=concat_ws('","',trigram_list,trigram);

set email_length=email_length-1;

set i=i+1;

end while ;

set trigram_list=concat('"',trigram_list,'"');

set @final= concat( " SELECT e.email

FROM email AS e

INNER JOIN

( SELECT tr.email_id

FROM email_trigram AS tr

WHERE tr.trigram IN (" , trigram_list , ")

GROUP BY tr.email_id HAVING count(*) =" , i-1 , ") AS ti ON ti.email_id=e.id" );

PREPARE stmt FROM @final;

EXECUTE stmt;

DEALLOCATE PREPARE stmt;

end

Since with trigrams we are looking for parts of the words (like err or ier), there can be many matches. If we are using a longer condition like derson.pierre, the procedure needed to read 65722 rows. This is also a lot.

Let’s have a look at the selectivity a bit:

mysql> select trigram,count(*) as c from email_trigram group by trigram order by c desc limit 10;
+---------+-------+
| trigram | c     |
+---------+-------+
| er.     | 12919 |
| ann     | 12726 |
| man     | 12462 |
| ell     | 12257 |
| ber     | 12225 |
| sch     | 11753 |
| son     | 10983 |
| ill     |  9443 |
| mar     |  9294 |
| ert     |  8839 |
+---------+-------+
10 rows in set (1.59 sec)

mysql> select trigram,count(*) as c from email_trigram group by trigram order by c desc limit 10;

+---------+-------+

| trigram | c |

+---------+-------+

| er. | 12919 |

| ann | 12726 |

| man | 12462 |

| ell | 12257 |

| ber | 12225 |

| sch | 11753 |

| son | 10983 |

| ill | 9443 |

| mar | 9294 |

| ert | 8839 |

+---------+-------+

10 rows in set (1.59 sec)

There are parts that give back many rows. As I mentioned, more parts mean more rows.

I was hoping for a bigger improvement, so I wondered what else we could do. MySQL cannot use an index because of the leading %. How can we avoid that? Let’s save all the possible versions of the email address that we could be looking for.

(I don’t know if there is any official name of this method — if someone knows it, please write a comment.)

Shorting method

anderson.pierre@example.org:
anderson.pierre
nderson.pierre
derson.pierre
erson.pierre
rson.pierre
son.pierre
on.pierre
n.pierre
.pierre
pierre
ierre
erre
rre

anderson.pierre@example.org:

anderson.pierre

nderson.pierre

derson.pierre

erson.pierre

rson.pierre

son.pierre

on.pierre

n.pierre

.pierre

pierre

ierre

erre

rre

Hmm.. could this work? Let’s test it. I created the following table and trigger:

CREATE TABLE `email_tib` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `email_id` int(10) unsigned NOT NULL,
  `email_parts` varchar(120) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',
  PRIMARY KEY (`id`),
  KEY `idx_email_id` (`email_id`),
  KEY `idx_trigram` (`email_parts`)
) ENGINE=InnoDB AUTO_INCREMENT=2749001 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

create trigger insert_tib after insert ON test.email FOR EACH ROW 
BEGIN

DECLARE email_length int;
declare x int ;
declare i int ;
SELECT CHAR_LENGTH(SUBSTRING_INDEX(email,'@',1)) into email_length from test.email where email=NEW.email and id=NEW.id;

set i=-3;
	while email_length >= 3 do

	INSERT INTO email_tib (email_id,email_parts) values (NEW.id,substring((SUBSTRING_INDEX(NEW.email,'@',1)),i));
	set email_length=email_length-1;
	set i=i-1;
	end while ;

END

CREATE TABLE `email_tib` (

`id` int(10) unsigned NOT NULL AUTO_INCREMENT,

`email_id` int(10) unsigned NOT NULL,

`email_parts` varchar(120) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',

PRIMARY KEY (`id`),

KEY `idx_email_id` (`email_id`),

KEY `idx_trigram` (`email_parts`)

) ENGINE=InnoDB AUTO_INCREMENT=2749001 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

create trigger insert_tib after insert ON test.email FOR EACH ROW

BEGIN

DECLARE email_length int;

declare x int ;

declare i int ;

SELECT CHAR_LENGTH(SUBSTRING_INDEX(email,'@',1)) into email_length from test.email where email=NEW.email and id=NEW.id;

set i=-3;

while email_length >= 3 do

INSERT INTO email_tib (email_id,email_parts) values (NEW.id,substring((SUBSTRING_INDEX(NEW.email,'@',1)),i));

set email_length=email_length-1;

set i=i-1;

end while ;

END

Let’s find the email addresses that contain n.pierre:

mysql> SELECT e.email
FROM email AS e
RIGHT JOIN email_tib AS t ON t.email_id=e.id
WHERE t.email_parts LIKE "n.pierre%";
+-------------------------------+
| email                         |
+-------------------------------+
| anderson.pierre@example.org   |
| friesen.pierre@example.net    |
| green.pierre@example.net      |
| hudson.pierre@example.com     |
| johnson.pierre@example.org    |
| larkin.pierre@example.net     |
| monahan.pierre@example.net    |
| olson.pierre@example.org      |
| rohan.pierre@example.com      |
| wilkinson.pierre@example.org  |
| williamson.pierre@example.net |
+-------------------------------+
11 rows in set (0.00 sec)

mysql> show session status like '%handler%';
+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| Handler_commit             | 1     |
| Handler_delete             | 0     |
| Handler_discover           | 0     |
| Handler_external_lock      | 4     |
| Handler_mrr_init           | 0     |
| Handler_prepare            | 0     |
| Handler_read_first         | 0     |
| Handler_read_key           | 12    |
| Handler_read_last          | 0     |
| Handler_read_next          | 11    |
| Handler_read_prev          | 0     |
| Handler_read_rnd           | 0     |
| Handler_read_rnd_next      | 0     |
| Handler_rollback           | 0     |
| Handler_savepoint          | 0     |
| Handler_savepoint_rollback | 0     |
| Handler_update             | 0     |
| Handler_write              | 0     |
+----------------------------+-------+
18 rows in set (0.00 sec)

mysql> SELECT e.email

FROM email AS e

RIGHT JOIN email_tib AS t ON t.email_id=e.id

WHERE t.email_parts LIKE "n.pierre%";

+-------------------------------+

| email |

+-------------------------------+

| anderson.pierre@example.org |

| friesen.pierre@example.net |

| green.pierre@example.net |

| hudson.pierre@example.com |

| johnson.pierre@example.org |

| larkin.pierre@example.net |

| monahan.pierre@example.net |

| olson.pierre@example.org |

| rohan.pierre@example.com |

| wilkinson.pierre@example.org |

| williamson.pierre@example.net |

+-------------------------------+

11 rows in set (0.00 sec)

mysql> show session status like '%handler%';

+----------------------------+-------+

| Variable_name | Value |

+----------------------------+-------+

| Handler_commit | 1 |

| Handler_delete | 0 |

| Handler_discover | 0 |

| Handler_external_lock | 4 |

| Handler_mrr_init | 0 |

| Handler_prepare | 0 |

| Handler_read_first | 0 |

| Handler_read_key | 12 |

| Handler_read_last | 0 |

| Handler_read_next | 11 |

| Handler_read_prev | 0 |

| Handler_read_rnd | 0 |

| Handler_read_rnd_next | 0 |

| Handler_rollback | 0 |

| Handler_savepoint | 0 |

| Handler_savepoint_rollback | 0 |

| Handler_update | 0 |

| Handler_write | 0 |

+----------------------------+-------+

18 rows in set (0.00 sec)

Wow, that is much better than the previous one! It is more than 100 times faster! Now you can have a beer because you deserve it. 🙂

Selectivity

mysql> select email_parts,count(*) as c 
from email_tib group by email_parts 
order by c desc limit 10;
+-------------+------+
| email_parts | c    |
+-------------+------+
| son         | 6228 |
| ler         | 3863 |
| ner         | 3635 |
| ann         | 3590 |
| mann        | 3464 |
| man         | 3387 |
| ski         | 3331 |
| ton         | 2659 |
| ell         | 2482 |
| ger         | 2437 |
+-------------+------+
10 rows in set (1.61 sec)

mysql> select email_parts,count(*) as c

from email_tib group by email_parts

order by c desc limit 10;

+-------------+------+

| email_parts | c |

+-------------+------+

| son | 6228 |

| ler | 3863 |

| ner | 3635 |

| ann | 3590 |

| mann | 3464 |

| man | 3387 |

| ski | 3331 |

| ton | 2659 |

| ell | 2482 |

| ger | 2437 |

+-------------+------+

10 rows in set (1.61 sec)

There are parts that result in many readings as well, but it helps a lot now that we are using a longer pattern:

mysql> select email_parts,count(*) as c 
from email_tib where length(email_parts) > 5 
group by email_parts order by c desc limit 10;
+-------------+-----+
| email_parts | c   |
+-------------+-----+
| reilly      | 704 |
| walter      | 684 |
| kowski      | 676 |
| onnelly     | 662 |
| nnelly      | 662 |
| kinson      | 641 |
| tenberg     | 626 |
| enberg      | 626 |
| hermann     | 417 |
| ermann      | 417 |
+-------------+-----+
10 rows in set (1.59 sec)

mysql> select email_parts,count(*) as c

from email_tib where length(email_parts) > 5

group by email_parts order by c desc limit 10;

+-------------+-----+

| email_parts | c |

+-------------+-----+

| reilly | 704 |

| walter | 684 |

| kowski | 676 |

| onnelly | 662 |

| nnelly | 662 |

| kinson | 641 |

| tenberg | 626 |

| enberg | 626 |

| hermann | 417 |

| ermann | 417 |

+-------------+-----+

10 rows in set (1.59 sec)

Using more than six characters gives us a much better selectivity.

Table statistics

select count(*) from email;
+----------+
| count(*) |
+----------+
|   318458 |
+----------+
1 row in set (0.04 sec)

select count(*) from email_trigram;
+----------+
| count(*) |
+----------+
|  2749000 |
+----------+
1 row in set (0.44 sec)

select count(*) from email_tib;
+----------+
| count(*) |
+----------+
|  2749000 |
+----------+
1 row in set (0.45 sec)

select count(*) from email;

+----------+

| count(*) |

+----------+

| 318458 |

+----------+

1 row in set (0.04 sec)

select count(*) from email_trigram;

+----------+

| count(*) |

+----------+

| 2749000 |

+----------+

1 row in set (0.44 sec)

select count(*) from email_tib;

+----------+

| count(*) |

+----------+

| 2749000 |

+----------+

1 row in set (0.45 sec)

In this test, I used 318458 random email addresses, and both methods created 2749000 additional rows.

Size on the disk:

45M	email.ibd
221M	email_tib.ibd
189M	email_trigram.ibd

45M email.ibd

221M email_tib.ibd

189M email_trigram.ibd

As we expected they will use more space than the original table.

Cons

- Both solutions require an extra table

- That table contains millions of short rows, and it could use a few gigs of space

- Requires three triggers (insert, update and delete, which could affect the write performance on the table) or the application has to keep that table up-to-date

Pros

- Finding an email address is going to be much faster and require fewer reads.

- Users will be much more satisfied.

Conclusion

If there is no built in solution or index in MySQL that can help or solve your problem, do not give up. Many times, with small modifications, you can create your own index table or use other tricks.

In this specific case, if you are willing to sacrifice some additional disk space you can speed up your queries a lot with the right method. Trigram was not the best option, but I can see use cases where it could be better.

This is just one possible solution, there could be an even better one. If you know one, please let me know.

0 0 votes

Article Rating

12 Comments

Oldest

Newest Most Voted

Inline Feedbacks

View all comments

Scott Klasing

8 years ago

With triggers you wont be able to alter the table live using pt-online-schema-change and most alters can not be done ALTER ONLINE due to the need to copy data.

In super high volume shops the triggers would cause performance issues, and due to the need to alter these tables while business is running we disallow them.

Even with no triggers, once PTOSC is running , it adds its own triggers and the resultant trigger REPLACE statements also often cause our “super high volume” databases to jam, ie, lock up where threads go through the roof.

Author

Tibor Korocz

8 years ago

Reply to Scott Klasing

Hi Scott,

Yes, I also highlighted in the blog post triggers are required and that could affect the performance:

Requires three triggers (insert, update and delete, which could affect the write performance on the table) or the application has to keep that table up-to-date

But you are not right with pt-online-schema-change, with MySQL 5.7 it can work even if there are other triggers on the table: https://www.percona.com/doc/percona-toolkit/3.0/pt-online-schema-change.html#cmdoption-pt-online-schema-change-preserve-triggers

Of course if you have a “super high volume” workload you might need some other solution, if you have a workload where you can not even run pt-online-schema-change you might can use gh-ost (https://github.com/github/gh-ost).

Jannes

8 years ago

I’ve been doing exactly this for the last few years to get fast search over many tens of millions of rows, over many fields. Few extra tricks:

1) You can limit the length of your index to for example 6 characters if you alter your search query like so:

SELECT e.email
FROM email AS e
RIGHT JOIN email_tib AS t ON t.email_id=e.id
WHERE t.email_parts LIKE “n.pier%” — Note that this patter has been cut to max 6 chars as well
AND e.email LIKE ‘%n.pierre%’; — Note full user specified pattern here

So you do the fast match on only a few characters and then double check the original field against the original pattern to get rid of false positives. This cuts back immensely on the size of your index table, without significantly affecting the selectivity (if not too short). Fixed width char fields also save a bit of overhead and are slightly faster.

2) The above also allows you to get rid of the delete trigger and speed up the update trigger. A few stale rows in the index table are no problem, you can just garbage collect them later.

3) Suppose you have more fields than just an email address, like firstname, lastname etc. then you can throw all those permutations into the same 6 char index table and use a query like this:

SELECT e.email, e.firstname, e.lastname
FROM email AS e
RIGHT JOIN email_tib AS t ON t.email_id=e.id
WHERE t.email_parts LIKE “pierre%”
AND (e.email LIKE ‘%pierre%’ OR e.firstname LIKE ‘%pierre%’ OR e.lastname LIKE ‘%pierre%’);

4) Bonus points for avoiding duplicates in the index table pointing to the same record: covering UNIQUE key and use INSERT IGNORE.

5) Realistically, if the email is “anderson.pierre”, who is ever going to do a search for ‘%n.pierre%’ ? You could limit your index to only contain the first 6 characters after each word boundary. So only do ‘anders’, ‘.pierr’, ‘pierre’. Other domain specific knowledge might lead you to exclude ‘.com’ and ‘.net’ or even whole domains ‘gmail.com’ from your index. They’d give so many matches anyway that it’s pointless; might as well not include them at all (in the index table).

6) You could allow the user to enter 2 search patterns, then simply match both from the index (t.email_parts LIKE “pierre%” OR t.email_parts LIKE “exampl%”) and then again match both against all actual fields. Even if you cut out domain names from the index, it will still only return [email protected] and not [email protected].

7) I don’t actually know if it’s faster or not, but there is a way to make one multirow insert in the trigger instead of looping multiple single row inserts.

Hope this helps someone.

Author

Tibor Korocz

8 years ago

Reply to Jannes

Hi Jannes,

These are very good comments.
As you can see in my triggers I am not using the domain part at all because that would give us much more matches.

I was also thinking to remove the duplicates completely. Every part would be have an ID like anders has id 5 and .pierr has ID 102 etc..

So a table like this:
id,part_name
1,mike
2,john
…
5,andres
6,andre
..
102,.pierr
…

And I I would have an other table , a kind of connection table which would contain all the email address IDs and the these parts IDs, so I could easily select which email address has these parts..

That table would look like something like this:

email_id, parts_id
1,5
1,8
1,99
2,101
2,5
2,506
…

I did not test this yet.
But this could save us a lot of space and it would might be faster as well.. I am planning to test it but first I wanted to see if there are any comments or other IDs as well.

Thank you for your comments.

Jannes

8 years ago

Thanks. I noticed you already did some of the things I mentioned. Just wanted to point them out again, I guess.

I’d be happy to be proven wrong, but I doubt your extra table is going to help you. I vaguely remember exploring that direction a little, but giving up. Size wise 6 characters is basically the same size as an integer. So all the extra integers referencing each other will only make the total (much) larger and require more random access lookups when joining. I’m even using only 4 in my application, which in my case is still fine for selectivity. In my experience, even if the initial index matches way too many rows, the other AND clause will quickly filter the false positives out.

In my case I’m also adding partitions and a Priority field, for roughly ordering the results by relevance as well as excluding some partitions in some cases.

CREATE TABLE textindex (
ID int(10) unsigned NOT NULL,
Subtext char(4) NOT NULL COMMENT ‘Note: case insensitive!’,
Prio tinyint(3) unsigned NOT NULL,
PRIMARY KEY (Subtext,ID,Prio) # Covering index, so no lookup for actual data required
) ENGINE=InnoDB COMMENT=’Contains all text fragments from different fields and tables’
PARTITION BY LIST (Prio)
(PARTITION p05 VALUES IN (5) COMMENT = ‘One letter words’,
PARTITION p06 VALUES IN (6) COMMENT = ‘Two letter words’,
PARTITION p07 VALUES IN (7) COMMENT = ‘Recently used items (of all types) (needs a scheduled update)’,
PARTITION p10 VALUES IN (10) COMMENT = ‘Standard prio’,
PARTITION p90 VALUES IN (90) COMMENT = ‘Old expired stuff’
);

I have something like 8 fields where I pull subtexts from. In my case a subtext always starts at a word boundary in the original string. So ‘bla123bla.test’ becomes: { ‘bla1’, ‘123b’, ‘bla.’, ‘.tes’, ‘test’ }. For wildcard matching I require the user to type at least 3 characters. If he types a 1 or 2 character word, I make sure it’s an exact match in only the corresponding partition (excluding others for speedup). It will depend on your application whether this makes sense for you.

Another small set of duplicates you can get rid of (at least in my case) is where one Subtext is the prefix of another. If you have ‘bla’ and ‘blah’ you can get rid of the first one. Probably not very effective for a list of only email addresses, but a nice boost when you’re getting texts from various, partially similar fields.

Another idea I’m playing with which might actually be useful in your case (but harder for me) is making things work even if the user searches for ‘npierre’ (without dot). Besides your normal inserts, similarly also insert all Subtexts of the string with all non alphanumeric characters filtered out. Then use a search query like this:

SELECT DISTINCT e.id, e.email, e.firstname, e.lastname
FROM email AS e
RIGHT JOIN email_tib AS t ON t.email_id=e.id
WHERE t.email_parts LIKE “npierr%”
AND (e.email LIKE ‘%npierre%’ OR e.firstname LIKE ‘%npierre%’ OR e.lastname LIKE ‘%npierre%’
OR REPLACE(e.email,’.’,”) LIKE ‘%npierre%’ OR REPLACE(e.firstname,’.’,”) LIKE ‘%npierre%’ OR REPLACE(e.lastname,’.’,”) LIKE ‘%npierre%’
);

Note that I’m only filtering the dot here, actually filtering all alphanumeric characters efficiently is surprisingly hard in MySQL. Anyway, I found the idea of simply stuffing imaginary data (even typos) into your index table can actually be useful, a bit funny.

Speaking of hard, here’s how I decide where word boundaries are (letting pos go from 1 to the length of the string) :

AND (pos = 1
OR (LOCATE(LOWER(SUBSTRING(i.Description, pos-1, 1)), ‘abcdefghijklmnopqrstuvwxyz012345678901234567890123456 ‘) + 25) DIV 26
(LOCATE(LOWER(SUBSTRING(i.Description, pos , 1)), ‘abcdefghijklmnopqrstuvwxyz012345678901234567890123456 ‘) + 25) DIV 26
)

Quite horrible and unfortunately not very friendly to more exotic languages.

One final little trick I do, as I have company names in one of the fields, is removing words like ‘Ltd.’ and ‘Inc.’. ‘Incorporated’ and other words that are too common. This is sort of the reverse of the previous trick: removing some useless things from the index. If the user types ‘google inc’ it will find google quickly through the index without wasting time on a million false matches had ‘inc’ been in there. It will still require matches on both ‘%google%’ and ‘%inc%’ in the final pattern.

One small correction in item 3 of my previous post: that query needs a DISTINCT, as it can actually get multiple matches from the index pointing to the same email row.

Interesting discussion, thanks.

Jannes

8 years ago

Re-reading your initial trigram query, can you try the following on your dataset?

SELECT DISTINCT e.id, e.email
FROM email_trigram AS tr
INNER JOIN email AS e ON e.id=tr.email_id
WHERE tr.trigram = “n.p” — only first 3 chars of search term
AND e.email LIKE ‘%n.pierr%” — whole search term
;

Avoiding GROUP BY has the added benefit of allowing you to use a LIMIT (that fills quickly) and then telling the user to improve her search terms.

Another fun one I just came up with while typing this:

SELECT e.email
FROM email AS e
INNER JOIN email_trigram AS tr1 ON tr1.email_id=e.id AND tr1.trigram=”n.p”
INNER JOIN email_trigram AS tr2 ON tr2.email_id=e.id AND tr2.trigram=”.pi”
INNER JOIN email_trigram AS tr3 ON tr3.email_id=e.id AND tr3.trigram=”pie”
INNER JOIN email_trigram AS tr4 ON tr4.email_id=e.id AND tr4.trigram=”ier”
INNER JOIN email_trigram AS tr5 ON tr5.email_id=e.id AND tr5.trigram=”err”
INNER JOIN email_trigram AS tr6 ON tr6.email_id=e.id AND tr6.trigram=”rre”
;

For this to be fast you should probably do:

ADD UNIQUE KEY idx_trigram_email_id (trigram, email_id)

(or probably better: drop the id field and make the above the PRIMARY KEY)

If that works even remotely decently, you might want to consider also trying 4 and 5 character quadgrams (?) and pentagrams (?)

I haven’t tested that one at all, but I wouldn’t be surprised if that actually has very good performance. I might need to experiment with that for my own data set. Please let me know what you find.

Further idea: combine the two above queries only doing 2 or 3 INNER JOINs (the first and the last for best selectivity) and filtering the rest like in the first query. (no DISTINCT required if you have the UNIQUE KEY)

Masood Alam

8 years ago

This particular email scenario would make more sense in Redis to me instead of maintaining triggers and/or application work around.

Editor

Alexey Kopytov

8 years ago

Tibor,

Actually, there is a built-in solution in MySQL to do things like that, and it’s called the ngram plugin: https://dev.mysql.com/doc/refman/5.7/en/fulltext-search-ngram.html

It is, in my opinion, one of the least known and most underrated features in MySQL. Probably because the fine manual leaves the impression that ngram is only applicable to the CJK languages. It is not, you should certainly give it a try.

Tibor Korocz

8 years ago

Reply to Alexey Kopytov

Hi,

I already tested and wrote a blog post about that, it is in the queue, coming soon 😉

Bill Karwin

8 years ago

I did a presentation some years ago comparing fulltext search methods. I tried trigrams, but in my results, it was about 100 times slower than using a real FULLTEXT index, and about 1000 times slower than Sphinx Search.

Trigrams are better than a table-scan of course, but I would not recommend trigrams.

https://www.slideshare.net/billkarwin/practical-full-text-search-with-my-sql

Author

Tibor Korocz

8 years ago

Reply to Bill Karwin

Hi,

Yes, you are right Sphinx and similar solutions are going to be faster than trigrams for sure. No doubts there. 🙂

This solution can work in scenarios where you do not want to add another layer another complexity to you infrastructure or you have limited resources etc.. Honestly there are many businesses with only 1-2 MySQL servers.. But they still want reasonable response time, in those cases it might can be useful.

But yes I would not recommended it in every scenario as well, but I still can see some cases where it can work.