Where the open source community meets: Secure your spot for Percona Live Amsterdam! - Register

Downloads

Blog

Understanding Hash Joins in MySQL 8

October 30, 2019

Author

Tibor Korocz

MySQL

Percona Software

Share this Post:

hash joins mysql In MySQL 8.0.18 there is a new feature called Hash Joins, and I wanted to see how it works and in which situations it can help us. Here you can find a nice detailed explanation about how it works under the hood.

The high-level basics are the following: if there is a join, it will create an in-memory hash table based on one of the tables and will read the other table row by row, calculate a hash, and do a lookup on the in-memory hash table.

Great, but does this give us any performance benefits?

First of all, this only works on fields that are not indexed, so that is an immediate table scan and we usually do not recommend doing joins without indexes because it is slow. Here is where Hash Joins in MySQL can help because it will use an in-memory hash table instead of Nested Loop.

Let’s do some tests and see. First I created the following tables:

CREATE TABLE `t1` (
`id` int(11) NOT NULL AUTO_INCREMENT ,
`c1` int(11) NOT NULL DEFAULT '0',
`c2` int(11) NOT NULL DEFAULT '0',
PRIMARY KEY (`id`),
KEY `idx_c1` (`c1`)
) ENGINE=InnoDB;

CREATE TABLE `t2` (
`id` int(11) NOT NULL AUTO_INCREMENT ,
`c1` int(11) NOT NULL DEFAULT '0',
`c2` int(11) NOT NULL DEFAULT '0',
PRIMARY KEY (`id`),
KEY `idx_c1` (`c1`)
) ENGINE=InnoDB;

CREATE TABLE `t1` (

`id` int(11) NOT NULL AUTO_INCREMENT ,

`c1` int(11) NOT NULL DEFAULT '0',

`c2` int(11) NOT NULL DEFAULT '0',

PRIMARY KEY (`id`),

KEY `idx_c1` (`c1`)

) ENGINE=InnoDB;

CREATE TABLE `t2` (

`id` int(11) NOT NULL AUTO_INCREMENT ,

`c1` int(11) NOT NULL DEFAULT '0',

`c2` int(11) NOT NULL DEFAULT '0',

PRIMARY KEY (`id`),

KEY `idx_c1` (`c1`)

) ENGINE=InnoDB;

I have inserted 131072 random rows into both tables.

mysql> select count(*) from t1;
+----------+
| count(*) |
+----------+
| 131072   |
+----------+

mysql> select count(*) from t1;

+----------+

| count(*) |

+----------+

| 131072 |

+----------+

First test – Hash Joins

Run a join based on c2 which is not indexed.

mysql> explain format=tree select count(*) from t1 join t2 on t1.c2 = t2.c2G
*************************** 1. row ***************************
EXPLAIN: -> Aggregate: count(0)
-> Inner hash join (t2.c2 = t1.c2) (cost=1728502115.04 rows=1728488704)
-> Table scan on t2 (cost=0.01 rows=131472)
-> Hash
-> Table scan on t1 (cost=13219.45 rows=131472)

1 row in set (0.00 sec)

mysql> explain format=tree select count(*) from t1 join t2 on t1.c2 = t2.c2G

*************************** 1. row ***************************

EXPLAIN: -> Aggregate: count(0)

-> Inner hash join (t2.c2 = t1.c2) (cost=1728502115.04 rows=1728488704)

-> Table scan on t2 (cost=0.01 rows=131472)

-> Hash

-> Table scan on t1 (cost=13219.45 rows=131472)

1 row in set (0.00 sec)

We have to use explain format=tree to see if Hash Join will be used or not, as normal explain still says it is going to be a Nested Loop, which I think it is very misleading. I have already filed a bug report because of this and in the ticket, you can see some comments from developers saying:

The solution is to stop using traditional EXPLAIN (it will eventually go away).

So this is not going to be fixed in traditional explain and we should start using the new way.

Back to the query; we can see it is going to use Hash Join for this query, but how fast is it?

mysql> select count(*) from t1 join t2 on t1.c2 = t2.c2;
+----------+
| count(*) |
+----------+
| 17172231 |
+----------+
1 row in set (0.73 sec)

mysql> select count(*) from t1 join t2 on t1.c2 = t2.c2;

+----------+

| count(*) |

+----------+

| 17172231 |

+----------+

1 row in set (0.73 sec)

0.73s for a more than 17m rows join table. Looks promising.

Second Test – Non-Hash Joins

We can disable it with an optimizer switch or optimizer hint.

mysql> select /*+ NO_HASH_JOIN (t1,t2) */ count(*) from t1 join t2 on t1.c2 = t2.c2;
+----------+
| count(*) |
+----------+
| 17172231 |
+----------+
1 row in set (13 min 36.36 sec)

mysql> select /*+ NO_HASH_JOIN (t1,t2) */ count(*) from t1 join t2 on t1.c2 = t2.c2;

+----------+

| count(*) |

+----------+

| 17172231 |

+----------+

1 row in set (13 min 36.36 sec)

Now the same query takes more than 13 minutes. That is a huge difference and we can see Hash Join helps a lot here.

Third Test – Joins Based on Indexes

Let’s create indexes and see how fast a join based on indexes is.

create index idx_c2 on t1(c2);
create index idx_c2 on t2(c2);

mysql> select count(*) from t1 join t2 on t1.c2 = t2.c2;
+----------+
| count(*) |
+----------+
| 17172231 |
+----------+
1 row in set (2.63 sec)

create index idx_c2 on t1(c2);

create index idx_c2 on t2(c2);

mysql> select count(*) from t1 join t2 on t1.c2 = t2.c2;

+----------+

| count(*) |

+----------+

| 17172231 |

+----------+

1 row in set (2.63 sec)

2.6s Hash Join is even faster than the Index-based join in this case.

However, I was able to force the optimizer to use Hash Joins even if an index is available by using ignore index:

mysql> explain format=tree select count(*) from t1 ignore index (idx_c2) join t2 ignore index (idx_c2) on t1.c2 = t2.c2 where t1.c2=t2.c2G
*************************** 1. row ***************************
EXPLAIN: -> Aggregate: count(0)
-> Inner hash join (t2.c2 = t1.c2) (cost=1728502115.04 rows=17336898)
-> Table scan on t2 (cost=0.00 rows=131472)
-> Hash
-> Table scan on t1 (cost=13219.45 rows=131472)

1 row in set (0.00 sec)

mysql> explain format=tree select count(*) from t1 ignore index (idx_c2) join t2 ignore index (idx_c2) on t1.c2 = t2.c2 where t1.c2=t2.c2G

*************************** 1. row ***************************

EXPLAIN: -> Aggregate: count(0)

-> Inner hash join (t2.c2 = t1.c2) (cost=1728502115.04 rows=17336898)

-> Table scan on t2 (cost=0.00 rows=131472)

-> Hash

-> Table scan on t1 (cost=13219.45 rows=131472)

1 row in set (0.00 sec)

I still think it would be nice if I can tell the optimizer with a hint to use Hash Joins even if an index is available, so we do not have to ignore indexes on all the tables. I have created a feature request for this.

However, if you read my first bug report carefully you can see a comment from a MySQL developer which indicates this might not be necessary:

BNL (Block Nested-Loop) will also go away entirely at some point, at which point this hint will be ignored.

That could mean they are planning to remove BNL joins in the future and maybe replace it with Hash join.

Limitations

We can see Hash Join can be powerful, but there are limitations:

- As I mentioned it only works on columns that do not have indexes (or you have to ignore them).

- It only works with equi-join conditions.

- It does not work with LEFT or RIGHT JOIN.

I would like to see a status metric as well to monitor how many times Hash Join was used, and for this, I filled another feature request.

Conclusion

Hash Join seems a very powerful new join option, and we should keep an eye on this because I would not be surprised if we get some other features in the future as well. In theory, it would be able to do Left and Right joins as well and as we can see in the comments on the bug report that Oracle has plans for it in the future.

0 0 votes

Article Rating

8 Comments

Oldest

Newest Most Voted

Jacob Jo

6 years ago

Hi,
First of all, thanks for this post.
One quick question, Is there any distinctive different points with Hash Join of MySQL 8.0.18 in the perspective of performance or hash join algorithm?
Especially when Comparing MySQL’s to MariaDB’s ?
Thanks in advance.

https://www.percona.com/blog/2012/05/31/a-case-for-mariadbs-hash-joins/

Author

Tibor Korocz

6 years ago

Reply to Jacob Jo

Hi Jacob,

There are some differences between MariaDB and Oracle’s solution but I did not do a performance benchmark, however I am planning to do one soon. 🙂

Erik Frøseth

6 years ago

Reply to Jacob Jo

Hello Jacob,

One of the main differences between hash join in MariaDB and MySQL is that MySQL will spill to disk if the input becomes too large to fit in memory. This is usually a lot more efficient than doing a re-fill of the hash table and scanning the probe input multiple times (which basically makes hash join become very much like block-nested loop).