Understanding Hash Joins in MySQL 8

hash joins mysqlIn MySQL 8.0.18 there is a new feature called Hash Joins, and I wanted to see how it works and in which situations it can help us. Here you can find a nice detailed explanation about how it works under the hood.

The high-level basics are the following: if there is a join, it will create an in-memory hash table based on one of the tables and will read the other table row by row, calculate a hash, and do a lookup on the in-memory hash table.

Great, but does this give us any performance benefits?

First of all, this only works on fields that are not indexed, so that is an immediate table scan and we usually do not recommend doing joins without indexes because it is slow. Here is where Hash Joins in MySQL can help because it will use an in-memory hash table instead of Nested Loop.

Let’s do some tests and see. First I created the following tables:

I have inserted 131072 random rows into both tables.

First test – Hash Joins

Run a join based on c2 which is not indexed.

We have to use explain format=tree to see if Hash Join will be used or not, as normal explain still says it is going to be a Nested Loop, which I think it is very misleading. I have already filed a bug report because of this and in the ticket, you can see some comments from developers saying:

The solution is to stop using traditional EXPLAIN (it will eventually go away).

So this is not going to be fixed in traditional explain and we should start using the new way.

Back to the query; we can see it is going to use Hash Join for this query, but how fast is it?

0.73s for a more than 17m rows join table. Looks promising.

Second Test – Non-Hash Joins

We can disable it with an optimizer switch or optimizer hint.

Now the same query takes more than 13 minutes. That is a huge difference and we can see Hash Join helps a lot here.

Third Test – Joins Based on Indexes

Let’s create indexes and see how fast a join based on indexes is.

2.6s  Hash Join is even faster than the Index-based join in this case.

However, I was able to force the optimizer to use Hash Joins even if an index is available by using ignore index:

I still think it would be nice if I can tell the optimizer with a hint to use Hash Joins even if an index is available, so we do not have to ignore indexes on all the tables. I have created a feature request for this.

However, if you read my first bug report carefully you can see a comment from a MySQL developer which indicates this might not be necessary:

BNL (Block Nested-Loop) will also go away entirely at some point, at which point this hint will be ignored.

That could mean they are planning to remove BNL joins in the future and maybe replace it with Hash join.

Limitations

We can see Hash Join can be powerful, but there are limitations:

  • As I mentioned it only works on columns that do not have indexes (or you have to ignore them).
  • It only works with equi-join conditions.
  • It does not work with LEFT or RIGHT JOIN.

I would like to see a status metric as well to monitor how many times Hash Join was used, and for this, I filled another feature request.

Conclusion

Hash Join seems a very powerful new join option, and we should keep an eye on this because I would not be surprised if we get some other features in the future as well. In theory, it would be able to do Left and Right joins as well and as we can see in the comments on the bug report that Oracle has plans for it in the future.

Share this post

Comments (7)

  • Jacob Jo Reply

    Hi,
    First of all, thanks for this post.
    One quick question, Is there any distinctive different points with Hash Join of MySQL 8.0.18 in the perspective of performance or hash join algorithm?
    Especially when Comparing MySQL’s to MariaDB’s ?
    Thanks in advance.

    https://www.percona.com/blog/2012/05/31/a-case-for-mariadbs-hash-joins/

    October 31, 2019 at 5:11 am
    • Tibor Korocz Reply

      Hi Jacob,

      There are some differences between MariaDB and Oracle’s solution but I did not do a performance benchmark, however I am planning to do one soon. 🙂

      October 31, 2019 at 10:25 am
      • Jacob Jo Reply

        Thanks, Looking forward to seeing the performance benchmark 🙂

        October 31, 2019 at 9:37 pm
    • Erik Frøseth Reply

      Hello Jacob,

      One of the main differences between hash join in MariaDB and MySQL is that MySQL will spill to disk if the input becomes too large to fit in memory. This is usually a lot more efficient than doing a re-fill of the hash table and scanning the probe input multiple times (which basically makes hash join become very much like block-nested loop).

      October 31, 2019 at 10:59 am
      • Jacob Jo Reply

        Hi Erik,
        Thanks for kind and detailed explain. It spills disk when the probe input is too large.

        October 31, 2019 at 9:43 pm
        • Erik Frøseth Reply

          In fact, it spills to disk when the _build_ input is too large. It is the build input we are trying to squeeze into memory.

          November 1, 2019 at 3:52 am
          • Jacob

            Thanks for clarification 🙂

            November 1, 2019 at 6:40 am

Leave a Reply