November 26, 2014

A case for MariaDB’s Hash Joins

MariaDB 5.3/5.5 introduced a new join type, “Hash Join”, which is an implementation of the classic block-based hash join algorithm. In this post we will see what a hash join is, how it works, and for what types of queries it is the right choice. I will show the results of benchmarks for different queries and explain them, so that you have a better understanding of when using a hash join is best and when it is not. Although hash joins have been available since MariaDB 5.3, I will be running my benchmarks on the newer MariaDB 5.5.

Overview

Hash join is a new algorithm introduced in MariaDB 5.3/5.5 that can be used for joining tables that have equijoin conditions of the form tbl1.col1 = tbl2.col1, etc. As mentioned above, what is actually implemented is the classic hash join, but it is known as Block Nested Loop Hash (BNLH) join in MariaDB.
The classic hash join algorithm consists of two phases, a build phase and a probe phase. Consider the case of joining two tables on an equijoin condition. The first step is to designate the smaller of the two tables as the left operand and the bigger one as the right operand. When the algorithm begins, the first phase is the build phase, in which a hash table is created over the join attributes and rows of the left operand. Next comes the probe phase, in which the matching rows from the right operand are found: the right operand is scanned, and for each row scanned a lookup is performed in the hash table using the values of the columns participating in the equijoin condition. The hash table is accessed by applying a hash function to the values of the join condition, and hence lookups are quite efficient.

But what about the restriction on the size of the hash table? Its size is limited by the value of join_buffer_size, so if the left operand is so big that a hash table built on it would exceed join_buffer_size, multiple hash tables are created. For example, if the left operand has “n” rows and its size is three times the value of join_buffer_size, then 3 hash tables are created, each covering n/3 rows. Both the build and probe phases are then done three times, which means the right operand is scanned thrice.

Wikipedia has a nicely simplified version of the Classic Hash Join algorithm (http://en.wikipedia.org/wiki/Hash_join#Classic_hash_join) which I will quote below for better understanding:

  1. For each tuple r in the build input R
    1. Add to the in-memory hash table
    2. If the size of the hash table equals the maximum in-memory size:
      1. Scan the probe input S, and add matching join tuples to the output relation
      2. Reset the hash table
  2. Do a final scan of the probe input S and add the resulting join tuples to the output relation
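To make this concrete, here is a minimal Python sketch of the pseudocode above. This is my own illustration, not MariaDB's actual implementation: rows are tuples, the join keys are extracted by caller-supplied functions, and `max_buffer_rows` stands in for join_buffer_size.

```python
def classic_hash_join(build_rows, probe_rows, build_key, probe_key,
                      max_buffer_rows):
    """Simplified classic (block-based) hash join.

    build_rows:  the smaller (left) operand
    probe_rows:  the larger (right) operand
    build_key / probe_key: functions extracting the join column(s)
    max_buffer_rows: stand-in for join_buffer_size; when the hash table
    reaches this size, the probe input is scanned and the table reset,
    so a large build input causes multiple scans of the right operand.
    """
    output = []
    hash_table = {}

    def probe():
        # Scan the probe input, emitting matches against the current table.
        for s in probe_rows:
            for r in hash_table.get(probe_key(s), []):
                output.append(r + s)

    n_buffered = 0
    for r in build_rows:
        hash_table.setdefault(build_key(r), []).append(r)
        n_buffered += 1
        if n_buffered >= max_buffer_rows:
            probe()               # one full scan of the right operand
            hash_table.clear()
            n_buffered = 0

    probe()                       # final scan for the remaining build rows
    return output


# Hypothetical mini-tables: suppliers are (suppkey, name),
# lineitems are (orderkey, suppkey).
suppliers = [(1, "Acme"), (2, "Globex")]
lineitems = [(10, 1), (11, 1), (12, 2)]
rows = classic_hash_join(suppliers, lineitems,
                         build_key=lambda r: r[0],   # supplier key
                         probe_key=lambda s: s[1],   # lineitem's supplier key
                         max_buffer_rows=10)
# rows holds the 3 joined tuples
```

Shrinking `max_buffer_rows` to 1 produces the same result but forces one probe scan per build row, which models the multiple right-operand scans described above.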

Now, after the explanation of the hash join, let's see how it performs for different test cases.

Benchmarks

For the purpose of the benchmarks I used the DBT3 dataset at scale factor 2, which means the total dataset size is 4.8G. Here is the breakdown of dataset size for the tables used in the benchmarks:

Table ‘lineitem’: 3.8G
Table ‘supplier’: 11M
Table ‘orders’: 468M

I have benchmarked two different kinds of workloads, IO bound and in-memory. The IO bound benchmark was performed with a buffer pool size of 1G, while the in-memory benchmark was performed with a buffer pool size of 6G. The benchmarks compare the Block Nested Loop (BNL) join of MySQL 5.5.24, the Batched Key Access (BKA) join of MySQL 5.6.5 and the Block Nested Loop Hash (BNLH) join of MariaDB 5.5.20. The configurations used with the three variants of MySQL are listed below.

Configuration

Let’s first take a look at the configuration used with different MySQL flavors.

MySQL 5.5.24 Configuration
innodb_file_per_table=1
innodb_file_format=barracuda
innodb_log_file_size=512M
innodb_log_files_in_group=2
innodb_flush_log_at_trx_commit=2
innodb_flush_method=O_DIRECT

query_cache_size=0
query_cache_type=0

MySQL 5.6.5 Configuration
innodb_file_per_table=1
innodb_file_format=barracuda
innodb_log_file_size=512M
innodb_log_files_in_group=2
innodb_flush_log_at_trx_commit=2
innodb_flush_method=O_DIRECT

query_cache_size=0
query_cache_type=0

optimizer_switch='index_condition_pushdown=on'
optimizer_switch='mrr=on'
optimizer_switch='mrr_cost_based=off'
read_rnd_buffer_size=32M
optimizer_switch='batched_key_access=on'
join_buffer_size=32M

MariaDB 5.5.20 Configuration
innodb_file_per_table=1
innodb_file_format=barracuda
innodb_log_file_size=512M
innodb_log_files_in_group=2
innodb_flush_log_at_trx_commit=2
innodb_flush_method=O_DIRECT

query_cache_size=0
query_cache_type=0

optimizer_switch='index_condition_pushdown=on'

optimizer_switch='mrr=on'
optimizer_switch='mrr_sort_keys=on'
optimizer_switch='mrr_cost_based=off'
mrr_buffer_size=32M

optimizer_switch='join_cache_incremental=on'
optimizer_switch='join_cache_hashed=on'
optimizer_switch='join_cache_bka=on'
join_cache_level=4
join_buffer_size=32M
join_buffer_space_limit=32M

Note that MariaDB includes a new variable, ‘join_cache_level‘, which controls which join algorithms are allowed to be used; a value of 4 here means that the Nested Loop join and Hash join algorithms are allowed. As we know, ‘join_buffer_size‘ controls the size of the join buffer allocated for each join in a query; MariaDB introduces another variable, ‘join_buffer_space_limit‘, which controls the maximum allowed size of the join buffers for the whole query. By default it has a value of 1024*128*10 bytes, which means that your effective join_buffer_size is no bigger than this value. Hence the reason I have set join_buffer_space_limit=32M.
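To put that default into perspective (simple arithmetic, nothing MariaDB-specific), it is far smaller than the 32M join buffer used in these benchmarks, which is why raising the limit matters:

```python
# Default join_buffer_space_limit in MariaDB 5.5, in bytes
default_limit = 1024 * 128 * 10
print(default_limit)                  # 1310720 bytes
print(default_limit / (1024 * 1024))  # 1.25 (MiB) -- well below 32M
```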

Benchmark Machine Specs

The machine that I used for the benchmarks is a dual core machine with the following CPU configuration: 2x Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz. The amount of memory installed is 8G, and the MySQL datadir is on a 4-disk software RAID5 volume of 5.4K RPM disks. The filesystem used is XFS, and the OS installed is CentOS 6.2.

Table Structure

Before moving on, let’s take a look at the structure of the tables involved in the benchmark tests.

Test Cases

Now let’s see the test cases and then see how the joins perform for each test case.

Test Case A – Join a small table that fits in memory to a large table with no WHERE clause

The SQL used for this test together with its EXPLAIN output as returned by MySQL 5.5 is as follows:

And the results in seconds of time taken to complete the above query:

First thing to note is that I have scaled down the time taken by MySQL 5.5 to finish the query on the IO bound workload so that it fits in the chart; in actuality the query took 32077 seconds. We can clearly see from the chart that hash join comprehensively beats BKA and BNL. Hash join is perfect in cases like this, where you are joining a small table with a very large table with no ‘indexed where’ conditions on the big table: BNLH takes half the time to complete the query for the in-memory workload, and for the IO bound workload it takes 6.6x less time than BKA (MySQL 5.6) and 965x less time than BNL (MySQL 5.5). So hash join gives us an improvement by a very large factor for both the IO bound and the in-memory workload.

Test Case B – Join a small table that fits in memory to a large table with a selective WHERE clause on an indexed column

The SQL used for this test together with its EXPLAIN output as returned by MySQL 5.5 is as follows:

And the results in seconds of time taken to complete the above query:

First thing to note is that I have scaled down the time taken by MySQL 5.5 to finish the query on the IO bound workload so that it fits in the chart; in actuality the query took 1280 seconds. In this test hash join is not ideal, because the highly selective WHERE clause reduces the size of the joined data set. Hash join actually performs badly here, taking 7x more time for the in-memory workload. For the IO bound workload, hash join takes 53x less time to execute the query compared to MySQL 5.5, but slightly more time than the BKA algorithm of MySQL 5.6.

Test Case C – Join a small table with a large table with a WHERE clause on a non-indexed column

The SQL used for this test together with its EXPLAIN output as returned by MySQL 5.5 is as follows:

And the results in seconds of time taken to complete the above query:

First thing to note is that I have scaled down the time taken by MySQL 5.5 to finish the query on the IO bound workload so that it fits in the chart; in actuality the query took 31654 seconds. Again hash join beats BKA and BNL comprehensively: hash join outperforms the other join types when you are joining a small table with a very large table with a WHERE clause on a ‘non-indexed’ column. In this test we can clearly see that hash join gives a large reduction in query time. The reduction in query time for the IO bound workload is 1266x compared to MySQL 5.5 and 9x compared to MySQL 5.6, while for the in-memory workload the reduction is 3.5x compared to both MySQL 5.5 and MySQL 5.6.

Another interesting thing to note is that for both Test B and Test C, hash join takes a similar amount of time for the IO bound and the in-memory workloads. Why? Because hash join implies scanning the table lineitem (right operand) in both test cases. Since in Test B we have a limited set of rows in the supplier table (left operand) to join to the lineitem table (right operand), scanning the lineitem table (BNLH) proves costly compared to doing batched index lookups (BKA). However, in Test C the cost of hash join remains the same, while the cost of BKA increases: many more random index lookups must be performed because of the increase in the number of rows to be joined from the supplier table (left operand).
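A back-of-the-envelope cost model makes this trade-off concrete. This is my own simplification, counting only row accesses and assuming a random index lookup costs a few times a sequentially scanned row; the row counts are rough DBT3 scale factor 2 figures for illustration, not measured values:

```python
# Illustrative constants -- assumptions, not measurements
SEQ_COST, RND_COST = 1, 4       # assume a random lookup costs ~4x a scanned row
LINEITEM_ROWS = 12_000_000      # rough lineitem row count at scale factor 2

def bnlh_cost(right_rows, buffer_refills=1):
    """Hash join: one full scan of the right operand per buffer refill."""
    return right_rows * buffer_refills * SEQ_COST

def bka_cost(left_rows, matches_per_left_row):
    """BKA: roughly one random index lookup per joined row."""
    return left_rows * matches_per_left_row * RND_COST

# Test B: a selective WHERE leaves few supplier rows to join -> BKA wins
print(bka_cost(400, 600) < bnlh_cost(LINEITEM_ROWS))     # True
# Test C: all ~20,000 supplier rows must be joined -> the full scan wins
print(bka_cost(20_000, 600) > bnlh_cost(LINEITEM_ROWS))  # True
```

The hash join's cost is fixed by the scan of the right operand, while BKA's cost grows with the number of left-operand rows to be joined, which is exactly the behavior observed in Tests B and C.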

Test Case D – Join a large data set (>1M rows) from one table with a large table

The SQL used for this test together with its EXPLAIN output as returned by MySQL 5.5 is as follows:

And the results in seconds of time taken to complete the above query:

Here we can clearly see that MySQL 5.5 beats both the BKA of MySQL 5.6 and the hash join of MariaDB 5.5. In the IO bound test MySQL 5.5 takes 2.5x less time to complete the query than MySQL 5.6's BKA algorithm and 6.6x less time than MariaDB 5.5's hash join; hash join also performs worse than BKA, taking 2.5x more time. For the in-memory workload test, MySQL 5.5 takes 5x less time than MySQL 5.6 and 13x less time than MariaDB 5.5's hash join.

First thing to note is that the above query reads 1/3 of the rows in the table orders (left operand), so MySQL 5.5 prefers to do a PRIMARY index scan of the table orders, resulting in sequential IO, while both MySQL 5.6 and MariaDB 5.5 prefer an index range scan on the secondary key o_orderdate, which results in random lookups on the PK to fetch the columns that are not part of the secondary key. Even though MySQL 5.6 uses MRR to offset the effect of random access of the PK, it still proves costly. Also note that the table lineitem is joined on the column l_orderkey, which is the left-most PK column, so reading the table orders in PK order has the added benefit of implying that the table lineitem is also read in PK order. These benefits mean MySQL 5.5 wins.

But why does hash join take so much more time? The reason is that the rows to be read from the left operand, the table orders, greatly exceed the size of the join buffer. The join buffer is 32M, while the size of the left operand is 186M, which means roughly 6 scans of the right operand, the table lineitem. Hence hash join is slow in this case: we have to refill the join buffer with rows from the orders table many times, and hash join is not good when many scans of the right operand are needed (in this case the table lineitem).
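The number of right-operand scans can be estimated as the number of join-buffer refills needed to hold the left operand, i.e. ceil(left size / join_buffer_size); a quick sketch:

```python
import math

def probe_scans(left_operand_bytes, join_buffer_bytes):
    """Approximate number of scans of the right operand: one probe pass
    per refill of the join buffer with build-side (left operand) rows."""
    return math.ceil(left_operand_bytes / join_buffer_bytes)

MB = 1024 * 1024
# 186M of orders rows against a 32M join buffer
print(probe_scans(186 * MB, 32 * MB))  # 6 scans of lineitem
```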

Another difference in this test case is the join fan-out: while with the queries in the previous test cases the joining key from the table supplier would match approximately 600 rows from the table lineitem for each distinct key value, in test case D the joining key from the table orders matches approximately 5 rows from the table lineitem for each distinct key value. Also, the joining key in test case D is the PK in one table and the left-most part of the PK in the other.

How does the optimizer work with the different join algorithms available?

Currently, the part of the optimizer that is responsible for choosing the join algorithm for a particular query and QEP is not advanced enough, and there is still work to be done. As I understand it, the MariaDB folks are working on a cost-based choice of join algorithm. It's not easy, because the current cost model is primitive and must be enhanced to account for the existence of different join algorithms. So what does that mean for MariaDB/MySQL users right now, given the state of the current optimizer? Right now you have to manually enable and disable the join algorithms for the optimizer to choose from.
In MariaDB, every algorithm is assigned a number:
1 – flat BNL
2 – incremental BNL
3 – flat BNLH
4 – incremental BNLH
5 – flat BKA
6 – incremental BKA
7 – flat BKAH
8 – incremental BKAH

The variable join_cache_level controls which algorithms are enabled: if join_cache_level=4, all algorithms numbered 1 to 4 are enabled; if join_cache_level=8, all algorithms numbered 1 to 8 are enabled. The optimizer is naive in the sense that it always uses the highest-numbered enabled join algorithm: with join_cache_level=4 it always uses BNLH (hash join), and with join_cache_level=8 it always uses BKAH (a variant of BKA). The optimizer does not try to check which algorithm is best; it simply assumes that the algorithm with the highest number is the best one.
So we can force the join algorithm used by setting an appropriate value of “join_cache_level”. For example, in my tests I forced the optimizer to use hash join by setting join_cache_level=4. We can establish rules for which join algorithm suits which kind of query, and then select that algorithm via the variable “join_cache_level”.
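For example, a session could be pointed at hash join with settings like the following (a sketch using the variables discussed above; the values mirror my benchmark configuration):

```sql
-- Allow algorithms 1-4; the optimizer will then pick BNLH (hash join)
SET SESSION join_cache_level = 4;
-- Make sure the per-join buffer and the per-query limit are large enough
SET SESSION join_buffer_size = 32 * 1024 * 1024;
SET SESSION join_buffer_space_limit = 32 * 1024 * 1024;
```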

Conclusion

Based on the above information and the benchmark results for the different test cases, we can see where hash joins work best and where they don't. First of all, hash joins only work for equijoins. Hash joins work best when you are joining very big tables with no WHERE clause, or with a WHERE clause on a non-indexed column. They also provide a big improvement in query response time when you are joining tables with no indexes on the join condition (a full join). The best performance with hash join is achieved when the left table fits completely in the join buffer, or when the fewest buffer refills are needed, as each buffer refill means a scan of the right-side table. However, hash joins do not outperform BNL or BKA when you are joining a really small subset of rows, as scanning the right-side table then becomes costly in comparison. Block Nested Loop join performs better than hash join when you are joining two tables on a PK column such that both tables are read in PK order. One use case that I can think of for hash joins is data warehouse applications that need to run reporting queries joining on lookup tables, which tend to be small. What use cases can you think of?

Comments

  1. Antonio says:

    Hi Ovais,

    great post here. We have had a go at MariaDB to verify if hash join could bring new life to our DWH and indeed performance gains have been incredible.
    Unfortunately, though, after only two days of operations we ran into severe instability of the whole platform. Specifically, MariaDB did not seem able to correctly terminate processes, leaving them as “killed” indefinitely but still utilising a full thread/CPU (100.3%).

    On top of that, this behavior also prevented the MySQL daemon from shutting down, and MariaDB/mysqld in turn prevented the RHEL server from shutting down, leaving us with no choice but to abruptly power off the server.

    This behavior occurred more frequently when using

    join_cache_level=4 (incremental BNLH)

    More to come… We have 2 different applications calling the DWH, developed in Java and C#. All our applications would often run into the “Mysql had gone away” or “Lost connection to MySQL server during query” errors. The same happened using SQLYog as a client to debug and execute queries.

    By setting the above variable to 6 (incremental BKA) we managed to stabilise the platform a bit, but performance was sometimes halved compared to BNLH.
    The only way to make the whole system totally stable was to go back to traditional BNL joins, giving up every performance advantage that MariaDB could offer.

    We tested MariaDB on 2 servers, both running CentOS 6.4 64bit but on different hardware, and the results were exactly the same.
    Have you ever experienced such behavior? Can you guess what was causing it?

    Thanks again for the very informative post.

    Antonio

  2. Hi Antonio,

    It would probably be difficult to guess what was happening in your setup, but it sounds like a bug. I haven't seen BNLH causing such issues. You may want to file a bug with MariaDB together with a test case, which will make it easier to track down and fix the cause.

    On the other hand, errors such as “Mysql had gone away” and “Lost connection to MySQL server during query” are more linked to improper connection termination, and I wouldn't really blame the join algorithm for those, as the join algorithm executes at a different layer while the errors appear to happen at the communication layer.

    Did you try out MySQL 5.6 on your workload? It has many performance improvements, especially targeting IO bound workloads, which is what we normally have in a DWH app.
