Where the open source community meets: Secure your spot for Percona Live Amsterdam! - Register

Downloads

Blog

Finding Table Differences on Nullable Columns Using MySQL Generated Columns

October 3, 2018

Author

David Ducos

Insight for DBAs

Insight for Developers

MySQL

Share this Post:

MySQL generated columns Some time ago, a customer had a performance issue with an internal process. He was comparing, finding, and reporting the rows that were different between two tables. This is simple if you use a LEFT JOIN and an IS NULL comparison over the second table in the WHERE clause, but what if the column could be null? That is why he used UNION, GROUP BY and a HAVING clauses, which resulted in poor performance.

The challenge was to be able to compare each row using a LEFT JOIN over NULL values.

The challenge in more detail

I’m not going to use the customer’s real table. Instead, I will be comparing two sysbench tables with the same structure:

CREATE TABLE `sbtest1` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `k` int(10) unsigned DEFAULT NULL,
  `c` char(120) DEFAULT NULL,
  `pad` char(60) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `k` (`k`,`c`,`pad`)
) ENGINE=InnoDB

CREATE TABLE `sbtest1` (

`id` int(10) unsigned NOT NULL AUTO_INCREMENT,

`k` int(10) unsigned DEFAULT NULL,

`c` char(120) DEFAULT NULL,

`pad` char(60) DEFAULT NULL,

PRIMARY KEY (`id`),

KEY `k` (`k`,`c`,`pad`)

) ENGINE=InnoDB

It is sightly different from the original sysbench schema, as this version can hold NULL values. Both tables have the same number of rows. We are going to set to NULL one row on each table:

update sbtest1 set k=null where limit 1;
update sbtest2 set k=null where limit 1;

1 2	update sbtest1 set k=null where limit 1; update sbtest2 set k=null where limit 1;

If we execute the comparison query, we get this result:

mysql> select "sbtest1",a.* from 
    -> sbtest1 a left join 
    -> sbtest2 b using (k,c,pad) 
    -> where b.id is null union 
    -> select "sbtest2",a.* from 
    -> sbtest2 a left join 
    -> sbtest1 b using (k,c,pad)
    -> where b.id is null;
+---------+------+------+-------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------+
| sbtest1 | id   | k    | c                                                                                                                       | pad                                                         |
+---------+------+------+-------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------+
| sbtest1 | 4462 | NULL | 64568100364-99474573987-46567807085-85185678273-10829479379-85901445105-43623848418-63872374080-59257878609-82802454375 | 07052127207-33716235481-22978181904-76695680520-07986095803 |
| sbtest2 | 4462 | NULL | 64568100364-99474573987-46567807085-85185678273-10829479379-85901445105-43623848418-63872374080-59257878609-82802454375 | 07052127207-33716235481-22978181904-76695680520-07986095803 |
+---------+------+------+-------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------+
2 rows in set (3.00 sec)

mysql> select "sbtest1",a.* from

-> sbtest1 a left join

-> sbtest2 b using (k,c,pad)

-> where b.id is null union

-> select "sbtest2",a.* from

-> sbtest2 a left join

-> sbtest1 b using (k,c,pad)

-> where b.id is null;

+---------+------+------+-------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------+

| sbtest1 | id | k | c | pad |

| sbtest1 | 4462 | NULL | 64568100364-99474573987-46567807085-85185678273-10829479379-85901445105-43623848418-63872374080-59257878609-82802454375 | 07052127207-33716235481-22978181904-76695680520-07986095803 |

| sbtest2 | 4462 | NULL | 64568100364-99474573987-46567807085-85185678273-10829479379-85901445105-43623848418-63872374080-59257878609-82802454375 | 07052127207-33716235481-22978181904-76695680520-07986095803 |

2 rows in set (3.00 sec)

As you can see, column k is NULL. In both cases it failed and reported those rows to be different. This is not new in MySQL, but it would be nice to have a way to sort this issue out.

Solution

The solution is based on GENERATED COLUMNS with a hash function (md5) and stored in a binary(16) column:

ALTER TABLE sbtest1 
ADD COLUMN `real_id` binary(16) GENERATED ALWAYS AS (unhex(md5(concat(ifnull(`k`,'NULL'),ifnull(`c`,'NULL'),ifnull(`pad`,'NULL'))))) VIRTUAL, 
ADD INDEX (real_id);
ALTER TABLE sbtest2 
ADD COLUMN `real_id` binary(16) GENERATED ALWAYS AS (unhex(md5(concat(ifnull(`k`,'NULL'),ifnull(`c`,'NULL'),ifnull(`pad`,'NULL'))))) VIRTUAL,
ADD INDEX (real_id);

ALTER TABLE sbtest1

ADD COLUMN `real_id` binary(16) GENERATED ALWAYS AS (unhex(md5(concat(ifnull(`k`,'NULL'),ifnull(`c`,'NULL'),ifnull(`pad`,'NULL'))))) VIRTUAL,

ADD INDEX (real_id);

ALTER TABLE sbtest2

ADD COLUMN `real_id` binary(16) GENERATED ALWAYS AS (unhex(md5(concat(ifnull(`k`,'NULL'),ifnull(`c`,'NULL'),ifnull(`pad`,'NULL'))))) VIRTUAL,

ADD INDEX (real_id);

Adding the index is also part of the solution. Now, let’s execute the query using the new column to join the tables:

mysql> select "sbtest1",a.k,a.c,a.pad from
    -> sbtest1 a left join
    -> sbtest2 b using (real_id)
    -> where b.id is null union
    -> select "sbtest2",a.k,a.c,a.pad from
    -> sbtest2 a left join
    -> sbtest1 b using (real_id)
    -> where b.id is null;
Empty set (2.31 sec)

mysql> select "sbtest1",a.k,a.c,a.pad from

-> sbtest1 a left join

-> sbtest2 b using (real_id)

-> where b.id is null union

-> select "sbtest2",a.k,a.c,a.pad from

-> sbtest2 a left join

-> sbtest1 b using (real_id)

-> where b.id is null;

Empty set (2.31 sec)

We can see an improvement in the query performance—it now takes 2.31 sec whereas before it was 3.00 sec—and that the result is as expected. We could say that that’s all, and no possible improvement can be made. However, is not true. Even though the query is running faster, it is possible to optimize it in this way:

mysql> select "sbtest1",a.k,a.c,a.pad
    -> from sbtest1 a
    -> where a.id in (select a.id
    ->   from sbtest1 a left join
    ->   sbtest2 b using (real_id)
    ->   where b.id is null) union
    -> select "sbtest2",a.k,a.c,a.pad
    -> from sbtest2 a
    -> where a.id in (select a.id
    ->   from sbtest2 a left join
    ->   sbtest1 b using (real_id)
    ->   where b.id is null);
Empty set (1.60 sec)

mysql> select "sbtest1",a.k,a.c,a.pad

-> from sbtest1 a

-> where a.id in (select a.id

-> from sbtest1 a left join

-> sbtest2 b using (real_id)

-> where b.id is null) union

-> select "sbtest2",a.k,a.c,a.pad

-> from sbtest2 a

-> where a.id in (select a.id

-> from sbtest2 a left join

-> sbtest1 b using (real_id)

-> where b.id is null);

Empty set (1.60 sec)

Why is this faster? The first query is performing two subqueries. Each subquery is very similar. Let’s check the explain plan:

mysql> explain select "sbtest1",a.k,a.c,a.pad from
    -> sbtest1 a left join
    -> sbtest2 b using (real_id)
    -> where b.id is null;
+----+-------------+-------+------------+------+---------------+---------+---------+------------------+--------+----------+--------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key     | key_len | ref              | rows   | filtered | Extra                                |
+----+-------------+-------+------------+------+---------------+---------+---------+------------------+--------+----------+--------------------------------------+
|  1 | SIMPLE      | a     | NULL       | ALL  | NULL          | NULL    | NULL    | NULL             | 315369 |   100.00 | NULL                                 |
|  1 | SIMPLE      | b     | NULL       | ref  | real_id       | real_id | 17      | sbtest.a.real_id |     27 |    10.00 | Using where; Not exists; Using index |
+----+-------------+-------+------------+------+---------------+---------+---------+------------------+--------+----------+--------------------------------------+

mysql> explain select "sbtest1",a.k,a.c,a.pad from

-> sbtest1 a left join

-> sbtest2 b using (real_id)

-> where b.id is null;

+----+-------------+-------+------------+------+---------------+---------+---------+------------------+--------+----------+--------------------------------------+

+----+-------------+-------+------------+------+---------------+---------+---------+------------------+--------+----------+--------------------------------------+

| 1 | SIMPLE | a | NULL | ALL | NULL | NULL | NULL | NULL | 315369 | 100.00 | NULL |

| 1 | SIMPLE | b | NULL | ref | real_id | real_id | 17 | sbtest.a.real_id | 27 | 10.00 | Using where; Not exists; Using index |

+----+-------------+-------+------------+------+---------------+---------+---------+------------------+--------+----------+--------------------------------------+

As you can see, it is performing a full table scan over the first table and using real_id to join the second table. The real_id is a generated column, so it needs to execute the function to get the value to join the second table. That means that it’s going to take time.

If we analyze the subquery of the second query:

mysql> explain select "sbtest1",a.k,a.c,a.pad
    -> from sbtest1 a
    -> where a.id in (select a.id
    ->   from sbtest1 a left join
    ->   sbtest2 b using (real_id)
    ->   where b.id is null);
+----+--------------+-------------+------------+--------+---------------+------------+---------+------------------+--------+----------+--------------------------------------+
| id | select_type  | table       | partitions | type   | possible_keys | key        | key_len | ref              | rows   | filtered | Extra                                |
+----+--------------+-------------+------------+--------+---------------+------------+---------+------------------+--------+----------+--------------------------------------+
|  1 | SIMPLE       | a           | NULL       | index  | PRIMARY       | k          | 187     | NULL             | 315369 |   100.00 | Using where; Using index             |
|  1 | SIMPLE       | <subquery2> | NULL       | eq_ref | <auto_key>    | <auto_key> | 4       | sbtest.a.id      |      1 |   100.00 | NULL                                 |
|  2 | MATERIALIZED | a           | NULL       | index  | PRIMARY       | real_id    | 17      | NULL             | 315369 |   100.00 | Using index                          |
|  2 | MATERIALIZED | b           | NULL       | ref    | real_id       | real_id    | 17      | sbtest.a.real_id |     27 |    10.00 | Using where; Not exists; Using index |
+----+--------------+-------------+------------+--------+---------------+------------+---------+------------------+--------+----------+--------------------------------------+

mysql> explain select "sbtest1",a.k,a.c,a.pad

-> from sbtest1 a

-> where a.id in (select a.id

-> from sbtest1 a left join

-> sbtest2 b using (real_id)

-> where b.id is null);

+----+--------------+-------------+------------+--------+---------------+------------+---------+------------------+--------+----------+--------------------------------------+

+----+--------------+-------------+------------+--------+---------------+------------+---------+------------------+--------+----------+--------------------------------------+

| 1 | SIMPLE | a | NULL | index | PRIMARY | k | 187 | NULL | 315369 | 100.00 | Using where; Using index |

| 2 | MATERIALIZED | a | NULL | index | PRIMARY | real_id | 17 | NULL | 315369 | 100.00 | Using index |

| 2 | MATERIALIZED | b | NULL | ref | real_id | real_id | 17 | sbtest.a.real_id | 27 | 10.00 | Using where; Not exists; Using index |

+----+--------------+-------------+------------+--------+---------------+------------+---------+------------------+--------+----------+--------------------------------------+

We are going to see that it is performing a full index scan over the first table, and that the generated column has never been executed. That is how we can go from an inconsistent result of three seconds, to a consistent result of 2.31 seconds, to finally reach a performant query using the faster time of 1.60 seconds.

Conclusions

This is not the first blog post that I’ve done about generated columns. I think that it is a useful feature for several scenarios where you need to improve performance. In this particular case, it’s also presenting a workaround to expected inconsistencies with LEFT JOINS with NULL values. It is also important to mention that this improved a process in a real world scenario.

0 0 votes

Article Rating

7 Comments

Oldest

Newest Most Voted

Jouni Järvinen

7 years ago

Jeez … MD5 is unreliable even if it’s pretty fast. Not storing the hash is a bonus though, saving space.

Author

David Ducos

7 years ago

Reply to Jouni Järvinen

Hi Jouni, thank you for your comment! I chose MD5 for no particular reason, as later in the customers process, there is a second validation. Feel free to choose any hash function that suit the best for your needs.
Cheers

Guilhem Bichot

7 years ago

Hello. The original problem, that two rows with a same NULL value are not considered equal (and thus are false positives in the first difference-finding query), is because the USING clause implicitely translates to equality conditions, which reject NULLs; it can be solved by using a clause with the operator; instead of:
sbtest1 a left join sbtest2 b using (k,c,pad) where b.id is null,
do
sbtest1 a left join sbtest2 b on a.kb.k and a.cb.c and a.padb.pad where b.id is null .
Guilhem (MySQL dev team)

Guilhem Bichot

7 years ago

Reply to Guilhem Bichot

Alas the formatting mangled operators in my query. It’s “on a.k OP b.k etc”, where OP is the NULL-safe equal described here:
https://dev.mysql.com/doc/refman/8.0/en/comparison-operators.html#operator_equal-to

Author

David Ducos

7 years ago

Reply to Guilhem Bichot

Hi Guihem, you are right, it is possible to use it, I have to be honest, I didn’t test it before to publish this post. However, I must say that, in my preliminary test, the performance didn’t increase significantly as it did with the generated columns which was one of the goals, I will do some more research a update it. We could discuss about the performance on inserts or other performance penalties that the use of generated columns could have, but my idea is come up with out of the box solutions, which is one of the reason why I like so much this feature.

Vishal Saxena

7 years ago

If table has primary keys then “mysqldump” can also be used with where clause to validate the data in chunks between source and destination.”md5sum” can be used on dumped files to compare the checksums. percona tool is also there but i guess that forces “statement” binary logs and may not work in “row level” replication.