Finding Table Differences on Nullable Columns Using MySQL Generated Columns

Some time ago, a customer had a performance issue with an internal process that compared two tables, finding and reporting the rows that differed between them. This is simple if you use a LEFT JOIN and an IS NULL comparison over the second table in the WHERE clause, but what if a column can be NULL? That is why the customer used UNION, GROUP BY, and HAVING clauses, which resulted in poor performance.

The challenge was to be able to compare each row using a LEFT JOIN over NULL values.

The challenge in more detail

I’m not going to use the customer’s real table. Instead, I will be comparing two sysbench tables with the same structure:
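The schema itself did not survive in this copy of the post. Based on the standard sysbench sbtest layout, with the data columns made nullable as described below, it would have looked roughly like this (column names and lengths are assumptions):

```sql
CREATE TABLE sbtest1 (
  id  INT NOT NULL AUTO_INCREMENT,
  k   INT DEFAULT NULL,        -- nullable, unlike stock sysbench
  c   CHAR(120) DEFAULT NULL,  -- nullable
  pad CHAR(60) DEFAULT NULL,   -- nullable
  PRIMARY KEY (id),
  KEY k_1 (k)
);
-- sbtest2 is created with the identical structure
```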

It is slightly different from the original sysbench schema, as this version can hold NULL values. Both tables have the same number of rows. We are going to set one row to NULL in each table:
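The UPDATE statements are missing from this copy; something along these lines would produce the situation described (the row id is a placeholder, not necessarily the one actually used):

```sql
-- Hypothetical row choice: the post does not preserve which rows were used
UPDATE sbtest1 SET k = NULL WHERE id = 10;
UPDATE sbtest2 SET k = NULL WHERE id = 10;
```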

If we execute the comparison query, we get this result:
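The query text is not preserved here, but from Guilhem Bichot's comment below it was the classic LEFT JOIN anti-join over the data columns, roughly:

```sql
-- Rows in sbtest1 with no matching row in sbtest2.
-- USING (k, c, pad) expands to equality conditions, so a NULL in any
-- of these columns can never match, even against another NULL.
SELECT a.id
FROM sbtest1 a
LEFT JOIN sbtest2 b USING (k, c, pad)
WHERE b.id IS NULL;
```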

As you can see, column k is NULL. In both cases the comparison failed and reported those rows as different. This is not new in MySQL, but it would be nice to have a way to sort this issue out.


The solution is based on GENERATED COLUMNS with a hash function (md5), keeping the result in a binary(16) column:
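The ALTER statement did not survive; a plausible sketch follows. The exact hash expression, the NULL sentinels, and the VIRTUAL/STORED choice are assumptions:

```sql
-- UNHEX(MD5(...)) yields 16 bytes, which fits BINARY(16).
-- IFNULL(...) maps NULL to a sentinel so two NULLs hash to the same value.
ALTER TABLE sbtest1
  ADD COLUMN real_id BINARY(16)
    GENERATED ALWAYS AS (
      UNHEX(MD5(CONCAT_WS('-', IFNULL(k, 'NULL'),
                               IFNULL(c, 'NULL'),
                               IFNULL(pad, 'NULL'))))
    ) VIRTUAL,
  ADD INDEX (real_id);
-- sbtest2 gets the same column and index
```

With a VIRTUAL column the hash itself is not stored in the row; only the secondary index materializes it.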

Adding the index is also part of the solution. Now, let’s execute the query using the new column to join the tables:
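The improved query is also missing; given that the text later says it runs two similar subqueries, it was presumably a UNION of the two anti-join directions over the hash, something like:

```sql
-- Direction 1: rows in sbtest1 missing from sbtest2
SELECT a.id FROM sbtest1 a
LEFT JOIN sbtest2 b ON a.real_id = b.real_id
WHERE b.id IS NULL
UNION
-- Direction 2: rows in sbtest2 missing from sbtest1
SELECT b.id FROM sbtest2 b
LEFT JOIN sbtest1 a ON b.real_id = a.real_id
WHERE a.id IS NULL;
```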

We can see an improvement in the query performance (it now takes 2.31 sec, whereas before it took 3.00 sec) and that the result is as expected. We could say that that's all and that no further improvement is possible. However, that is not true. Even though the query is running faster, it is possible to optimize it further:
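The optimized query is not preserved either. From the EXPLAIN discussion that follows (full index scan, generated column never evaluated), the trick was presumably to touch only the indexed real_id column so each branch becomes a covering-index scan; a sketch:

```sql
-- Selecting and joining only on real_id lets MySQL read the value from
-- the secondary index instead of evaluating the MD5 expression per row.
SELECT a.real_id FROM sbtest1 a
LEFT JOIN sbtest2 b ON a.real_id = b.real_id
WHERE b.real_id IS NULL
UNION
SELECT b.real_id FROM sbtest2 b
LEFT JOIN sbtest1 a ON b.real_id = a.real_id
WHERE a.real_id IS NULL;
```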

Why is this faster? The first query is performing two subqueries. Each subquery is very similar. Let’s check the explain plan:
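The EXPLAIN output is not preserved; the statement being analyzed would have been along these lines (one of the two similar branches):

```sql
EXPLAIN
SELECT a.id FROM sbtest1 a
LEFT JOIN sbtest2 b ON a.real_id = b.real_id
WHERE b.id IS NULL;
```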

As you can see, it is performing a full table scan over the first table and using real_id to join the second table. Because real_id is a generated column, MySQL has to execute the hash function on every row to produce the value used to join the second table, and that takes time.

If we analyze the subquery of the second query:
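Again the output is missing; the EXPLAIN would target one branch of the optimized query, roughly:

```sql
EXPLAIN
SELECT a.real_id FROM sbtest1 a
LEFT JOIN sbtest2 b ON a.real_id = b.real_id
WHERE b.real_id IS NULL;
```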

We will see that it performs a full index scan over the first table and that the generated column's expression is never executed. That is how we go from an inconsistent result in 3.00 seconds, to a consistent result in 2.31 seconds, and finally to a performant query in 1.60 seconds.


This is not the first blog post I've written about generated columns. I think it is a useful feature for several scenarios where you need to improve performance. In this particular case, it also provides a workaround for the expected inconsistencies of LEFT JOINs with NULL values. It is also important to mention that this improved a process in a real-world scenario.


Comments (7)

  • Jouni Järvinen

    Jeez … MD5 is unreliable even if it’s pretty fast. Not storing the hash is a bonus though, saving space.

    October 3, 2018 at 11:26 am
    • David Ducos

      Hi Jouni, thank you for your comment! I chose MD5 for no particular reason, as later in the customer's process there is a second validation. Feel free to choose whichever hash function suits your needs best.

      October 3, 2018 at 5:53 pm
  • Guilhem Bichot

    Hello. The original problem, that two rows with the same NULL value are not considered equal (and thus are false positives in the first difference-finding query), is because the USING clause implicitly translates to equality conditions, which reject NULLs; it can be solved by using an ON clause with the <=> (NULL-safe equal) operator; instead of:
    sbtest1 a left join sbtest2 b using (k,c,pad) where b.id is null,
    use:
    sbtest1 a left join sbtest2 b on a.k <=> b.k and a.c <=> b.c and a.pad <=> b.pad where b.id is null.
    Guilhem (MySQL dev team)

    October 10, 2018 at 11:22 am
    • Guilhem Bichot

      Alas the formatting mangled operators in my query. It’s “on a.k OP b.k etc”, where OP is the NULL-safe equal described here:

      October 11, 2018 at 3:28 pm
    • David Ducos

      Hi Guilhem, you are right, it is possible to use it. I have to be honest: I didn't test it before publishing this post. However, I must say that in my preliminary tests the performance didn't increase as significantly as it did with the generated columns, which was one of the goals. I will do some more research and update it. We could discuss the performance impact on inserts, or other penalties that the use of generated columns could have, but my idea is to come up with out-of-the-box solutions, which is one of the reasons why I like this feature so much.

      October 11, 2018 at 3:41 pm
  • Vishal Saxena

    If the table has primary keys, then "mysqldump" can also be used with a WHERE clause to validate the data in chunks between source and destination. "md5sum" can be used on the dumped files to compare the checksums. There is also a Percona tool, but I guess that forces "statement" binary logging and may not work with "row"-level replication.

    October 24, 2018 at 10:49 am
