Where the open source community meets: Secure your spot for Percona Live Amsterdam! - Register

Downloads

Blog

A Look at MariaDB Subquery Cache

March 3, 2017

Author

Federico Razzoli

MariaDB

MySQL

Share this Post:

The MariaDB subquery cache feature added in MariaDB 5.3 is not widely known. Let’s see what it is and how it works.

What is a subquery cache?

The MariaDB subquery cache optimizes the execution of correlated subqueries. Correlated subqueries refer to a value from the parent query. For example:

SELECT id FROM product WHERE price NOT IN (SELECT MAX(price) FROM product GROUP BY category);

1	SELECT id FROM product WHERE price NOT IN (SELECT MAX(price) FROM product GROUP BY category);

MariaDB only uses this optimization if the parent query is a SELECT, not an UPDATE or a DELETE. The subquery results get cached only for the duration of the parent query.

MariaDB added the subquery cache in v5.3. It is controlled by optimizer_switch, a dynamic variable that contains many flags that enable or disable several optimizations. To disable the subquery cache, run:

SET GLOBAL optimizer_switch='subquery_cache=OFF';

1	SET GLOBAL optimizer_switch='subquery_cache=OFF';

You can also do this at the session level.

How does subquery cache work?

Let’s see how it works. To make things clearer, we will use an example. Consider these tables:

CREATE TABLE t1 (a INT, b INT);
INSERT INTO t1 VALUES
(1,2),
(3,4),
(1,2),
(3,4),
(3,4),
(3,5),
(3,5),
(5,1),
(5,2),
(3,6),
(1,5);

CREATE TABLE t2 (c INT, d INT);
INSERT INTO t2 VALUES
(1,10),
(2,20),
(3,30),
(4,40);

CREATE TABLE t1 (a INT, b INT);

INSERT INTO t1 VALUES

(1,2),

(3,4),

(1,2),

(3,4),

(3,5),

(5,1),

(5,2),

(3,6),

(1,5);

CREATE TABLE t2 (c INT, d INT);

INSERT INTO t2 VALUES

(1,10),

(2,20),

(3,30),

(4,40);

Now, we issue this query:

SELECT b, (SELECT d FROM t2 WHERE a = c) FROM t1;

1	SELECT b, (SELECT d FROM t2 WHERE a = c) FROM t1;

The server decides to read t1 first (the bigger table, as expected), and then access t2 using the subquery cache. It creates a MEMORY temporary table to store the results of the subquery, with an index on c (it is used to match the rows). Then it reads the first row from t1, and checks if the search is cached. It is not, so it reads t1 looking for rows with c=1 and copies the results into the cache. The next time it will find the value 1, and it will not need to access t2 because the matches are already cached. If you look at the data, you may notice that the value “5” appears twice in t1 (and is absent in t2). But the search is cached anyway, so the server searches for 5 in t2 only once.

I hope that you aren’t blindly accepting what I wrote until now: good DBAs need facts and metrics. Let’s be scientific: we’ll make a prediction, conduct an experiment and check the status variables to verify the prediction. The prediction is the list of rows that will be read from t1, with the running total of hits and misses:

(1,2) -- Miss: 1
(3,4) -- Miss: 2
(1,2) -- Hit:  1
(3,4) -- Hit:  2
(3,4) -- Hit:  3
(3,5) -- Hit:  4
(3,5) -- Hit:  5
(5,1) -- Miss: 3
(5,2) -- Hit:  6
(3,6) -- Hit:  7
(1,5) -- Hit:  8

(1,2) -- Miss: 1

(3,4) -- Miss: 2

(1,2) -- Hit: 1

(3,4) -- Hit: 2

(3,4) -- Hit: 3

(3,5) -- Hit: 4

(3,5) -- Hit: 5

(5,1) -- Miss: 3

(5,2) -- Hit: 6

(3,6) -- Hit: 7

(1,5) -- Hit: 8

MariaDB [test]> SHOW STATUS LIKE "subquery_cache%";
+---------------------+-------+
| Variable_name | Value |
+---------------------+-------+
| Subquery_cache_hit | 8 |
| Subquery_cache_miss | 3 |
+---------------------+-------+
2 rows in set (0.00 sec)

MariaDB [test]> SHOW STATUS LIKE "handler_tmp_write";
+-------------------+-------+
| Variable_name | Value |
+-------------------+-------+
| Handler_tmp_write | 3 |
+-------------------+-------+
1 row in set (0.00 sec)

MariaDB [test]> SHOW STATUS LIKE "subquery_cache%";

+---------------------+-------+

| Variable_name | Value |

+---------------------+-------+

| Subquery_cache_hit | 8 |

| Subquery_cache_miss | 3 |

+---------------------+-------+

2 rows in set (0.00 sec)

MariaDB [test]> SHOW STATUS LIKE "handler_tmp_write";

+-------------------+-------+

| Variable_name | Value |

+-------------------+-------+

| Handler_tmp_write | 3 |

+-------------------+-------+

1 row in set (0.00 sec)

The totals match, and the number of writes to the cache is equal to the misses (after a miss, a table access is done and cached).

The maximum size of an individual table is the minimum of tmp_table_size and max_heap_table_size. If the table size grows over this limit, the table is written to disk. If the MEMORY table creation fails (perhaps because MEMORY does not support BLOB), the subquery is not cached.

The total of hits and misses can be seen by reading two status variables: subquery_cache_hit and subquery_cache_miss. After 200 misses, the server checks the hit ratio for that particular subquery. If it is < 20%, it disables the cache for that subquery. If the hit rate is < 70% the table cannot be written to disk in case it exceeds the size limit. These numbers (200, 0.2, 0.7) are hardcoded and cannot be changed. But if you really want to test how MariaDB behaves with different parameters, you can change these constants in sql/sql_expression_cache.cc and recompile the server.

Isn’t this subquery materialization?

Subquery materialization is another strategy that the optimizer can choose to execute a query. It might look similar, because some data from a subquery are written to a MEMORY table – but this is the only similarity. The purpose and implementation of subquery materialization is different.

Let’s try to explain this with pseudocode.

Subquery materialization is for uncorrelated IN subqueries. Therefore the subquery is executed and materialized altogether:

# Query to optimize:
SELECT ... WHERE col1 IN (subquery)

materialize subquery into a MEMORY table with UNIQUE keys;
foreach (row in outer query) {
	check if col1 current value exists in materialized table
}

# Query to optimize:

SELECT ... WHERE col1 IN (subquery)

materialize subquery into a MEMORY table with UNIQUE keys;

foreach (row in outer query) {

check if col1 current value exists in materialized table

}

The subquery cache is for correlated subqueries. Thus the subquery gets executed only for non-cached values:

# Query to optimize:
SELECT col1, (SELECT ... WHERE ... = col1) ... FROM ...

foreach (outer query row) {
	if (col1 current value is cached) {
		read from cache
	} else {
		read from subquery
		cache col1 current value
	}
}

# Query to optimize:

SELECT col1, (SELECT ... WHERE ... = col1) ... FROM ...

foreach (outer query row) {

if (col1 current value is cached) {

read from cache

} else {

read from subquery

cache col1 current value

}

Some considerations

Despite the similarity in names, the MariaDB subquery cache is not a query cache for subqueries. These features are different, implemented for different purposes. Obviously, the subquery cache doesn’t have the scalability and performance problems of the query cache (global mutex, table invalidation). As mentioned, a subquery cache table only survives for the duration of a statement, so it should be considered an optimizer strategy. For example, in some cases you might use the subquery cache for a WHERE … NOT IN subquery, but not for the WHERE … IN version, because the optimizer prefers to rewrite it as a JOIN.

Of course, not all correlated subqueries automatically benefit from this feature. Consider the example above: it is built to show that the subquery cache is useful. But we can easily build an example to show that can have a negative impact on performance: add rows to t1, and delete all duplicate values of a. There will be no hits, a temporary table is created, 200 reads and writes are performed, but it won’t help. After 200 misses, the cache will be disabled, yes, but what if this happens for each subquery? The damage may not be huge in a realistic case, but it is still damage. That’s why you can disable the MariaDB subquery cache.

0 0 votes

Article Rating

4 Comments

Oldest

Newest Most Voted

FractalizeR

9 years ago

I wonder why this is
> SET GLOBAL optimizer_switch=’subquery_cache=OFF’;
not
SET GLOBAL subquery_cache=OFF
?

This way we will end up with something like
SET GLOBAL optimizer_switch='subquery_cache="feature1='feature2=true'"';

Author

Federico Razzoli

9 years ago

Reply to FractalizeR

Hi FractalizeR,
Fortunately, you don’t need to set all flags of optimizer_switch. You only specify the flags that you want to change.

William

9 years ago

Hi Federico,

Thanks for the post.

If I understand correctly where it says: “It is not, so it reads t1 looking for rows with c = 1 …” you should say: “It is not, so it reads t2 looking for rows with c = 1 …”, right?. Thank you.