Where the open source community meets: Secure your spot for Percona Live Amsterdam! - Register

Downloads

Blog

Nondeterministic Functions in MySQL (i.e. rand) Can Surprise You

December 5, 2018

Author

Alexander Rubin

Insight for DBAs

Insight for Developers

MySQL

Share this Post:

Working on a test case with sysbench, I encountered this:

mysql> select * from sbtest1 where id = round(rand()*10000, 0);
+------+--------+-------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------+
| id   | k      | c                                                                                                                       | pad                                                         |
+------+--------+-------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------+
|  179 | 499871 | 09833083632-34593445843-98203182724-77632394229-31240034691-22855093589-98577647071-95962909368-34814236148-76937610370 | 62233363025-41327474153-95482195752-11204169522-13131828192 |
| 1606 | 502031 | 81212399253-12831141664-41940957498-63947990218-16408477860-15124776228-42269003436-07293216458-45216889819-75452278174 | 25423822623-32136209218-60113604068-17409951653-00581045257 |
+------+--------+-------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------+
2 rows in set (0.30 sec)

mysql> select * from sbtest1 where id = round(rand()*10000, 0);

+------+--------+-------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------+

| id | k | c | pad |

| 179 | 499871 | 09833083632-34593445843-98203182724-77632394229-31240034691-22855093589-98577647071-95962909368-34814236148-76937610370 | 62233363025-41327474153-95482195752-11204169522-13131828192 |

| 1606 | 502031 | 81212399253-12831141664-41940957498-63947990218-16408477860-15124776228-42269003436-07293216458-45216889819-75452278174 | 25423822623-32136209218-60113604068-17409951653-00581045257 |

2 rows in set (0.30 sec)

I was really surprised. First, and the most important, id is a primary key and the rand() function should produce just one value. How come it returns two rows? Second, why is the response time 0.30 sec? That seems really high for a primary key access.

Looking further:

CREATE TABLE `sbtest1` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `k` int(11) NOT NULL DEFAULT '0',
  `c` char(120) NOT NULL DEFAULT '',
  `pad` char(60) NOT NULL DEFAULT '',
  PRIMARY KEY (`id`),
  KEY `k_1` (`k`)
) ENGINE=InnoDB AUTO_INCREMENT=1000001 DEFAULT CHARSET=latin1

mysql> explain select * from sbtest1 where id = round(rand()*10000, 0);
+----+-------------+---------+------------+------+---------------+------+---------+------+--------+----------+-------------+
| id | select_type | table   | partitions | type | possible_keys | key  | key_len | ref  | rows   | filtered | Extra       |
+----+-------------+---------+------------+------+---------------+------+---------+------+--------+----------+-------------+
|  1 | SIMPLE      | sbtest1 | NULL       | ALL  | NULL          | NULL | NULL    | NULL | 986400 |    10.00 | Using where |
+----+-------------+---------+------------+------+---------------+------+---------+------+--------+----------+-------------+

CREATE TABLE `sbtest1` (

`id` int(11) NOT NULL AUTO_INCREMENT,

`k` int(11) NOT NULL DEFAULT '0',

`c` char(120) NOT NULL DEFAULT '',

`pad` char(60) NOT NULL DEFAULT '',

PRIMARY KEY (`id`),

KEY `k_1` (`k`)

) ENGINE=InnoDB AUTO_INCREMENT=1000001 DEFAULT CHARSET=latin1

mysql> explain select * from sbtest1 where id = round(rand()*10000, 0);

+----+-------------+---------+------------+------+---------------+------+---------+------+--------+----------+-------------+

+----+-------------+---------+------------+------+---------------+------+---------+------+--------+----------+-------------+

+----+-------------+---------+------------+------+---------------+------+---------+------+--------+----------+-------------+

So it is a primary key, but MySQL does not use an index, and it returns two rows. Is this a bug?

Deterministic vs nondeterministic functions

Turned out it is not a bug at all. It is pretty logical behavior from MySQL, but it is not what we would expect. First, why a full table scan? Well, rand() is nondeterministic function. That means we do not know what it will return ahead of time, and actually that is exactly the purpose of rand() – to return a random value. In this case, it is only logical to evaluate the function for each row, each time, and compare the results. i.e. in our case

1. Read row 1, get the value of id, evaluate the value of RAND(), compare

1. Proceed using the same algorithm with the remaining rows.

In other words, as the value of rand() is not known (not evaluated) beforehand, so we can’t use an index.

And in this case – rand() function – we have another interesting consequence. For larger tables with an auto_increment primary key, the probability of matching the rand() value and the auto_increment value is higher, so we can get multiple rows back. In fact, if we read the whole table from the beginning and keep comparing the auto_inc sequence with “the roll of the dice”, we can get many rows back.

That behavior is totally counter-intuitive. Nevertheless, to me, it’s also the only correct behavior.

We expect to have the rand() function evaluated before running the query. This can actually be achieved by assigning rand() to a variable:

mysql> set @id=round(rand()*10000, 0); select @id; select * from sbtest1 where id = @id;
Query OK, 0 rows affected (0.00 sec)

+------+
| @id  |
+------+
| 6068 |
+------+
1 row in set (0.00 sec)

+------+--------+-------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------+
| id   | k      | c                                                                                                                       | pad                                                         |
+------+--------+-------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------+
| 6068 | 502782 | 84971025350-12845068791-61736600622-38249796467-85706778555-74134284808-24438972515-17848828748-86869270666-01547789681 | 17507194006-70651503059-23792945260-94159543806-65683812344 |
+------+--------+-------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------+
1 row in set (0.00 sec)

mysql> explain select * from sbtest1 where id = @id;
+----+-------------+---------+------------+-------+---------------+---------+---------+-------+------+----------+-------+
| id | select_type | table   | partitions | type  | possible_keys | key     | key_len | ref   | rows | filtered | Extra |
+----+-------------+---------+------------+-------+---------------+---------+---------+-------+------+----------+-------+
|  1 | SIMPLE      | sbtest1 | NULL       | const | PRIMARY       | PRIMARY | 4       | const |    1 |   100.00 | NULL  |
+----+-------------+---------+------------+-------+---------------+---------+---------+-------+------+----------+-------+
1 row in set, 1 warning (0.01 sec)

mysql> set @id=round(rand()*10000, 0); select @id; select * from sbtest1 where id = @id;

Query OK, 0 rows affected (0.00 sec)

+------+

| @id |

+------+

| 6068 |

+------+

1 row in set (0.00 sec)

| id | k | c | pad |

| 6068 | 502782 | 84971025350-12845068791-61736600622-38249796467-85706778555-74134284808-24438972515-17848828748-86869270666-01547789681 | 17507194006-70651503059-23792945260-94159543806-65683812344 |

1 row in set (0.00 sec)

mysql> explain select * from sbtest1 where id = @id;

+----+-------------+---------+------------+-------+---------------+---------+---------+-------+------+----------+-------+

+----+-------------+---------+------------+-------+---------------+---------+---------+-------+------+----------+-------+

+----+-------------+---------+------------+-------+---------------+---------+---------+-------+------+----------+-------+

1 row in set, 1 warning (0.01 sec)

This would meet our expectations.

There are (at least) two bug reports filed, with very interesting discussion:

1. rand() used in scalar functions returns multiple rows

1. SELECT on PK with ROUND(RAND()) give wrong errors

Other databases

I wanted to see how it works in other SQL databases. In PostgreSQL, the behavior is exactly the same as MySQL:

postgres=# select * from t2 where id = cast(random()*10000 as int);
  id  |    c
------+---------
 4093 | asdasda
 9378 | asdasda
(2 rows)

postgres=# select * from t2 where id = cast(random()*10000 as int);
  id  |    c
------+---------
 5988 | asdasda
 6674 | asdasda
(2 rows)

postgres=# explain select * from t2 where id = cast(random()*10000 as int);
                             QUERY PLAN
--------------------------------------------------------------------
 Seq Scan on t2  (cost=0.00..159837.60 rows=1 width=12)
   Filter: (id = ((random() * '10000'::double precision))::integer)
(2 rows)

postgres=# select * from t2 where id = cast(random()*10000 as int);

id | c

------+---------

4093 | asdasda

9378 | asdasda

(2 rows)

postgres=# select * from t2 where id = cast(random()*10000 as int);

id | c

------+---------

5988 | asdasda

6674 | asdasda

(2 rows)

postgres=# explain select * from t2 where id = cast(random()*10000 as int);

QUERY PLAN

--------------------------------------------------------------------

Seq Scan on t2 (cost=0.00..159837.60 rows=1 width=12)

Filter: (id = ((random() * '10000'::double precision))::integer)

(2 rows)

And SQLite seems different, evaluating the random() function beforehand:

sqlite> select * from t2 where id = cast(abs(CAST(random() AS REAL))/92233720368547 as int);
16239|asdsadasdsa
sqlite> select * from t2 where id = cast(abs(CAST(random() AS REAL))/92233720368547 as int);
32910|asdsadasdsa
sqlite> select * from t2 where id = cast(abs(CAST(random() AS REAL))/92233720368547 as int);
58658|asdsadasdsa
sqlite> explain select * from t2 where id = cast(abs(CAST(random() AS REAL))/92233720368547 as int);
addr  opcode         p1    p2    p3    p4             p5  comment
----  -------------  ----  ----  ----  -------------  --  -------------
0     Init           0     12    0                    00  Start at 12
1     OpenRead       0     30182  0     2              00  root=30182 iDb=0; t2
2     Function0      0     0     3     random(0)      00  r[3]=func(r[0])
3     Cast           3     69    0                    00  affinity(r[3])
4     Function0      0     3     2     abs(1)         01  r[2]=func(r[3])
5     Divide         4     2     1                    00  r[1]=r[2]/r[4]
6     Cast           1     68    0                    00  affinity(r[1])
7     SeekRowid      0     11    1                    00  intkey=r[1]; pk
8     Copy           1     5     0                    00  r[5]=r[1]
9     Column         0     1     6                    00  r[6]=t2.c
10    ResultRow      5     2     0                    00  output=r[5..6]
11    Halt           0     0     0                    00
12    Transaction    0     0     2     0              01  usesStmtJournal=0
13    Int64          0     4     0     92233720368547  00 r[4]=92233720368547
14    Goto           0     1     0                    00

sqlite> select * from t2 where id = cast(abs(CAST(random() AS REAL))/92233720368547 as int);

16239|asdsadasdsa

sqlite> select * from t2 where id = cast(abs(CAST(random() AS REAL))/92233720368547 as int);

32910|asdsadasdsa

sqlite> select * from t2 where id = cast(abs(CAST(random() AS REAL))/92233720368547 as int);

58658|asdsadasdsa

sqlite> explain select * from t2 where id = cast(abs(CAST(random() AS REAL))/92233720368547 as int);

addr opcode p1 p2 p3 p4 p5 comment

---- ------------- ---- ---- ---- ------------- -- -------------

0 Init 0 12 0 00 Start at 12

1 OpenRead 0 30182 0 2 00 root=30182 iDb=0; t2

2 Function0 0 0 3 random(0) 00 r[3]=func(r[0])

3 Cast 3 69 0 00 affinity(r[3])

4 Function0 0 3 2 abs(1) 01 r[2]=func(r[3])

5 Divide 4 2 1 00 r[1]=r[2]/r[4]

6 Cast 1 68 0 00 affinity(r[1])

7 SeekRowid 0 11 1 00 intkey=r[1]; pk

8 Copy 1 5 0 00 r[5]=r[1]

9 Column 0 1 6 00 r[6]=t2.c

10 ResultRow 5 2 0 00 output=r[5..6]

11 Halt 0 0 0 00

12 Transaction 0 0 2 0 01 usesStmtJournal=0

13 Int64 0 4 0 92233720368547 00 r[4]=92233720368547

14 Goto 0 1 0 00

Conclusion

Be careful when using MySQL nondeterministic functions in a “where” condition – rand() is the most interesting example – as their behavior may surprise you. Many people believe this to be a bug that should be fixed. Let me know in the comments: do you think it is a bug or not (and why)? I would also be interested to know how it works in other, non-opensource databases (Microsoft SQL Server, Oracle, etc)

PS: Finally, I’ve got a “clever” idea – what if I “trick” MySQL by using the deterministic keyword…

MySQL stored functions: deterministic vs not deterministic

So, I wanted to see how it works with MySQL stored functions if they are assigned “deterministic” and “not deterministic” keywords. First, I wanted to “trick” mysql and pass the deterministic to the stored function but use rand() inside. Ok, this is not what you really want to do!

DELIMITER $$

CREATE FUNCTION myrand() RETURNS INT
    DETERMINISTIC
BEGIN
 RETURN round(rand()*10000, 0);
END$$

DELIMITER ;

DELIMITER $$

CREATE FUNCTION myrand() RETURNS INT

DETERMINISTIC

BEGIN

RETURN round(rand()*10000, 0);

END$$

DELIMITER ;

From MySQL manual about MySQL stored routines we can read:

Assessment of the nature of a routine is based on the “honesty” of the creator: MySQL does not check that a routine declared DETERMINISTIC is free of statements that produce nondeterministic results. However, misdeclaring a routine might affect results or affect performance. Declaring a nondeterministic routine as DETERMINISTIC might lead to unexpected results by causing the optimizer to make incorrect execution plan choices. Declaring a deterministic routine as NONDETERMINISTIC might diminish performance by causing available optimizations not to be used.

The result is interesting:

mysql> select myrand();
+----------+
| myrand() |
+----------+
|     4202 |
+----------+
1 row in set (0.00 sec)

mysql> select myrand();
+----------+
| myrand() |
+----------+
|     7548 |
+----------+
1 row in set (0.00 sec)

mysql> explain select * from t2 where id = myrand()G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: NULL
   partitions: NULL
         type: NULL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: NULL
     filtered: NULL
        Extra: Impossible WHERE noticed after reading const tables
1 row in set, 1 warning (0.00 sec)

mysql> show warnings;
+-------+------+--------------------------------------------------------------------------------+
| Level | Code | Message                                                                        |
+-------+------+--------------------------------------------------------------------------------+
| Note  | 1003 | /* select#1 */ select '2745' AS `id`,'asasdas' AS `c` from `test`.`t2` where 0 |
+-------+------+--------------------------------------------------------------------------------+
1 row in set (0.00 sec)

mysql> select * from t2 where id = 4202;
+------+---------+
| id   | c       |
+------+---------+
| 4202 | asasdas |
+------+---------+
1 row in set (0.00 sec)

mysql> select * from t2 where id = 2745;
+------+---------+
| id   | c       |
+------+---------+
| 2745 | asasdas |
+------+---------+
1 row in set (0.00 sec)

mysql> select myrand();

+----------+

| myrand() |

+----------+

| 4202 |

+----------+

1 row in set (0.00 sec)

mysql> select myrand();

+----------+

| myrand() |

+----------+

| 7548 |

+----------+

1 row in set (0.00 sec)

mysql> explain select * from t2 where id = myrand()G

*************************** 1. row ***************************

id: 1

select_type: SIMPLE

table: NULL

partitions: NULL

type: NULL

possible_keys: NULL

key: NULL

key_len: NULL

ref: NULL

rows: NULL

filtered: NULL

Extra: Impossible WHERE noticed after reading const tables

1 row in set, 1 warning (0.00 sec)

mysql> show warnings;

+-------+------+--------------------------------------------------------------------------------+

| Level | Code | Message |

+-------+------+--------------------------------------------------------------------------------+

| Note | 1003 | /* select#1 */ select '2745' AS `id`,'asasdas' AS `c` from `test`.`t2` where 0 |

+-------+------+--------------------------------------------------------------------------------+

1 row in set (0.00 sec)

mysql> select * from t2 where id = 4202;

+------+---------+

| id | c |

+------+---------+

| 4202 | asasdas |

+------+---------+

1 row in set (0.00 sec)

mysql> select * from t2 where id = 2745;

+------+---------+

| id | c |

+------+---------+

| 2745 | asasdas |

+------+---------+

1 row in set (0.00 sec)

So MySQL optimizer detected the problem (somehow).

If I use the NOT DETERMINISTIC keyword, then MySQL works the same as when using the rand() function:

DELIMITER $$

CREATE FUNCTION myrand2() RETURNS INT
   NOT DETERMINISTIC
BEGIN
 RETURN round(rand()*10000, 0);
END$$

DELIMITER ;

mysql> explain select * from t2 where id = myrand2()G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: t2
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 262208
     filtered: 10.00
        Extra: Using where
1 row in set, 1 warning (0.00 sec)

DELIMITER $$

CREATE FUNCTION myrand2() RETURNS INT

NOT DETERMINISTIC

BEGIN

RETURN round(rand()*10000, 0);

END$$

DELIMITER ;

mysql> explain select * from t2 where id = myrand2()G

*************************** 1. row ***************************

id: 1

select_type: SIMPLE

table: t2

partitions: NULL

type: ALL

possible_keys: NULL

key: NULL

key_len: NULL

ref: NULL

rows: 262208

filtered: 10.00

Extra: Using where

1 row in set, 1 warning (0.00 sec)

—
Photo by dylan nolte on Unsplash

0 0 votes

Article Rating

6 Comments

Oldest

Newest Most Voted

Ernie Souhrada

7 years ago

I think this is absolutely a bug. Standard rules of algebra would say that we evaluate expressions from the inside out, so we should be able to do the RAND(), then the ROUND(), and be left with a statement equivalent to SELECT * FROM table WHERE id = – in which case we should get back 0 or 1 rows.

The RAND() function is nondeterministic, yes, but it is never going to return more than one value – as opposed to some nondeterministic function that might return an unknown number of rows – so why can the optimizer / parser not be smart enough to figure that out? It used to be the case (and maybe it still is) that if you tried to do an EXPLAIN on a SELECT that contained a subquery, MySQL would actually run the subquery before giving you the optimizer’s plan – this seems like a much less complicated variant of that scenario.

Joe

7 years ago

I say it’s not a bug. Imagine doing a simulation. Say you had a column containing the object’s (represented by the row) chance of success. Each time you run the query, you want back a random list of rows which “succeeded’ during the simulation. You’d write something like:

SELECT * FROM agents WHERE success_prob > rand()

If rand() evaluated before the the query executed (instead of row by row), then all of the rows that returned when rand() = .2 would also be returned when rand() = .1 (plus some additional rows). There would be no variation among which rows got selected… just among the quantity of rows selected .

The author correctly points out that a variable should be used to store a single random value before query exeution if you want a single random value to evaluate against all rows. But the WHERE clause evaluates for each row, so it makes sense that rand() should evaluate for each row separately if it is included in the WHERE clause.

Admin

Peter Zaitsev

7 years ago

You may also wish to check this HackerNews Discussion https://news.ycombinator.com/item?id=18617809

Øystein Grøvlen

7 years ago

AFAICT, the described behavior is according to the SQL standard. In theory, a WHERE clause should be evaluated for all rows. A database system may optimize this and not inspect every row if it has some other mechanism to identify which rows will satisfy the WHERE clause (e.g., an index). Also, functions calls that are constant (i.e., deterministic and does not depend on input from the current row) may be evaluated once. In this case, the function is non-deterministic, so it needs to be evaluated on every row.

Author

Alexander Rubin

7 years ago

Reply to Øystein Grøvlen

Yes, I agree here. What is interesting, it seems that MySQL and PostgreSQL have this (described) behavior, but SQLite and (it seems like) MS SQL server has different.

Author

Alexander Rubin

7 years ago

Reply to Øystein Grøvlen

Ernie, I think a next reply by Joe shows a good point, really non-deterministic function should be evaluated for each row. Altho it is not what we can expect