In this blog post, we will describe typical usage scenarios for dictionary operations in the Data Masking Component, which is available in Percona Server for MySQL as an open source alternative to Oracle’s enterprise version.
In particular, we will consider the following functions.
- gen_dictionary() – a function that returns a random term from a dictionary.
- gen_blocklist() – a function that replaces a term from one dictionary with a randomly selected term in another dictionary.
Generating random terms – gen_dictionary()
For instance, if you need to get a random word from a set of predefined values (a dictionary), you need to create this dictionary first. Just for simplicity, let’s create a dictionary called “colors” and fill it with the seven colors of the rainbow.
All the examples below assume that not only did you install the Data Masking Component but also created the ‘mysql.masking_dictionaries’ table, granted “MASKING_DICTIONARIES_ADMIN” privilege to your current MySQL user, and also granted required privileges to the “mysql.session@localhost” user as per instructions here and here.
1 2 3 4 5 6 7 |
SELECT masking_dictionary_term_add('colors', 'red'); SELECT masking_dictionary_term_add('colors', 'orange'); SELECT masking_dictionary_term_add('colors', 'yellow'); SELECT masking_dictionary_term_add('colors', 'green'); SELECT masking_dictionary_term_add('colors', 'blue'); SELECT masking_dictionary_term_add('colors', 'indigo'); SELECT masking_dictionary_term_add('colors', 'violet'); |
Now, if we want to get a random color, we can execute the following query:
1 |
SELECT gen_dictionary('colors') AS random_color; |
One of the possible outcomes could be as follows:
1 2 3 4 5 6 |
+--------------+ | random_color | +--------------+ | orange | +--------------+ 1 row in set (0.01 sec) |
Or, in a more sophisticated scenario, when we need ten random colors, we can execute:
1 |
SELECT gen_dictionary('colors') AS random_color FROM PERCONA_SEQUENCE_TABLE(10) AS tt; |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
+--------------+ | random_color | +--------------+ | violet | | violet | | green | | yellow | | blue | | violet | | red | | red | | violet | | yellow | +--------------+ 10 rows in set (0.01 sec) |
Replacing sensitive terms – gen_blocklist()
In the following set of examples, let’s create a table “tbl” and fill it with ids and random colors.
1 2 |
CREATE TABLE tbl AS SELECT tt.value + 1 AS id, gen_dictionary('colors') AS random_color FROM PERCONA_SEQUENCE_TABLE(10) AS tt; |
1 |
SELECT * FROM tbl; |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
+----+--------------+ | id | random_color | +----+--------------+ | 1 | indigo | | 2 | yellow | | 3 | red | | 4 | violet | | 5 | orange | | 6 | green | | 7 | red | | 8 | violet | | 9 | red | | 10 | violet | +----+--------------+ 10 rows in set (0.00 sec) |
Then, modify “random_color” in rows with an even “id” with its value spelled backwards so that we would have a mixture of original colors and some other values.
1 |
UPDATE tbl SET random_color = REVERSE(random_color) WHERE id % 2 = 0; |
1 |
SELECT * FROM tbl; |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
+----+--------------+ | id | random_color | +----+--------------+ | 1 | indigo | | 2 | wolley | | 3 | red | | 4 | teloiv | | 5 | orange | | 6 | neerg | | 7 | red | | 8 | teloiv | | 9 | red | | 10 | teloiv | +----+--------------+ 10 rows in set (0.00 sec) |
Now, imagine a scenario when color names become a really sensitive piece of information, and they should never be revealed to the public when we query data from our “tbl” table.
In order to achieve this, let’s create another dictionary called “masked_colors.”
1 2 3 |
SELECT masking_dictionary_term_add('masked_colors', 'colorA'); SELECT masking_dictionary_term_add('masked_colors', 'colorB'); SELECT masking_dictionary_term_add('masked_colors', 'colorC'); |
Please note that the number of unique “masked_colors” (3) differs from the number of original “colors” (7). In this example, we chose this approach on purpose.
Returning to our primary task of masking original colors, all we have to do now is use the “gen_blocklist()” function.
1 |
SELECT id, gen_blocklist(random_color, 'colors', 'masked_colors') AS masked_colors FROM tbl; |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
+----+---------------+ | id | masked_colors | +----+---------------+ | 1 | colorA | | 2 | wolley | | 3 | colorC | | 4 | teloiv | | 5 | colorC | | 6 | neerg | | 7 | colorC | | 8 | teloiv | | 9 | colorB | | 10 | teloiv | +----+---------------+ 10 rows in set (0.00 sec) |
What happened here is that each term that was found in the “colors” dictionary was masked (substituted) with a random term from the “masked_colors” dictionary. At the same time, the unknown values (those that we reversed before) were not modified. Please also notice that there is no direct, unambiguous mapping between the terms in these dictionaries. Random factors play a significant role in this process, and the output of this query could be different on every execution. In this particular run, the same color “red” on the rows with “id” 7 and 9 got masked to different values “colorC” and “colorB”. On the other hand, the values with different colors “red” on the row with “id” 3 and “orange” on the row with “id” 5 got transformed into the same masked value “colorC”.
Conclusion
Although the demonstrated examples are, by their nature, a bit artificial, and real-life scenarios can be much more complicated, what we have shown in this blog post could be a good starting point for developing custom Data Masking solutions for real production environments. Moreover, taking into account that these dictionary-based operations became noticeably faster in Percona Server for MySQL 8.0.41 / 8.4.4, these techniques can be applied to data sets of a much larger scale. Check the “Dictionary term cache in Data Masking Component” blog post for more details.