For a long time, MyDumper has been the fastest tool for taking logical backups. We have been adding several features to expand its use cases. Masquerade was one of these features, but it only supported integer and UUID values. In this blog post, I’m going to present a new functionality that is already available in MyDumper and will be included in the next release: the possibility to build random data based on a format that the user defines.
During export, mydumper sends SELECT statements to the database. Each row is written one by one as an INSERT statement. Something important that you might not know is that each column of a row can be transformed by a function. When you execute a backup, the default is the identity function, as nothing needs to be changed. The function, which can be configured inside the defaults file, changes the content of the column before writing the row to disk.
I think that the most valuable element of this feature is how simple it is to define which column will be modified and how you want to mask it. The format is:
```
[`schema_name`.`table_name`]
`column1`=random_int
`column2`=random_string
```
In the section name, you add the schema and table names surrounded by backticks and separated by a dot. Then, in each key-value entry, the key is the column name surrounded by backticks, and the value is the masking function definition.
Having string, integer, and UUID masking is nice, but what about building dynamic data with a specific format? As we want more realistic data, we need to dynamically build worldwide addresses, phone numbers, emails, etc. The new function has this syntax:
```
random_format { <{file|string n|number n}> | DELIMITER | 'CONSTANT' }*
```
These are some examples:
```
`phone`=random_format '+1 ('<number 3>') '<number 3>'-'<number 4>
`emails`=random_format <file names.txt>'.'<file surnames.txt>'@'<file domains.txt>
`addresses`=random_format <number 3>' '<file streets.txt>', '<file cities.txt>', '<file states_and_zip.txt>', USA'
```
You should expect some performance degradation if you compare masquerade backups with regular backups. It is impossible to measure the impact in general, as it will depend on the amount of data that needs to be masked. However, I will try to give you an idea through an example on a sysbench table with 10M rows.
We are going to split by rows and compress with ZSTD:
```
# rm -rf data/; time ./mydumper -o data -B test --defaults-file=mydumper.cnf -r 100000 -c
real    0m19.964s
user    0m48.396s
sys     0m7.885s
```
It took nearly 20 seconds to complete, and here is an example of the output:
```
# zstdcat data/test.sbtest1.00000.sql.zst | grep INSERT -A10 | head
INSERT INTO `sbtest1` VALUES(1,4992833,"83868641912-28773972837-60736120486-75162659906-27563526494-20381887404-41576422241-93426793964-56405065102-33518432330","67847967377-48000963322-62604785301-91415491898-96926520291")
,(2,5019684,"38014276128-25250245652-62722561801-27818678124-24890218270-18312424692-92565570600-36243745486-21199862476-38576014630","23183251411-36241541236-31706421314-92007079971-60663066966")
```
We are going to use random_int over the k column, which in the configuration will be:
```
[`test`.`sbtest1`]
`k`=random_int
```
The backup took 20.7 seconds, an increase of 4%:
```
# rm -rf data/; time ./mydumper -o data -B test --defaults-file=mydumper-k.cnf -r 100000 -c
real    0m20.709s
user    0m46.056s
sys     0m11.247s
```
And as you can see, the data in the second column has changed:
```
# zstdcat data/test.sbtest1.00000.sql.zst | grep INSERT -A10 | head
INSERT INTO `sbtest1` VALUES(1,1527173,"83868641912-28773972837-60736120486-75162659906-27563526494-20381887404-41576422241-93426793964-56405065102-33518432330","67847967377-48000963322-62604785301-91415491898-96926520291")
,(2,3875126,"38014276128-25250245652-62722561801-27818678124-24890218270-18312424692-92565570600-36243745486-21199862476-38576014630","23183251411-36241541236-31706421314-92007079971-60663066966")
```
Now, we are going to use the last column (pad) with the number tag and 11 digits to simulate the values:
```
`pad`=random_format <number 11>-<number 11>-<number 11>-<number 11>-<number 11>
```
We can see that it took 36.6 seconds to complete, and the values in the last column have changed:
```
# rm -rf data/; time ./mydumper -o data -B test --defaults-file=mydumper-pad-long.cnf -r 100000 -c
real    0m36.667s
user    1m3.785s
sys     0m32.757s
# zstdcat data/test.sbtest1.00000.sql.zst | grep INSERT -A10 | head
INSERT INTO `sbtest1` VALUES(1,4992833,"83868641912-28773972837-60736120486-75162659906-27563526494-20381887404-41576422241-93426793964-56405065102-33518432330","32720009027-12540600353-41008809903-18811191622-46944507919")
,(2,5019684,"38014276128-25250245652-62722561801-27818678124-24890218270-18312424692-92565570600-36243745486-21199862476-38576014630","14761241271-79422723442-42242331639-12424460062-25625932261")
```
Take into consideration that 11 digits forced us to execute g_random_int twice per value, as a single 32-bit result covers at most nine full decimal digits. This means that if we have:
```
`pad`=random_format <number 9>-<number 9>-<number 9>-<number 9>-<number 9>
```
It will take 29 seconds.
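The reason is that GLib's g_random_int() returns a 32-bit unsigned integer (up to 4,294,967,295), which can uniformly cover at most nine decimal digits per call, so an 11-digit field needs two draws. A minimal Python sketch of that cost model (illustrative only, not MyDumper's actual code):

```python
import random

def random_digits(n):
    """Build an n-digit masked value from 32-bit random draws.

    Each draw mimics g_random_int(): a 32-bit value yields at most
    nine full decimal digits uniformly, so longer fields need
    additional draws. Returns the digits and the number of draws.
    """
    digits, draws = '', 0
    while len(digits) < n:
        digits += '%09d' % (random.getrandbits(32) % 10**9)
        draws += 1
    return digits[:n], draws

value, calls = random_digits(11)  # <number 11>: two draws per value
```

This is why `<number 9>` fields are noticeably cheaper than `<number 11>` fields: one draw per value instead of two.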
If we want to use words taken from a file instead, the configuration will be:
```
`pad`=random_format <file words_alpha.txt.100>-<file words_alpha.txt.100>-<file words_alpha.txt.100>-<file words_alpha.txt.100>-<file words_alpha.txt.100>
```
And it will take 34 seconds:
```
# rm -rf data/; time ./mydumper -o data -B test --defaults-file=mydumper-simple-pad.cnf -r 100000 -c
real    0m34.224s
user    0m56.702s
sys     0m29.474s
# zstdcat data/test.sbtest1.00000.sql.zst | grep INSERT -A10 | head
INSERT INTO `sbtest1` VALUES(1,4992833,"83868641912-28773972837-60736120486-75162659906-27563526494-20381887404-41576422241-93426793964-56405065102-33518432330","aam-abacot-abalienated-abandonedly-ab")
,(2,5019684,"38014276128-25250245652-62722561801-27818678124-24890218270-18312424692-92565570600-36243745486-21199862476-38576014630","aardwolves-abaised-abandoners-aaronitic-abacterial")
```
This feature is not fully tested in MyDumper yet; you should consider it Beta. However, I found it relevant to show the potential that it might have for the community.
Never has it been as easy to build a new masquerade environment as it is now with MyDumper.