For a long time, MyDumper has been the fastest tool for taking logical backups. We have been adding several features to expand its use cases. Masquerade was one of these features, but it only supported integer and UUID values. In this blog post, I’m going to present a new functionality that is already available in MyDumper and will be included in the next release: the possibility to build random data based on a format that the user defines.
During export, mydumper sends SELECT statements to the database. Each row is written one by one as an INSERT statement. Something important that you might not know is that each column of a row can be transformed by a function. When you execute a backup, the default is the identity function, as nothing needs to be changed. The function, which can be configured inside the defaults file, changes the content of the column before writing the row to disk.
I think that the most valuable element of this feature is how simple it is to define which column will be modified and how you want to mask it. The format is:
```
[`schema_name`.`table_name`]
`column1`=random_int
`column2`=random_string
```
In the section name, you add the schema and table names surrounded by backticks and separated by a dot. Then, in each key-value entry, the key is the column name surrounded by backticks, and the value is the masking function definition.
Having string, integer, and UUID masking is nice, but what about building dynamic data with a specific format? As we want more realistic data, we need to dynamically build worldwide addresses, phone numbers, emails, etc. The new function has this syntax:
```
random_format { <{file|string n|number n}> | DELIMITER | 'CONSTANT' }*
```
These are some examples:
```
`phone`=random_format '+1 ('<number 3>') '<number 3>'-'<number 4>
`emails`=random_format <file names.txt>'.'<file surnames.txt>'@'<file domains.txt>
`addresses`=random_format <number 3>' '<file streets.txt>', '<file cities.txt>', '<file states_and_zip.txt>', USA'
```
You should expect some performance degradation if you compare masquerade backups with regular backups. It is impossible to measure the impact in general, as it will depend on the amount of data that needs to be masked. However, I will try to give you an idea through an example on a sysbench table with 10M rows.
We are going to split by rows and compress with ZSTD:
```
# rm -rf data/; time ./mydumper -o data -B test --defaults-file=mydumper.cnf -r 100000 -c
real    0m19.964s
user    0m48.396s
sys     0m7.885s
```
It took nearly 20 seconds to complete, and here is an example of the output:
```
# zstdcat data/test.sbtest1.00000.sql.zst | grep INSERT -A10 | head
INSERT INTO `sbtest1` VALUES(1,4992833,"83868641912-28773972837-60736120486-75162659906-27563526494-20381887404-41576422241-93426793964-56405065102-33518432330","67847967377-48000963322-62604785301-91415491898-96926520291")
,(2,5019684,"38014276128-25250245652-62722561801-27818678124-24890218270-18312424692-92565570600-36243745486-21199862476-38576014630","23183251411-36241541236-31706421314-92007079971-60663066966")
```
We are going to use random_int over the k column, which in the configuration will be:
```
[`test`.`sbtest1`]
`k`=random_int
```
The backup took 20.7 seconds, an increase of 4%:
```
# rm -rf data/; time ./mydumper -o data -B test --defaults-file=mydumper-k.cnf -r 100000 -c
real    0m20.709s
user    0m46.056s
sys     0m11.247s
```
And as you can see, the data in the second column has changed:
```
# zstdcat data/test.sbtest1.00000.sql.zst | grep INSERT -A10 | head
INSERT INTO `sbtest1` VALUES(1,1527173,"83868641912-28773972837-60736120486-75162659906-27563526494-20381887404-41576422241-93426793964-56405065102-33518432330","67847967377-48000963322-62604785301-91415491898-96926520291")
,(2,3875126,"38014276128-25250245652-62722561801-27818678124-24890218270-18312424692-92565570600-36243745486-21199862476-38576014630","23183251411-36241541236-31706421314-92007079971-60663066966")
```
Now, we are going to use the last column (pad) with the number tag and 11 digits to simulate the values:
```
`pad`=random_format <number 11>-<number 11>-<number 11>-<number 11>-<number 11>
```
We can see that it took 36.6 seconds to complete, and the values in the last column have changed:
```
# rm -rf data/; time ./mydumper -o data -B test --defaults-file=mydumper-pad-long.cnf -r 100000 -c
real    0m36.667s
user    1m3.785s
sys     0m32.757s
# zstdcat data/test.sbtest1.00000.sql.zst | grep INSERT -A10 | head
INSERT INTO `sbtest1` VALUES(1,4992833,"83868641912-28773972837-60736120486-75162659906-27563526494-20381887404-41576422241-93426793964-56405065102-33518432330","32720009027-12540600353-41008809903-18811191622-46944507919")
,(2,5019684,"38014276128-25250245652-62722561801-27818678124-24890218270-18312424692-92565570600-36243745486-21199862476-38576014630","14761241271-79422723442-42242331639-12424460062-25625932261")
```
Take into consideration that 11 digits forced us to execute g_random_int twice per value, as a single 32-bit result covers at most nine full decimal digits. This means that if we have:
```
`pad`=random_format <number 9>-<number 9>-<number 9>-<number 9>-<number 9>
```
It will take 29 seconds.
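The reason is that GLib's g_random_int() returns a 32-bit unsigned integer (up to 4,294,967,295), which can uniformly cover at most nine decimal digits per call, so an 11-digit field needs two draws. A minimal Python sketch of that cost model (illustrative only, not MyDumper's actual code):

```python
import random

def random_digits(n):
    """Build an n-digit masked value from 32-bit random draws.

    Each draw mimics g_random_int(): a 32-bit value yields at most
    nine full decimal digits uniformly, so longer fields need
    additional draws. Returns the digits and the number of draws.
    """
    digits, draws = '', 0
    while len(digits) < n:
        digits += '%09d' % (random.getrandbits(32) % 10**9)
        draws += 1
    return digits[:n], draws

value, calls = random_digits(11)  # <number 11>: two draws per value
```

This is why `<number 9>` fields are noticeably cheaper than `<number 11>` fields: one draw per value instead of two.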
If we want to use words taken from a file instead, the configuration will be:
```
`pad`=random_format <file words_alpha.txt.100>-<file words_alpha.txt.100>-<file words_alpha.txt.100>-<file words_alpha.txt.100>-<file words_alpha.txt.100>
```
And it will take 34 seconds:
```
# rm -rf data/; time ./mydumper -o data -B test --defaults-file=mydumper-simple-pad.cnf -r 100000 -c
real    0m34.224s
user    0m56.702s
sys     0m29.474s
# zstdcat data/test.sbtest1.00000.sql.zst | grep INSERT -A10 | head
INSERT INTO `sbtest1` VALUES(1,4992833,"83868641912-28773972837-60736120486-75162659906-27563526494-20381887404-41576422241-93426793964-56405065102-33518432330","aam-abacot-abalienated-abandonedly-ab")
,(2,5019684,"38014276128-25250245652-62722561801-27818678124-24890218270-18312424692-92565570600-36243745486-21199862476-38576014630","aardwolves-abaised-abandoners-aaronitic-abacterial")
```
This feature is not fully tested in MyDumper yet; you should consider it Beta. However, I found it relevant to show the potential that it might have for the community.
Never has it been as easy to build a new masquerade environment as it is now with MyDumper.