For testing purposes, especially if you’re working on a project that uses a database to store information, you may need data to try out your project. In that case, you have two options: find an existing dataset, or generate the data yourself.
Through this blog post, you will learn how to generate test data for MySQL using Faker.
Make sure all the dependencies are installed before creating the Python script that will generate the data for your project.
You can create a requirements.txt file with the following content:
```
pandas
sqlalchemy
PyMySQL
tqdm
faker
```
Once you have created this file, run the following command:
```shell
pip install -r requirements.txt
```
Or if you’re using Anaconda, create an environment.yml file:
```yaml
name: percona
dependencies:
  - python=3.10
  - pandas
  - sqlalchemy
  - PyMySQL
  - tqdm
  - faker
```
You can change the Python version; this script has been tested with Python 3.7, 3.8, 3.9, 3.10, and 3.11.
Run the following statement to configure the project environment:
```shell
conda env create -f environment.yml
```
Now that you have the dependencies installed, you must create a database named company.
Log into MySQL:
```shell
mysql -u root -p
```
Or log into MySQL using MySQL Shell:
```shell
mysqlsh root@localhost
```
Replace root with your username if necessary, and replace localhost with the IP address or hostname of your MySQL server instance if needed.
If using MySQL Shell, change to SQL mode:
```
\sql
```
and create the company database:
```sql
create database company;
```
Faker is a Python library that can be used to generate fake data through properties defined in the package.
```python
from faker import Faker

fake = Faker()
for _ in range(10):
    print(fake.name())
```
The above code prints ten names; each call to the name() method produces a new random value. name() is a property of the generator. Every property of this library is called a fake, and many of them are packaged in providers.
Some providers and properties available in the Faker library include: person (name(), first_name(), last_name()), address (address(), city(), country()), company (company()), job (job()), and internet (email()).
You can find more information on bundled and community providers in the documentation.
Now that you know Faker and its properties, create a modules directory and, inside it, a module named dataframe.py. This module will be imported later into the main script; it is where we define the function that generates the data.
```python
from multiprocessing import cpu_count
import pandas as pd
from tqdm import tqdm
from faker import Faker
```
Multiprocessing is implemented to optimize the script’s execution time, but this will be explained later. With the libraries imported, create the faker generator and compute the number of cores to use:
```python
fake = Faker()
num_cores = cpu_count() - 1
```
Faker() creates and initializes a faker generator, which can generate data by accessing the properties.
num_cores stores the number of available CPU cores, minus one, as returned by cpu_count().
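As a quick sketch of the arithmetic behind these two lines (the names TOTAL_RECORDS, rows_per_core, and generated are illustrative, not from the original script): note that int(60000/num_cores) silently drops a few rows whenever num_cores doesn’t divide 60,000 evenly.

```python
from multiprocessing import cpu_count

TOTAL_RECORDS = 60000  # illustrative constant; the script hardcodes 60000

# Leave one core free so the machine stays responsive
# (max() guards against single-core machines; the article's code has no guard).
num_cores = max(cpu_count() - 1, 1)

# Rows each worker will generate, as computed in create_dataframe()
rows_per_core = int(TOTAL_RECORDS / num_cores)

# If num_cores doesn't divide 60,000 evenly, a few rows are lost:
# e.g. 7 workers -> 7 * 8571 = 59,997 rows, not 60,000.
generated = rows_per_core * num_cores
print(num_cores, rows_per_core, generated)
```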
```python
def create_dataframe(arg):
    x = int(60000/num_cores)
    data = pd.DataFrame()
    for i in tqdm(range(x), desc='Creating DataFrame'):
        data.loc[i, 'first_name'] = fake.first_name()
        data.loc[i, 'last_name'] = fake.last_name()
        data.loc[i, 'job'] = fake.job()
        data.loc[i, 'company'] = fake.company()
        data.loc[i, 'address'] = fake.address()
        data.loc[i, 'city'] = fake.city()
        data.loc[i, 'country'] = fake.country()
        data.loc[i, 'email'] = fake.email()
    return data
```
Then we define the create_dataframe() function, where each call generates its share of the records (60000/num_cores rows), populating each row one by one with Faker properties.
The DataFrame that is created after calling this function will have the following columns:
```
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   first_name  60000 non-null  object
 1   last_name   60000 non-null  object
 2   job         60000 non-null  object
 3   company     60000 non-null  object
 4   address     60000 non-null  object
 5   country     60000 non-null  object
 6   city        60000 non-null  object
 7   email       60000 non-null  object
```
Before inserting the data previously generated with Faker, we need to establish a connection to the database; to do this, we will use the SQLAlchemy library.
SQLAlchemy is the Python SQL toolkit and Object Relational Mapper that gives application developers the full power and flexibility of SQL.
```python
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

engine = create_engine("mysql+pymysql://user:password@localhost/company")
Session = sessionmaker(bind=engine)
```
From SQLAlchemy, we import the create_engine() and sessionmaker() methods. The first one connects to the database, and the second creates a session factory bound to the engine object.
Don’t forget to replace user, password, and localhost with your own connection details. Save this code in the modules directory as base.py.
According to the documentation, SQLAlchemy uses the mysqlclient library by default, but there are others available, including PyMySQL.
```python
# default
engine = create_engine("mysql://scott:tiger@localhost/foo")

# mysqlclient (a maintained fork of MySQL-Python)
engine = create_engine("mysql+mysqldb://scott:tiger@localhost/foo")

# PyMySQL
engine = create_engine("mysql+pymysql://scott:tiger@localhost/foo")
```
According to the maintainer of both mysqlclient and PyMySQL, mysqlclient-python is much faster than PyMySQL, but PyMySQL is the better choice in some cases, for example when you can’t build or install C extensions, since PyMySQL is written in pure Python.
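Whichever driver you pick, the choice is encoded in the URL scheme as dialect+driver. As an illustration (using only the standard library, not SQLAlchemy itself), you can pull the parts out of such a URL:

```python
from urllib.parse import urlsplit

# SQLAlchemy URLs encode "dialect+driver" in the scheme part
url = urlsplit("mysql+pymysql://scott:tiger@localhost/foo")

dialect, _, driver = url.scheme.partition("+")
database = url.path.lstrip("/")
print(dialect, driver, url.hostname, database)  # mysql pymysql localhost foo
```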
The schema of the database can be created through the Schema Definition Language provided by SQLAlchemy, but since we’re only creating one table and importing the DataFrame with the pandas to_sql() method, this is not necessary.
When calling the to_sql() method, we specify the schema as follows:
```python
from sqlalchemy.types import String

schema = {
    "first_name": String(50),
    "last_name": String(50),
    "job": String(100),
    "company": String(100),
    "address": String(200),
    "city": String(100),
    "country": String(100),
    "email": String(50)
}
```
Then we pass the schema variable as a parameter to this method.
Save this code in the modules directory with the name schema.py.
Multiprocessing is a Python module that can be used to take advantage of the CPU cores available in the computer where the script is running. In Python, single-CPU use is caused by the global interpreter lock (GIL), which allows only one thread to hold control of the Python interpreter at any given time; for more information, see this blog post.
Imagine that you’re generating 60,000 records. Running the script on a single core will take longer than you might expect, since each record is generated one by one inside the loop. With multiprocessing, the work is divided across the cores: if your CPU has 16 cores, 15 of them will each generate 4,000 records, with one core left free to keep the computer responsive.
To better understand how to implement multiprocessing in Python, there are several good tutorials available online.
All the required modules are now ready to be imported into the main script, so it’s time to create the sql.py script. First, import the required libraries:
```python
from multiprocessing import Pool
from multiprocessing import cpu_count
import pandas as pd
```
From multiprocessing, we need Pool() and cpu_count(). The multiprocessing Pool class allows you to create and manage process pools in Python.
Then, import the modules previously created:
```python
from modules.dataframe import create_dataframe
from modules.schema import schema
from modules.base import Session, engine
```
Now we create the multiprocessing pool, configured to use all available CPU cores minus one. Each core calls the create_dataframe() function and creates a DataFrame with its share of the records, and once every call has finished, the resulting DataFrames are concatenated into a single one.
```python
if __name__ == "__main__":
    num_cores = cpu_count() - 1
    with Pool(num_cores) as pool:
        data = pd.concat(pool.map(create_dataframe, range(num_cores)))
    data.to_sql(name='employees', con=engine, if_exists='append', index=False, dtype=schema)
    with engine.connect() as conn:
        conn.execute("ALTER TABLE employees ADD id INT NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST;")
```
And finally, we will insert the DataFrame into MySQL by calling the to_sql() method. All the data will be stored in a table named employees.
By calling conn.execute(), a new column named id will be added to the table, set as the primary key, and placed at the beginning.
Run the following command to populate the table:
```shell
python sql.py
```

It takes just a few seconds to generate the DataFrame with the 60,000 records; that’s why multiprocessing was implemented.

Once the script finishes, you can check the data in the database.
```sql
use company;
select count(*) from employees;
```
The count() function returns the number of records in the employees table.
```
+----------+
| count(*) |
+----------+
|    60000 |
+----------+
1 row in set (0.22 sec)
```
The code shown in this blog post can be found on my GitHub account in the data-generator repository.