Importing big tables with large indexes with Myloader MySQL toolDavid Ducos
Mydumper is known as the faster (much faster) mysqldump alternative. So, if you take a logical backup you will choose Mydumper instead of mysqldump. But what about the restore? Well, who needs to restore a logical backup? It takes ages! Even with Myloader. But this could change just a bit if we are able to take advantage of Fast Index Creation.
As you probably know, Mydumper and mysqldump export the struct of a table, with all the indexes and the constraints, and of course, the data. Then, Myloader and MySQL import the struct of the table and import the data. The most important difference is that you can configure Myloader to import the data using a certain amount of threads. The import steps are:
- Create the complete struct of the table
- Import the data
When you execute Myloader, internally it first creates the tables executing the “-schema.sql” files and then takes all the filenames without “schema.sql” and puts them in a task queue. Every thread takes a filename from the queue, which actually is a chunk of the table, and executes it. When finished it takes another chunk from the queue, but if the queue is empty it just ends.
This import procedure works fast for small tables, but with big tables with large indexes the inserts are getting slower caused by the overhead of insert the new values in secondary indexes. Another way to import the data is:
- Split the table structure into table creation with primary key, indexes creation and constraint creation
- Create tables with primary key
- Per table do:
- Load the data
- Create index
- Create constraints
This import procedure is implemented in a branch of Myloader that can be downloaded from here or directly executing bzr with the repository:
bzr branch lp:~david-ducos/mydumper/mydumper
The tool reads the schema files and splits them into three separate statements which create the tables with the primary key, the indexes and the constraints. The primary key is kept in the table creation in order to avoid the recreation of the table when a primary key is added and the “KEY” and “CONSTRAINT” lines are removed. These lines are added to the index and constraint statements, respectively.
It processes tables according to their size starting with the largest because creating the indexes of a big table could take hours and is single-threaded. While we cannot process other indexes at the time, we are potentially able to create other tables with the remaining threads.
It has a new thread (monitor_process) that decides which chunk of data will be put in the task queue and a communication queue which is used by the task processes to tell the monitor_process which chunk has been completed.
I run multiple imports on an AWS m1.xlarge machine with one table comparing Myloader and this branch and I found that with large indexes the times were:
As you can see, when you have less than 150M rows, import the data and then create the indexes is higher than import the table with the indexes all at once. But everything changes after 150M rows, import 200M takes 64 minutes more for Myloader but just 24 minutes for the new branch.
On a table of 200M rows with a integer primary key and 9 integer columns, you will see how the time increases as the index gets larger:
2-2-0: two 1-column and two 2-column index
2-2-1: two 1-column, two 2-column and one 3-column index
2-3-1: two 1-column, three 2-column and one 3-column index
2-3-2: two 1-column, three 2-column and two 3-column index
This branch can only import all the tables with this same strategy, but with this new logic in Myloader, in a future version it could be able to import each table with the best strategy reducing the time of the restore considerably.