Predicting how long data load would takePeter Zaitsev
I had this question asked many times during last week, and there is really no easy answer. There are just way too many variables to consider especially while loading large schemas with a lot of different table structures. So which variables affect the load speed:
Table Structure This one is tricky. The shorter rows you have the faster load will normally be, if you measure rows, but if you measure in amount of loaded data longer rows are faster.
Indexes The more indexes you have the longer load would take. Details depend on storage engines, for example MyISAM can build non-unique indexes by sort which is very fast and Innodb can use insert buffer for them.
Data Insert Order It is well known it is important to load data in primary key order into Innodb tables for optimal performance, but effect does not end here. If indexes are built by inserts in Btree order in which entries are inserted in the index can affect performance dramatically.
Table Size This is the worst trouble maker. Typically load starts fast but as it progresses it may slow down dramatically, typically if Index BTREE has to be built by insertion and data does not fit in memory any more. I’ve seen cases when load speed starts at 30.000 rows/sec and then goes down to less than 100 rows/sec which can kill your initial estimates and can cause much longer downtime than planned.
Hardware Configuration If you need to load data quick do it on decent hardware. Especially memory size is important. If you need database on some small box it may be faster to load it on the more powerful box and copy it back in a binary form.
MySQL Configuration Different storage engines have different settings which need to be set for optimal load speed. Depending on load type MyISAM may benefit from bulk_insert_tree_size increase myisam_sort_buffer_size or key_buffer_size increase. Innodb typically needs large innodb_buffer_pool_size and large innodb_log_file_size to perform load effectively
Load Options There two main ways to load data in MySQL you can use Multiple value insert (standard mysqldump output) or LOAD DATA INFILE (–tab mysqldump output). Generally LOAD DATA can be optimized better and a bit faster because of easier parsing. Mysqldump however takes care of special options to make multi value inserts as fast as possible as well, such as disabling indexes before the load and enabling them back – helpful to make sure MyISAM build indexes via sorting not by BTREE insertion. For Innodb SET UNIQUE_CHECKS=0 may be used to speed up load to tables with unique keys.
So as you can see there are many variables which affect load speed which makes it extremely hard to predict. Especially fact load can slow down dramatically as amount of data loaded in the table increased and the fact this depends not only on table structure but the data itself (it defines the insertion order for the indexes) cause the challenges.
If you have some particular history about loading similar data of similar size on similar hardware use that as estimate. If not use your best guess based on what other cases were “similar”
As the data load goes I usually tend to do something like “du -sh; sleep 3600; du -sh” in the database data directory to see how much data is loaded per hour (assuming your Innodb tablespace did not have free space or you’re using innodb_file_per_table=1). Do not expect the value to be constant though. During our recent data load speed we for example could see data load speed of 60GB during one hour and 10GB during another depending on the tables which were loaded.
During our recent data load we loaded almost 1TB of Innodb data within 24 hours. It was decent speed because we had tables partitioned and so no table was more than 15GB in size (allowing to fit all BTREE indexes in memory) plus we loaded data in parallel and our rows were relatively long. In most cases I would expect load speed to be a bit slower, sometimes much slower.
I’ve seen cases when loading 100-200GB of data in single table was taking a week due to amount of indexes and the fact they poorly fit in memory.