Loading Air Traffic Control Data with TokuDB 4.1.1Rich.Prohaska
TokuDB has a big advantage over B-trees when trickle loading data into existing tables. However, it is possible to preprocess the data when bulk loading into empty tables or when new indexes are created. TokuDB release 4 now uses a parallel algorithm to speed up these types of bulk insertions. How does the parallel loader performance compare with the serial loader? We use the Air Traffic Control (ATC) data and queries described in a Percona blog and also used in an experiment with TokuDB 2.1.0 to gain some insight.
Our ATC data is about 122M rows in size, is stored in a 40GiB CSV file, and can be found in our Amazon S3 public bucket. See the end of this blog for details. We use a table schema with 8 indices to speed up the ATC queries. The loader inserts the data into the primary fractal tree and one fractal tree for each of the 8 keys.
The load was run on two (old) machines:
- 2 core: Intel Core 2 Duo E8500 3.16GHz, 5GiB RAM, 1 disk SATA 1TB.
- 8 core: 2 socket quad core Xeon X5460 3.16GHz, 16GiB RAM, 6 disk SAS 1TB RAID0.
- We use “LOAD DATA INFILE” to load the CSV file into a MySQL table. The CSV file can be found in our Amazon S3 public bucket.
- We see over 4x load time speed up for this database. This compares favorably with the other loaders. The next steps are to run experiments on machines with a larger number of cores, identify bottlenecks, and remove them.
- BTW, the ATC load times into MyISAM are roughly the same as seen with the TokuDB 2.1.0 serial loader. Perhaps loads into MyISAM can be made faster with some twiddling of the MySQL system variables.
|Load Times for the ATC Database|
|TokuDB 2.1.0 and MySQL 5.1.36||TokuDB 4.1.1 and MySQL 5.1.46||TokuDB Speedup|
TokuDB data sizes (including indices):
- 6.7 GiB, or almost 6 times smaller than the CSV source file. The parallel loader uses the same compression parameters as the serial loader, so no change in size occurred.
- TokuDB 4 uses the standard MySQL handler row iterator, so the query times are roughly unchanged. In addition, the query handling is not yet parallelized, so there is not much difference in query times on the 2 core and 8 core machines.
- BTW, the query times for MyISAM are slightly longer than TokuDB.
- The queries can be found in our Amazon S3 public bucket.
|Query Times for the ATC Database|
|TokuDB 2.1.0||TokuDB 4.1.1|
- Loader scalability on systems with larger number of cores. Machines with 48 cores (4 socket, 12 cores per socket) are now reasonably priced.
- How we implemented the parallel loader: algorithms and parallel runtime.
The ATC CSV data files, the schema, and the queries can be retrieved from our public Amazon S3 bucket called
tokutek-pub. Here are the Amazon S3 keys: