Loading Air Traffic Control Data with TokuDB 4.1.1

TokuDB has a big advantage over B-trees when trickle loading data into existing tables. However, it is possible to preprocess the data when bulk loading into empty tables or when new indexes are created. TokuDB release 4 now uses a parallel algorithm to speed up these types of bulk insertions. How does the parallel loader performance compare with the serial loader? We use the Air Traffic Control (ATC) data and queries described in a Percona blog and also used in an experiment with TokuDB 2.1.0 to gain some insight.

Our ATC data is about 122M rows in size, is stored in a 40GiB CSV file, and can be found in our Amazon S3 public bucket. See the end of this blog for details. We use a table schema with 8 indices to speed up the ATC queries. The loader inserts the data into the primary fractal tree and one fractal tree for each of the 8 keys.

The load was run on two (old) machines:

  • 2 core: Intel Core 2 Duo E8500 3.16GHz, 5GiB RAM, 1 disk SATA 1TB.
  • 8 core: 2 socket quad core Xeon X5460 3.16GHz, 16GiB RAM, 6 disk SAS 1TB RAID0.

Load times:

    Load Times for the ATC Database
    TokuDB 2.1.0 and MySQL 5.1.36TokuDB 4.1.1 and MySQL 5.1.46TokuDB Speedup
    2 core10,974s5,201s2.1x
    8 core11,286s2,655s4.2x
  • We use “LOAD DATA INFILE” to load the CSV file into a MySQL table. The CSV file can be found in our Amazon S3 public bucket.
  • We see over 4x load time speed up for this database. This compares favorably with the other loaders. The next steps are to run experiments on machines with a larger number of cores, identify bottlenecks, and remove them.
  • BTW, the ATC load times into MyISAM are roughly the same as seen with the TokuDB 2.1.0 serial loader. Perhaps loads into MyISAM can be made faster with some twiddling of the MySQL system variables.

TokuDB data sizes (including indices):

  • 6.7 GiB, or almost 6 times smaller than the CSV source file. The parallel loader uses the same compression parameters as the serial loader, so no change in size occurred.

Query times:

    Query Times for the ATC Database
    TokuDB 2.1.0TokuDB 4.1.1
  • TokuDB 4 uses the standard MySQL handler row iterator, so the query times are roughly unchanged. In addition, the query handling is not yet parallelized, so there is not much difference in query times on the 2 core and 8 core machines.
  • BTW, the query times for MyISAM are slightly longer than TokuDB.
  • The queries can be found in our Amazon S3 public bucket.

Future blogs:

  • Loader scalability on systems with larger number of cores. Machines with 48 cores (4 socket, 12 cores per socket) are now reasonably priced.
  • How we implemented the parallel loader: algorithms and parallel runtime.

The ATC CSV data files, the schema, and the queries can be retrieved from our public Amazon S3 bucket called
tokutek-pub.  Here are the Amazon S3 keys:

  • atc_On_Time_Performance.mysql.csv.gz.aa
  • atc_On_Time_Performance.mysql.csv.gz.ab
  • atc_On_Time_Performance.mysql.csv.gz.ac
  • atc_On_Time_Performance.mysql.csv.gz.ad
  • atc_On_Time_Performance.mysql.csv.gz.xml
  • atc_ontime_create_covered.sql
  • atc_ontime_create_covered.sql.xml
  • atc_ontime_queries.tar.gz
  • atc_ontime_queries.tar.gz.xml
  • s3get
  • s3get.xml

Share this post

Comments (7)

  • Jeremy Cole Reply


    Just a minor nit-pick: This is not ATC data, it is from the Bureau of Transportation Statistics (BTS) and it is self-reported by commercial airlines over a certain size. It has nothing to do with Air Traffic Control at all, it is designed only for analyzing the performance of air traffic as a transportation mechanism.



    August 27, 2010 at 2:10 pm
  • Justin Swanhart Reply

    Does the loader produce binary logs?

    August 27, 2010 at 3:31 pm
  • Rich Prohaska Reply

    The tokudb loader operates under the storage engine handler’s start_bulk_insert, write_row, and end_bulk_insert methods. For index creation, the loader operates under the add_index method. Because of this, MySQL is able to do the binary logging. The tokudb loader is part of our storage engine, not a separate program.

    August 28, 2010 at 8:16 am
  • hadar Reply

    You said the parallel loader is good for ” bulk loading into empty tables “,
    what happens when I do bulk loading using “load data infile into” for an existing table with records? the parallel loader will not work?

    September 15, 2010 at 12:43 pm
  • zardosht Reply

    @hadar, I wrote a post on your question: http://tokutek.com/2010/09/scenarios-where-tokudbs-loader-is-used/. If you use “load data infile” on a table with records already in it, then the “bulk load into an empty table” algorithm is not used, and normal insertions are done. Keep in mind that these normal insertions can be up to two orders of magnitude faster than insertions into a B-tree based storage engine.

    September 17, 2010 at 8:07 am
  • Martin Farach-Colton Reply

    Fractal trees are both relatively new and proprietary to Tokutek.

    November 17, 2010 at 11:30 pm

Leave a Reply