EmergencyEMERGENCY? Get 24/7 Help Now!

Loading Air Traffic Control Data with TokuDB 4.1.1


Posted on:

|

By:


PREVIOUS POST
NEXT POST
Share Button

TokuDB has a big advantage over B-trees when trickle loading data into existing tables. However, it is possible to preprocess the data when bulk loading into empty tables or when new indexes are created. TokuDB release 4 now uses a parallel algorithm to speed up these types of bulk insertions. How does the parallel loader performance compare with the serial loader? We use the Air Traffic Control (ATC) data and queries described in a Percona blog and also used in an experiment with TokuDB 2.1.0 to gain some insight.

Our ATC data is about 122M rows in size, is stored in a 40GiB CSV file, and can be found in our Amazon S3 public bucket. See the end of this blog for details. We use a table schema with 8 indices to speed up the ATC queries. The loader inserts the data into the primary fractal tree and one fractal tree for each of the 8 keys.

The load was run on two (old) machines:

  • 2 core: Intel Core 2 Duo E8500 3.16GHz, 5GiB RAM, 1 disk SATA 1TB.
  • 8 core: 2 socket quad core Xeon X5460 3.16GHz, 16GiB RAM, 6 disk SAS 1TB RAID0.

Load times:

    Load Times for the ATC Database
    TokuDB 2.1.0 and MySQL 5.1.36 TokuDB 4.1.1 and MySQL 5.1.46 TokuDB Speedup
    2 core 10,974s 5,201s 2.1x
    8 core 11,286s 2,655s 4.2x
  • We use “LOAD DATA INFILE” to load the CSV file into a MySQL table. The CSV file can be found in our Amazon S3 public bucket.
  • We see over 4x load time speed up for this database. This compares favorably with the other loaders. The next steps are to run experiments on machines with a larger number of cores, identify bottlenecks, and remove them.
  • BTW, the ATC load times into MyISAM are roughly the same as seen with the TokuDB 2.1.0 serial loader. Perhaps loads into MyISAM can be made faster with some twiddling of the MySQL system variables.

TokuDB data sizes (including indices):

  • 6.7 GiB, or almost 6 times smaller than the CSV source file. The parallel loader uses the same compression parameters as the serial loader, so no change in size occurred.

Query times:

    Query Times for the ATC Database
    TokuDB 2.1.0 TokuDB 4.1.1
    Q0 22s 28s
    Q1 67s 33s
    Q2 20s 24s
    Q3 54s 29s
    Q4 2s 1s
    Q5 11s 7s
    Q6 35s 29s
    Q7 32s 39s
    Q8.1y 7s 4s
    Q8.2y 42s 32s
    Q8.3y 37s 33s
    Q8.4y 38s 35s
    Q8.10y 139s 56s
    Q9 36s 35s
  • TokuDB 4 uses the standard MySQL handler row iterator, so the query times are roughly unchanged. In addition, the query handling is not yet parallelized, so there is not much difference in query times on the 2 core and 8 core machines.
  • BTW, the query times for MyISAM are slightly longer than TokuDB.
  • The queries can be found in our Amazon S3 public bucket.

Future blogs:

  • Loader scalability on systems with larger number of cores. Machines with 48 cores (4 socket, 12 cores per socket) are now reasonably priced.
  • How we implemented the parallel loader: algorithms and parallel runtime.

The ATC CSV data files, the schema, and the queries can be retrieved from our public Amazon S3 bucket called
tokutek-pub.  Here are the Amazon S3 keys:

  • atc_On_Time_Performance.mysql.csv.gz.aa
  • atc_On_Time_Performance.mysql.csv.gz.ab
  • atc_On_Time_Performance.mysql.csv.gz.ac
  • atc_On_Time_Performance.mysql.csv.gz.ad
  • atc_On_Time_Performance.mysql.csv.gz.xml
  • atc_ontime_create_covered.sql
  • atc_ontime_create_covered.sql.xml
  • atc_ontime_queries.tar.gz
  • atc_ontime_queries.tar.gz.xml
  • s3get
  • s3get.xml
Share Button
PREVIOUS POST
NEXT POST


Tags:

, , ,

Categories:
Tokutek, TokuView


Comments

Leave a Reply

Percona’s widely read Percona Data Performance blog highlights our expertise in enterprise-class software, support, consulting and managed services solutions for both MySQL® and MongoDB® across traditional and cloud-based platforms. The decades of experience represented by our consultants is found daily in numerous and relevant blog posts.

Besides specific database help, the blog also provides notices on upcoming events and webinars.
Want to get weekly updates listing the latest blog posts? Subscribe to our blog now! Submit your email address below and we’ll send you an update every Friday at 1pm ET.

No, thank you. Please do not ask me again.