Air traffic queries in InfiniDB: early alpha

November 2, 2009

Author

Vadim Tkachenko

Benchmarks

MySQL

As Calpont announced availability of InfiniDB I surely couldn’t miss a chance to compare it with previously tested databases in the same environment.
See my previous posts on this topic:
Analyzing air traffic performance with InfoBright and MonetDB
Air traffic queries in LucidDB

I could not run all queries against InfiniDB and I hit some hiccups during my experiment, so it was a less smooth experience than with the other databases.

So let’s go by the same steps:

Load data

InfiniDB supports MySQL’s LOAD DATA statement and its own colxml / cpimport utilities. As LOAD DATA is more familiar to me, I started with that. However, soon after I issued LOAD DATA on a 180MB file (for the 1st month of 1989), it caused extensive swapping (my box has 4GB of RAM) and the statement failed with

ERROR 1 (HY000) at line 1: CAL0001: Insert Failed:  St9bad_alloc

Alright, colxml / cpimport was more successful. However, it has less flexible syntax than LOAD DATA, so I had to transform the input files into a format that cpimport could understand.
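A minimal Python sketch of that kind of transformation, assuming the source files are quoted CSV and the loader wants unquoted fields with a single-character separator. The pipe separator, the function name `to_pipe_delimited` and the sample row are assumptions for illustration, not the script actually used for this test.

```python
import csv
import io


def to_pipe_delimited(src, dst, delimiter="|"):
    """Rewrite quoted CSV from `src` as unquoted, delimiter-separated text in `dst`.

    Returns the number of rows written.
    """
    reader = csv.reader(src)  # handles "quoted, fields" and doubled quotes
    rows = 0
    for record in reader:
        cleaned = []
        for field in record:
            # Without quoting, the delimiter and line breaks cannot appear
            # inside a value, so replace them with a space (a lossy choice).
            for ch in (delimiter, "\n", "\r"):
                field = field.replace(ch, " ")
            cleaned.append(field)
        dst.write(delimiter.join(cleaned) + "\n")
        rows += 1
    return rows


if __name__ == "__main__":
    # Hypothetical sample row, not taken from the real data files.
    sample = io.StringIO('1989,1,"SFO","San Francisco, CA",""\n')
    out = io.StringIO()
    to_pipe_delimited(sample, out)
    print(out.getvalue(), end="")
```

Empty quoted fields come out as empty fields between two separators. How the loader interprets those (NULL or empty string) is not something this sketch decides.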

Total load time was 9747 sec or 2.7h (not counting time spent on files transformation)

I put summary data on load time, datasize and query time into a Google Spreadsheet so you can easily compare with previous results. There are different sheets for queries, datasize and time of load.

Datasize

Size of the database after loading is another confusing point. The InfiniDB data directory has a complex structure like

./000.dir/000.dir/003.dir/233.dir
./000.dir/000.dir/003.dir/233.dir/000.dir
./000.dir/000.dir/003.dir/233.dir/000.dir/FILE000.cdf
./000.dir/000.dir/003.dir/241.dir
./000.dir/000.dir/003.dir/241.dir/000.dir
./000.dir/000.dir/003.dir/241.dir/000.dir/FILE000.cdf
./000.dir/000.dir/003.dir/238.dir
./000.dir/000.dir/003.dir/238.dir/000.dir
./000.dir/000.dir/003.dir/238.dir/000.dir/FILE000.cdf
./000.dir/000.dir/003.dir/235.dir
./000.dir/000.dir/003.dir/235.dir/000.dir
./000.dir/000.dir/003.dir/235.dir/000.dir/FILE000.cdf

so it’s hard to say which files are related to which table. But after the load, the size of 000.dir is 114G, which is twice as big as the original data files. SHOW TABLE STATUS does not really help here; it shows

           Name: ontime
         Engine: InfiniDB
        Version: 10
     Row_format: Dynamic
           Rows: 2000
 Avg_row_length: 0
    Data_length: 0
Max_data_length: 0
   Index_length: 0
      Data_free: 0
 Auto_increment: NULL
    Create_time: NULL
    Update_time: NULL
     Check_time: NULL
      Collation: latin1_swedish_ci
       Checksum: NULL
 Create_options: 
        Comment:

with totally misleading information.

So I put 114GB as the size of data after load, until someone points out to me how to get the real size and also explains what takes so much space.
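Since SHOW TABLE STATUS gives nothing usable, the number has to come from the file system. Below is a small Python sketch of that measurement: it walks a directory tree and sums file sizes, with a per-subdirectory breakdown. The function names are illustrative, and it sums apparent file sizes, which can differ from what `du` reports when files are preallocated or sparse.

```python
import os


def tree_size_bytes(root):
    """Sum the apparent sizes of all regular files below `root`."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):  # do not follow or count symlinks
                total += os.path.getsize(path)
    return total


def size_by_subdir(root):
    """Map each immediate subdirectory of `root` to its total size in bytes."""
    sizes = {}
    for entry in os.scandir(root):
        if entry.is_dir(follow_symlinks=False):
            sizes[entry.name] = tree_size_bytes(entry.path)
    return sizes


def human(nbytes):
    """Format a byte count in GiB with two decimals."""
    return f"{nbytes / 1024**3:.2f} GiB"
```

Calling `human(tree_size_bytes("000.dir"))` from inside the data directory would give the total. The breakdown only helps if one can map the numbered directories back to tables or columns, which is the open question here.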

Queries

First, the count(*) query

SELECT count(*) FROM ontime

took 2.67 sec, which shows that InfiniDB does not store a counter of records but calculates it pretty fast.

Q0:

select avg(c1) from (select year,month,count(*) as c1 from ontime group by YEAR,month) t;

Another bummer: on this query InfiniDB complains

ERROR 138 (HY000): 
The query includes syntax that is not supported by InfiniDB. Use 'show warnings;' to get more information. Review the Calpont InfiniDB Syntax guide for additional information on supported distributed syntax or consider changing the InfiniDB Operating Mode (infinidb_vtable_mode).
mysql> show warnings;
+-------+------+------------------------------------------------------------+
| Level | Code | Message                                                    |
+-------+------+------------------------------------------------------------+
| Error | 9999 | Subselect in From clause is not supported in this release. |
+-------+------+------------------------------------------------------------+

Ok, so InfiniDB does not support DERIVED TABLES, which is a big limitation from my point of view.
As a workaround I tried to create a temporary table, but got another error:

mysql> create temporary table tq2 as (select Year,Month,count(*) as c1 from ontime group by Year, Month);
ERROR 122 (HY000): Cannot open table handle for ontime.

As the warning suggests, I set infinidb_vtable_mode = 2, which is:

2) auto-switch mode: InfiniDB will attempt to process the query internally, if it 
cannot, it will automatically switch the query to run in row-by-row mode.

but the query took 667 sec, so I skip queries Q5, Q6 and Q7, which are also based on DERIVED TABLES, as not supported by InfiniDB.
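One possible way around the missing derived tables for Q0 is to run only the inner GROUP BY on the server and finish the average on the client, since the inner query returns just one row per year and month. The Python sketch below shows the idea. It uses an in-memory SQLite table with made-up rows as a stand-in for `ontime`, so it demonstrates that the two-step result matches the derived-table result, not that it works or performs well on InfiniDB, which was not tested here.

```python
import sqlite3


def avg_monthly_count(conn):
    """Average number of rows per (Year, Month) group, without a derived table."""
    cur = conn.execute(
        "SELECT Year, Month, count(*) AS c1 FROM ontime GROUP BY Year, Month"
    )
    counts = [c1 for _year, _month, c1 in cur]
    return sum(counts) / len(counts) if counts else None


# Stand-in data: 3, 5 and 4 flights in three different months (made-up rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ontime (Year INTEGER, Month INTEGER)")
conn.executemany(
    "INSERT INTO ontime VALUES (?, ?)",
    [(1989, 1)] * 3 + [(1989, 2)] * 5 + [(1990, 1)] * 4,
)

two_step = avg_monthly_count(conn)

# The original Q0 form, which SQLite accepts, for comparison.
derived = conn.execute(
    "SELECT avg(c1) FROM (SELECT Year, Month, count(*) AS c1 "
    "FROM ontime GROUP BY Year, Month) t"
).fetchone()[0]

print(two_step, derived)
```

The same trick does not carry over automatically to Q5, Q6 and Q7, whose outer queries do more than a single average.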

Other queries (again, see the comparison with other engines in the Google Spreadsheet or in the summary table at the bottom):

Query Q1:

mysql> SELECT DayOfWeek, count(*) AS c FROM ontime WHERE Year BETWEEN 2000 AND 2008 GROUP BY DayOfWeek ORDER BY c DESC;

7 rows in set (6.79 sec)

Query Q2:

mysql> SELECT DayOfWeek, count(*) AS c FROM ontime WHERE DepDelay>10 AND Year BETWEEN 2000 AND 2008 GROUP BY DayOfWeek ORDER BY c DESC;

7 rows in set (4.59 sec)

Query Q3:

SELECT Origin, count(*) AS c FROM ontime WHERE DepDelay>10 AND Year BETWEEN 2000 AND 2008 GROUP BY Origin ORDER BY c DESC LIMIT 10;

4.96 sec

Query Q4:

mysql> SELECT Carrier, count(*) FROM ontime WHERE DepDelay > 10 AND YearD=2007 GROUP BY Carrier ORDER BY 2 DESC;

I had another surprise with this query: after 15 min it had not returned results. I checked the system and it was totally idle, but the query was stuck. I killed the query and restarted mysqld, but could not connect to mysqld anymore. In the process list I saw that InfiniDB had started a couple of external processes:

ExeMgr, DDLProc, PrimProc, controllernode fg, workernode DBRM_Worker1 fg

which cooperate with each other using IPC shared memory and semaphores. To clean up the system I rebooted the server, and only after that was mysqld able to start.

After that query Q4 took 0.75 sec

Queries Q5-Q7 skipped.

Query Q8:

SELECT DestCityName, COUNT( DISTINCT OriginCityName) FROM ontime WHERE YearD BETWEEN 2008 and 2008 GROUP BY DestCityName ORDER BY 2 DESC LIMIT 10;

And times for InfiniDB:

1y: 8.13 sec
2y: 16.54 sec
3y: 24.46 sec
4y: 32.49 sec
10y: 1 min 10.35 sec
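To see how Q8 scales with the scanned range, the timings above can be divided by the number of years. This is plain arithmetic on the reported numbers; the variable and function names are illustrative.

```python
# Q8 timings reported above, in seconds, keyed by the number of years scanned.
q8_seconds = {1: 8.13, 2: 16.54, 3: 24.46, 4: 32.49, 10: 60 + 10.35}


def seconds_per_year(timings):
    """Return {years: elapsed seconds divided by years scanned}."""
    return {years: elapsed / years for years, elapsed in timings.items()}


per_year = seconds_per_year(q8_seconds)
for years, cost in per_year.items():
    print(f"{years:>2}y: {cost:.2f} sec per year")
```

The cost stays between roughly 8.1 and 8.3 seconds per year for the 1 to 4 year ranges and drops to about 7 seconds per year for the 10 year range, so Q8 time grows close to linearly with the amount of data scanned.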

Query Q9:

select Year ,count(*) as c1 from ontime group by Year;

Time: 9.54 sec

Ok, so here is the summary table with query times (in sec, less is better)

Query	MonetDB	InfoBright	LucidDB	InfiniDB
Q0	29.9	4.19	103.21	NA
Q1	7.9	12.13	49.17	6.79
Q2	0.9	6.73	27.13	4.59
Q3	1.7	7.29	27.66	4.96
Q4	0.27	0.99	2.34	0.75
Q5	0.5	2.92	7.35	NA
Q6	12.5	21.83	78.42	NA
Q7	27.9	8.59	106.37	NA
Q8 (1y)	0.55	1.74	6.76	8.13
Q8 (2y)	1.1	3.68	28.82	16.54
Q8 (3y)	1.69	5.44	35.37	24.46
Q8 (4y)	2.12	7.22	41.66	32.49
Q8 (10y)	29.14	17.42	72.67	70.35
Q9	6.3	0.31	76.12	9.54
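As a quick reading aid for the table, the Python sketch below splits the queries that both InfoBright and InfiniDB completed by which engine was faster. The numbers are copied from the table; the names `times` and `split_by_winner` are illustrative.

```python
# (InfoBright, InfiniDB) query times in seconds, from the summary table.
# Q0, Q5, Q6 and Q7 are left out because InfiniDB could not run them.
times = {
    "Q1": (12.13, 6.79),
    "Q2": (6.73, 4.59),
    "Q3": (7.29, 4.96),
    "Q4": (0.99, 0.75),
    "Q8 (1y)": (1.74, 8.13),
    "Q8 (2y)": (3.68, 16.54),
    "Q8 (3y)": (5.44, 24.46),
    "Q8 (4y)": (7.22, 32.49),
    "Q8 (10y)": (17.42, 70.35),
    "Q9": (0.31, 9.54),
}


def split_by_winner(results):
    """Return (queries where InfiniDB is faster, queries where it is slower)."""
    faster, slower = [], []
    for query, (infobright, infinidb) in results.items():
        (faster if infinidb < infobright else slower).append(query)
    return faster, slower


faster, slower = split_by_winner(times)
for query in faster:
    infobright, infinidb = times[query]
    print(f"{query}: InfiniDB {infobright / infinidb:.2f}x faster")
for query in slower:
    infobright, infinidb = times[query]
    print(f"{query}: InfiniDB {infinidb / infobright:.2f}x slower")
```

On these numbers InfiniDB is ahead of InfoBright on Q1 through Q4 and behind on every Q8 variant and on Q9.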

Conclusions

- InfiniDB server version shows
  
  Server version: 5.1.39-community InfiniDB Community Edition 0.9.4.0-5-alpha (GPL)
  
  , so I consider it an alpha release, and it is doing OK for an alpha. I will wait for a more stable release before further tests, as it took a good amount of time to deal with the different glitches.

- InfiniDB shows really good time for queries it can handle, quite often better than InfoBright.

- Inability to handle derived tables is a significant drawback for me; I hope it will be fixed.

19 Comments

Jonas

16 years ago

It would be interesting to see how an “ordinary” SE performs, say MyIsam.
To get a better understanding how much better these are.

Nicholas Goodman

16 years ago

Jonas – I posted an example of moving one of the queries from “ordinary” MySQL to LucidDB (aka DynamoDB).

LucidDB is 400% faster than ordinary MySQL on these simple queries.
http://www.nicholasgoodman.com/bt/blog/2009/11/02/instant-relief-from-slow-mysql-reporting-queries-using-dynamodb/

MichaelM

16 years ago

Vadim, can you add MyIsam and Innodb in summary table?

Robin

16 years ago

Vadim – Thanks for including us in your tests! One of our engineers will be following up with you shortly on the storage consumption as that seems to be unusual, but we’ll see.

Regarding the lack of subquery support, I agree with you on the need. We decided to implement hash joins in the engine first (which are in) and then come back to do subqueries that perform much better than general MySQL afterwards. You can see our short term roadmap here: http://www.infinidb.org/resources/tech-articles/120-infinidb-community-edition-roadmap, which has subqueries coming out in alpha/beta around mid Q1 next year and being fully ready by first half of 2010.

If you or anyone else has any other wish-list items for us, please hit our suggestion box at http://www.infinidb.org/community/forums/3-suggestion-box or shoot us a feature request via sourceforge.

Vlad Rodionov

16 years ago

Can you guys try these queries on Greenplum single node edition? http://www.greenplum.com/products/single-node/. It is not open source, but its free even for production use.

Bob Dempsey

16 years ago

Vadim,

As Robin said, we appreciate you taking the time to look at InfiniDB. Your feedback is extremely valuable as we work to progress the software beyond Alpha. Here’s a first pass at your observation:

LOAD DATA INFILE vs. InfiniDB bulk load (cpimport):
We provided support for LOAD DATA INFILE as a way to support the maximum amount of syntax for existing MySQL installations. We have observed that it performs reasonably well up to about 1 million rows. Above 1 million rows, even though using cpimport takes more setup time, we strongly recommend that cpimport be used because we have seen even a 100x improvement in speed as opposed to LOAD DATA INFILE.

cpimport separators:
I wanted to let you know that you can specify another separator value to colxml (using the -d option). For example, ‘-d ,’, along with the other colxml arguments will allow the direct import of comma-separated-value files. Note that the separator has to be specified at XML job file creation time (via colxml), not at import time (via cpimport).

database size:
You actually are correct in determining the size of the database by examining the size of 000.dir tree. Right now, about 5GB of the space is used by the InfiniDB system catalog. Also, to help ensure physically contiguous blocks on disk, infinidb currently allocates space for 8 million rows at a time. The space used by a particular column depends on the column type and will vary from 8 MB to 128 MB in each allocation. We’re looking at ways to optimize this for tables (such as dimension tables) that may have significantly less rows in them.

After taking a moment to look at this schema, this one is more or less a worst-case scenario for InfiniDB in terms of disk space utilization: varchar columns (even NULL ones) occupy a minimum of 8 bytes on disk. For InfiniDB, a varchar(250) column is no more wasteful than a varchar(10) column, but not using that column is wasteful. It would appear that the vast majority of these columns are NULL in the input datasets. They use 0 bytes in the input file (1 if you count the separator), and 8 bytes on disk, for an 8x expansion. Moving such strings out of the fact table and into a dimension table is one obvious solution if disk space is at a premium. We also obviously understand though that we have no control over schema design, and because of that, we’re looking at ways to make this better for instances such as these.

show table status:
At the moment, infinidb is not currently integrated with the MySQL information schema and we currently report 2000 rows for every table. This one is on our list to take a look at as we understand that this is meaningful information for folks.

subselects:
Yes, you are correct, we currently do not support subselect. As Robin mentioned in a previous post, this item is on the roadmap that can be viewed at http://infinidb.org/resources/tech-articles/120-infinidb-community-edition-roadmap. This is one of our top priorities.

q0:
Similar to other solutions, we have the ability to interact with MySQL using the standard storage engine API. By setting vtable_mode = 2, you enable maximum syntax support at the cost of row return rate because it is now going through the standard storage engine API. In this setting, InfiniDB does not take responsibility for aggregation and join steps, only scans and filters. So specifically, in this example, infinidb does not aggregate the rows and returns all of them to MySQL to perform the aggregation.

q4:
For future reference, ‘service infinidb restart’ should accomplish everything a reboot does, without the reboot.

Again, we sincerely appreciate the feedback. Please let me know what else I can answer.

Regards,
Bob

Author

Vadim Tkachenko

16 years ago

Jonas,

I am going to run that against MyISAM, you are not alone who requests that

Author

Vadim Tkachenko

16 years ago

Vlad,

I will try Greenplum if documentation for this product is available.

Author

Vadim Tkachenko

16 years ago

Robin, Bob

Thank you for following our blog and commenting results!

Couple comments from me:

cpimport did not understand quoted “” fields, is there a way to import a file with quotes?

‘service infinidb restart’ did not help. I was not able to connect to mysqld after that.

Bob Dempsey

16 years ago

Vadim,

There is no way currently to specify optional separators like double-quotes (‘”‘) to InfiniDB’s bulk loader. I will open an enhancement request for this.

Also, as an FYI, there is no way to store a zero-length string in InfiniDB. All string columns are either NULL or have a length >= 1.

Jim Tommaney

16 years ago

Vadim,

This is very good analysis and feedback, thank you for taking the time to do this.

Of course, InfiniDB is all about a multi-threaded processing model that will benefit from additional cores. So, towards that end, I recreated the data set on two separate InfiniDB instances. A single server installation with a Dell 8-core server @ 2.0GHz, as well as a multi-server implementation. I then used Linux hotplug capabilities to take cores offline to mimic a 2, 4, and 8 core server. In spite of this hack, with 6 out of 8 cores offline the measurements were remarkably similar (a total of 178.6 seconds on the dual Xeon, and a total of 174.56 seconds with 6 cores offline).

Because this is absolutely a different server configuration and to avoid confusion on what was run where, the results are posted here:

http://www.infinidb.org/myblog-admin/infinidb-parallel-processing-of-airline-on-time-data.html

Hopefully, this will give a sense of the multi-threaded capabilities of the system for scan and aggregation. Look to that site for future profiles of our scalable hash-join operation as well.

Author

Vadim Tkachenko

16 years ago

Jim,

Thank you, so as I understood there is no way to enable multi-thread handling in Community edition, right ?

Also queries Q8 looks slow in InfiniDB, is there way to improve ?

Jim Tommaney

16 years ago

Actually, the multi-threaded behavior is enabled by default in community edition.

I agree with regard to Q8, we are doing some profiling and will provide updates when possible.

Author

Vadim Tkachenko

16 years ago

Jim,

So how can I reuse two cores on my system ?

Jim Tommaney

16 years ago

Sorry, bad link above, this is the right one.
http://www.infinidb.org/infinidb-blog/infinidb-parallel-processing-of-airline-on-time-data.html

It’s just a matter of installing the community edition and running queries. The default thread parameters are sufficient to allow full multi-core processing with up to 16 cores. Beyond, that some attention to a couple of parameters may be needed to maximize system utilization.

You should be able to verify with top that cpu utilization approaches 200% with a two cpu system. For example, this is what I see when running Q9 with 6 of 8 cores disabled:

[root@srvalpha2 ~]# top -d .25 | grep PrimProc
26264 root 18 -1 14.8g 2.8g 7000 S 183 17.9 68:00.43 PrimProc
26264 root 18 -1 14.8g 2.8g 7000 S 243 17.9 68:02.21 PrimProc
26264 root 18 -1 14.8g 2.8g 7000 S 152 17.9 68:03.39 PrimProc
26264 root 18 -1 14.8g 2.8g 7000 S 188 17.9 68:04.15 PrimProc
26264 root 18 -1 14.8g 2.8g 7000 S 196 17.9 68:04.88 PrimProc
26264 root 18 -1 14.8g 2.8g 7000 S 198 17.9 68:06.24 PrimProc
26264 root 18 -1 14.8g 2.8g 7000 S 171 17.9 68:07.44 PrimProc
26264 root 18 -1 14.8g 2.8g 7000 S 197 17.9 68:09.41 PrimProc
26264 root 18 -1 14.8g 2.8g 7000 S 195 17.9 68:10.75 PrimProc
26264 root 18 -1 14.8g 2.8g 7000 S 200 17.9 68:13.26 PrimProc
26264 root 18 -1 14.8g 2.8g 7000 S 200 17.9 68:15.28 PrimProc
26264 root 18 -1 14.8g 2.8g 7000 S 122 17.9 68:15.59 PrimProc

Author

Vadim Tkachenko

16 years ago

Jim,

Ah, ok, perfect. Now I see, your results for 2 cores are similar with my results, so I assume both cores work here.

That makes a lot of sense, actually your engine is first (at least in Open Edition) that can utilize many cores during single query execution. That explains why InfiniDB is fast in many queries.

Are many cores used during loading data also ?

I am waiting on Q8 optimization and on subqueries, and after that I will give a shoot on 16GB, 8 cores box.

Jim Tommaney

16 years ago

The load process is multi-threaded (for all editions) and would see some benefit from additional cores, but nowhere near linear. My timing to load the 21 years with 8 cores was 6613 seconds vs. above time of 9747 seconds with 2 cores, but there could be a large number of hardware differences besides cores so it is difficult to draw any conclusion. Actual benefits from additional cores depends on a large number of conditions; # of columns, data types, storage, etc.

Jim Tommaney

16 years ago

Just as an aside, we believe we have fixed the memory issue you encountered with load data infile with our latest version on launchpad.

Weidong Zhou

16 years ago

Jim,
The bottleneck is IO, not CPU. The CPUs are waiting for lower level file IO to complete. This is why increasing number of cores not going to help too much. Use the tricks to tune IO will improve the performance. Good luck!

Weidong