October 30, 2014

Air traffic queries in InfiniDB: early alpha

As Calpont announced availability of InfiniDB I surely couldn’t miss a chance to compare it with previously tested databases in the same environment.
See my previous posts on this topic:
Analyzing air traffic performance with InfoBright and MonetDB
Air traffic queries in LucidDB

I could not run all queries against InfiniDB and I met some hiccups during my experiment, so it was less plain experience than with other databases.


So let’s go by the same steps:

Load data

InfiniDB supports MySQL’s

statement and it’s own

utilities. As

is more familiar for me, I started with that, however after issuing LOAD DATA on 180MB file ( for 1989 year, 1st month) very soon it caused extensive swapping (my box has 4GB of RAM) and statement failed with

Alright,

was more successful, however it has less flexibility in syntax than

, so I had to transform the input files into a format that

could understand.

Total load time was 9747 sec or 2.7h (not counting time spent on files transformation)

I put summary data into on load data time, datasize and query time to Google Spreadsheet so you can easy compare with previous results. There are different sheets for queries, datasize and time of load.

Datasize

Size of database after loading is another confusing point. InfiniDB data directory has complex structure like

so it’s hard to day what files are related to table. But after load, the size of 000.dir is 114G, which is as twice big as original data files. SHOW TABLE STATUS does not really help there, it shows

with totally misleading information.

So I put 114GB as size of data after load, until someone points me how to get real size, and also explains what takes so much space.

Queries

First count start query

took 2.67 sec, which shows that InfiniDB does not store counter of records, however calculates it pretty fast.

Q0:

Another bumper, on this query InfiniDB complains

Ok, so InfiniDB does not support DERIVED TABLES, which is big limitation from my point of view.
As workaround I tried to create temporary table, but got another error:

As warning suggests I turned

, which is:

but query took 667 sec :

so I skip queries Q5, Q6, Q7 from consideration, which are also based on DERIVED TABLES, as not supported by InfiniDB.

Other queries: (again look on comparison with other engines in Google Spreadsheet or in summary table at the bottom)

Query Q1:

7 rows in set (6.79 sec)

Query Q2:

7 rows in set (4.59 sec)

Query Q3:

4.96 sec

Query Q4:

I had another surprise with query, after 15 min it did not return results, I check system and it was totally idle, but query stuck. I killed query, restarted mysqld but could not connect to mysqld anymore. In processes I see that InfiniDB started couple external processes:

which cooperate each with other using IPC shared memory and semaphores. To clean system I rebooted server, and only after that mysqld was able to start.

After that query Q4 took 0.75 sec

Queries Q5-Q7 skipped.

Query Q8:

And times for InfiniDB:

1y: 8.13 sec
2y: 16.54 sec
3y: 24.46 sec
4y: 32.49 sec
10y: 1 min 10.35 sec

Query Q9:

Q9:

Time: 9.54 sec

Ok, so there is summary table with queries times (in sec, less is better)

QueryMonetDBInfoBrightLucidDBInfiniDB
Q029.94.19103.21NA
Q17.912.1349.176.79
Q20.96.7327.134.59
Q31.77.2927.664.96
Q40.270.992.340.75
Q50.52.927.35NA
Q612.521.8378.42NA
Q727.98.59106.37NA
Q8 (1y)0.551.746.768.13
Q8 (2y)1.13.6828.8216.54
Q8 (3y)1.695.4435.3724.46
Q8 (4y)2.127.2241.6632.49
Q8 (10y)29.1417.4272.6770.35
Q96.30.3176.129.54

Conclusions

  • InfiniDB server version shows, so I consider it as alpha release, and it is doing OK for alpha. I will wait for more stable release for further tests, as it took good amount of time to deal with different glitches.
  • InfiniDB shows really good time for queries it can handle, quite often better than InfoBright.
  • Inability to handle derived tables is significant drawback for me, I hope it will be fixed
About Vadim Tkachenko

Vadim leads Percona's development group, which produces Percona Clould Tools, the Percona Server, Percona XraDB Cluster and Percona XtraBackup. He is an expert in solid-state storage, and has helped many hardware and software providers succeed in the MySQL market.

Comments

  1. Jonas says:

    It would be interesting to see how an “ordinary” SE performs, say MyIsam.
    To get a better understanding how much better these are.

  2. Jonas – I posted an example of moving one of the queries from “ordinary” MySQL to LucidDB (aka DynamoDB).

    LucidDB is 400% faster than ordinary MySQL on these simple queries.
    http://www.nicholasgoodman.com/bt/blog/2009/11/02/instant-relief-from-slow-mysql-reporting-queries-using-dynamodb/

  3. MichaelM says:

    Vadim, can you add MyIsam and Innodb in summary table?

  4. Robin says:

    Vadim – Thanks for including us in your tests! One of our engineers will be following up with you shortly on the storage consumption as that seems to be unusual, but we’ll see.

    Regarding the lack of subquery support, I agree with you on the need. We decided to implement hash joins in the engine first (which are in) and then come back to do subqueries that perform much better than general MySQL afterwards. You can see our short term roadmap here:http://www.infinidb.org/resources/tech-articles/120-infinidb-community-edition-roadmap, which has subqueries coming out in alpha/beta around mid Q1 next year and being fully ready by first half of 2010.

    If you or anyone else has any other wish-list items for us, please hit our suggestion box at http://www.infinidb.org/community/forums/3-suggestion-box or shoot us a feature request via sourceforge.

  5. Vlad Rodionov says:

    Can you guys try these queries on Greenplum single node edition? http://www.greenplum.com/products/single-node/. It is not open source, but its free even for production use.

  6. Vadim,

    As Robin said, we appreciate you taking the time to look at InfiniDB. Your feedback is extremely valuable as we work to progress the software beyond Alpha. Here’s a first pass at your observation:

    LOAD DATA INFILE vs. InfiniDB bulk load (cpimport):
    We provided support for LOAD DATA INFILE as a way to support the maximum amount of syntax for existing MySQL installations. We have observed that it performs reasonably well up to about 1 million rows. Above 1 million rows, even though using cpimport takes more setup time, we strongly recommend that cpimport be used because we have seen even a 100x improvement in speed as opposed to LOAD DATA INFILE.

    cpimport separators:
    I wanted to let you know that you can specify another separator value to colxml (using the –d option). For example, ‘-d ,’, along with the other colxml arguments will allow the direct import of comma-separated-value files. Note that the separator has to be specified at XML job file creation time (via colxml), not at import time (via cpimport).

    database size:
    You actually are correct in determining the size of the database by examining the size of 000.dir tree. Right now, about 5GB of the space is used by the InfiniDB system catalog. Also, to help ensure physically contiguous blocks on disk, infinidb currently allocates space for 8 million rows at a time. The space used by a particular column depends on the column type and will vary from 8 MB to 128 MB in each allocation. We’re looking at ways to optimize this for tables (such as dimension tables) that may have significantly less rows in them.

    After taking a moment to look at this schema, this one is more or less a worst-case scenario for InfiniDB in terms of disk space utilization: varchar columns (even NULL ones) occupy a minimum of 8 bytes on disk. For InfiniDB, a varchar(250) column is no more wasteful than a varchar(10) column, but not using that column is wasteful. It would appear that the vast majority of these columns are NULL in the input datasets. They use 0 bytes in the input file (1 if you count the separator), and 8 bytes on disk, for an 8x expansion. Moving such strings out of the fact table and into a dimension table is one obvious solution if disk space is at a premium. We also obviously understand though that we have no control over schema design, and because of that, we’re looking at ways to make this better for instances such as these.

    show table status:
    At the moment, infinidb is not currently integrated with the MySQL information schema and we currently report 2000 rows for every table. This one is on our list to take a look at as we understand that this is meaningful information for folks.

    subselects:
    Yes, you are correct, we currently do not support subselect. As Robin mentioned in a previous post, this item is on the roadmap that can be viewed at http://infinidb.org/resources/tech-articles/120-infinidb-community-edition-roadmap. This is one of our top priorities.

    q0:
    Similar to other solutions, we have the ability to interact with MySQL using the standard storage engine API. By setting vtable_mode = 2, you enable maximum syntax support at the cost of row return rate because it is now going through the standard storage engine API. In this setting, InfiniDB does not take responsibility for aggregation and join steps, only scans and filters. So specifically, in this example, infinidb does not aggregate the rows and returns all of them to MySQL to perform the aggregation.

    q4:
    For future reference, ‘service infinidb restart’ should accomplish everything a reboot does, without the reboot.

    Again, we sincerely appreciate the feedback. Please let me know what else I can answer.

    Regards,
    Bob

  7. Vadim says:

    Jonas,

    I am going to run that against MyISAM, you are not alone who requests that

  8. Vadim says:

    Vlad,

    I will try Greenplum if documentation for this product is available.

  9. Vadim says:

    Robin, Bob

    Thank you for following our blog and commenting results!

    Couple comments from me:

    cimport did not understand quoted “” fields, is there way to import file with quotes ?

    ’service infinidb restart’ did not help. I was not able to connect to mysqld after that.

  10. Vadim,

    There is no way currently to specify optional separators like double-quotes (‘”‘) to InfiniDB’s bulk loader. I will open an enhancement request for this.

    Also, as an FYI, there is no way to store a zero-length string in InfiniDB. All string columns are either NULL or have a length >= 1.

  11. Vadim,

    This is very good analysis and feedback, thank you for taking the time to do this.

    Of course, InfiniDB is all about a multi-threaded processing model that will benefit from additional cores. So, towards that end, I recreated the data set on two separate InfiniDB instances. A single server installation with a Dell 8-core server @ 2.0GHz, as well as a multi-server implementation. I then used Linux hotplug capabilities to take cores offline to mimic a 2, 4, and 8 core server. In spite of this hack, with 6 out of 8 cores offline the measurements were remarkably similar (a total of 178.6 seconds on the dual Xeon, and a total of 174.56 seconds with 6 cores offline).

    Because this is absolutely a different server configuration and to avoid confusion on what was run where, the results are posted here:

    http://www.infinidb.org/myblog-admin/infinidb-parallel-processing-of-airline-on-time-data.html

    Hopefully, this will give a sense of the multi-threaded capabilities of the system for scan and aggregation. Look to that site for future profiles of our scalable hash-join operation as well.

  12. Vadim says:

    Jim,

    Thank you, so as I understood there is no way to enable multi-thread handling in Community edition, right ?

    Also queries Q8 looks slow in InfiniDB, is there way to improve ?

  13. Actually, the multi-threaded behavior is enabled by default in community edition.

    I agree with regard to Q8, we are doing some profiling and will provide updates when possible.

  14. Vadim says:

    Jim,

    So how can I reuse two cores on my system ?

  15. Sorry, bad link above, this is the right one.
    http://www.infinidb.org/infinidb-blog/infinidb-parallel-processing-of-airline-on-time-data.html

    It’s just a matter of installing the community edition and running queries. The default thread parameters are sufficient to allow full multi-core processing with up to 16 cores. Beyond, that some attention to a couple of parameters may be needed to maximize system utilization.

    You should be able to verify with top that cpu utilization approaches 200% with a two cpu system. For example, this is what I see when running Q9 with 6 of 8 cores disable:

    [root@srvalpha2 ~]# top -d .25 | grep PrimProc
    26264 root 18 -1 14.8g 2.8g 7000 S 183 17.9 68:00.43 PrimProc
    26264 root 18 -1 14.8g 2.8g 7000 S 243 17.9 68:02.21 PrimProc
    26264 root 18 -1 14.8g 2.8g 7000 S 152 17.9 68:03.39 PrimProc
    26264 root 18 -1 14.8g 2.8g 7000 S 188 17.9 68:04.15 PrimProc
    26264 root 18 -1 14.8g 2.8g 7000 S 196 17.9 68:04.88 PrimProc
    26264 root 18 -1 14.8g 2.8g 7000 S 198 17.9 68:06.24 PrimProc
    26264 root 18 -1 14.8g 2.8g 7000 S 171 17.9 68:07.44 PrimProc
    26264 root 18 -1 14.8g 2.8g 7000 S 197 17.9 68:09.41 PrimProc
    26264 root 18 -1 14.8g 2.8g 7000 S 195 17.9 68:10.75 PrimProc
    26264 root 18 -1 14.8g 2.8g 7000 S 200 17.9 68:13.26 PrimProc
    26264 root 18 -1 14.8g 2.8g 7000 S 200 17.9 68:15.28 PrimProc
    26264 root 18 -1 14.8g 2.8g 7000 S 122 17.9 68:15.59 PrimProc

  16. Vadim says:

    Jim,

    Ah, ok, perfect. Now I see, your results for 2 cores are similar with my results, so I assume both cores work here.

    That makes a lot of sense, actually your engine is first (at least in Open Edition) that can utilize many cores during single query execution. That explains why InfiniDB is fast in many queries.

    Are many cores used during loading data also ?

    I am waiting on Q8 optimization and on subqueries, and after that I will give a shoot on 16GB, 8 cores box.

  17. The load process is multi-threaded (for all editions) and would see some benfit from additional cores, but nowhere near linear. My timing to load the 21 years with 8 cores was 6613 seconds vs. above time of 9747 seconds with 2 cores, but there could be a large number of hardware differences besides cores so it is difficult to draw any conclusion. Actual benefits from additional cores depends on a large number of conditions; # of columns, data types, storage, etc.

  18. Just as an aside, we believe we have fixed the memory issue you encountered with load data infile with our latest version on launchpad.

  19. Weidong Zhou says:

    Jim,
    The bottleneck is IO, not CPU. The CPUs are waiting for lower level file IO to complete. This is why increasing number of cores not going to help too much. Use the tricks to tune IO will improve the performance. Good luck!

    Weidong

Speak Your Mind

*