Accidentally me and Baron played with InfoBright (see https://www.percona.com/blog/2009/09/29/quick-comparison-of-myisam-infobright-and-monetdb/) this week. And following Baron’s example I also run the same load against MonetDB. Reading comments to Baron’s post I tied to load the same data to LucidDB, but I was not successful in this. I tried to analyze a bigger dataset and I took public available data http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time about USA domestic flights with information about flight length and delays. The data is available from 1988 to 2009 in chunks per month, so I downloaded 252 files (for 1988-2008 years) with size from 170MB to 300MB each. In total raw data is about 55GB. Average amount of rows in each chunk is 483762.46 (the query Q0 is:
|
1 |
select avg(c1) from (select year,month,count(*) as c1 from ontime group by YEAR,month) t; |
for InfoBright and
|
1 |
with t as (select yeard,monthd,count(*) as c1 from ontime group by YEARD,monthd) select AVG(c1) FROM t |
for MonetDB. For InfoBright it took 4.19 sec to execute and 29.9 sec for MonetDB, but it’s almost single case where MonetDB was significantly slower) Few words about environment: server Dell SC1425, with 4GB of RAM and Dual Intel(R) Xeon(TM) CPU 3.40GHz. InfoBright (ICE) version: 5.1.14-log build number (revision)=IB_3.2_GA_5316(ice) MonetDB version: server v5.14.2, based on kernel v1.32.2 LucidDB was 0.9.1 The table I loaded data is:
|
1 |
CREATE TABLE `ontime` ( `Year` year(4) DEFAULT NULL, `Quarter` tinyint(4) DEFAULT NULL, `Month` tinyint(4) DEFAULT NULL, `DayofMonth` tinyint(4) DEFAULT NULL, `DayOfWeek` tinyint(4) DEFAULT NULL, `FlightDate` date DEFAULT NULL, `UniqueCarrier` char(7) DEFAULT NULL, `AirlineID` int(11) DEFAULT NULL, `Carrier` char(2) DEFAULT NULL, `TailNum` varchar(50) DEFAULT NULL, `FlightNum` varchar(10) DEFAULT NULL, `Origin` char(5) DEFAULT NULL, `OriginCityName` varchar(100) DEFAULT NULL, `OriginState` char(2) DEFAULT NULL, `OriginStateFips` varchar(10) DEFAULT NULL, `OriginStateName` varchar(100) DEFAULT NULL, `OriginWac` int(11) DEFAULT NULL, `Dest` char(5) DEFAULT NULL, `DestCityName` varchar(100) DEFAULT NULL, `DestState` char(2) DEFAULT NULL, `DestStateFips` varchar(10) DEFAULT NULL, `DestStateName` varchar(100) DEFAULT NULL, `DestWac` int(11) DEFAULT NULL, `CRSDepTime` int(11) DEFAULT NULL, `DepTime` int(11) DEFAULT NULL, `DepDelay` int(11) DEFAULT NULL, `DepDelayMinutes` int(11) DEFAULT NULL, `DepDel15` int(11) DEFAULT NULL, `DepartureDelayGroups` int(11) DEFAULT NULL, `DepTimeBlk` varchar(20) DEFAULT NULL, `TaxiOut` int(11) DEFAULT NULL, `WheelsOff` int(11) DEFAULT NULL, `WheelsOn` int(11) DEFAULT NULL, `TaxiIn` int(11) DEFAULT NULL, `CRSArrTime` int(11) DEFAULT NULL, `ArrTime` int(11) DEFAULT NULL, `ArrDelay` int(11) DEFAULT NULL, `ArrDelayMinutes` int(11) DEFAULT NULL, `ArrDel15` int(11) DEFAULT NULL, `ArrivalDelayGroups` int(11) DEFAULT NULL, `ArrTimeBlk` varchar(20) DEFAULT NULL, `Cancelled` tinyint(4) DEFAULT NULL, `CancellationCode` char(1) DEFAULT NULL, `Diverted` tinyint(4) DEFAULT NULL, `CRSElapsedTime` INT(11) DEFAULT NULL, `ActualElapsedTime` INT(11) DEFAULT NULL, `AirTime` INT(11) DEFAULT NULL, `Flights` INT(11) DEFAULT NULL, `Distance` INT(11) DEFAULT NULL, `DistanceGroup` TINYINT(4) DEFAULT NULL, `CarrierDelay` INT(11) DEFAULT NULL, `WeatherDelay` INT(11) DEFAULT NULL, `NASDelay` INT(11) DEFAULT NULL, `SecurityDelay` INT(11) DEFAULT NULL, `LateAircraftDelay` INT(11) DEFAULT NULL, `FirstDepTime` varchar(10) DEFAULT NULL, `TotalAddGTime` varchar(10) DEFAULT NULL, `LongestAddGTime` varchar(10) DEFAULT NULL, `DivAirportLandings` varchar(10) DEFAULT NULL, `DivReachedDest` varchar(10) DEFAULT NULL, `DivActualElapsedTime` varchar(10) DEFAULT NULL, `DivArrDelay` varchar(10) DEFAULT NULL, `DivDistance` varchar(10) DEFAULT NULL, `Div1Airport` varchar(10) DEFAULT NULL, `Div1WheelsOn` varchar(10) DEFAULT NULL, `Div1TotalGTime` varchar(10) DEFAULT NULL, `Div1LongestGTime` varchar(10) DEFAULT NULL, `Div1WheelsOff` varchar(10) DEFAULT NULL, `Div1TailNum` varchar(10) DEFAULT NULL, `Div2Airport` varchar(10) DEFAULT NULL, `Div2WheelsOn` varchar(10) DEFAULT NULL, `Div2TotalGTime` varchar(10) DEFAULT NULL, `Div2LongestGTime` varchar(10) DEFAULT NULL, `Div2WheelsOff` varchar(10) DEFAULT NULL, `Div2TailNum` varchar(10) DEFAULT NULL, `Div3Airport` varchar(10) DEFAULT NULL, `Div3WheelsOn` varchar(10) DEFAULT NULL, `Div3TotalGTime` varchar(10) DEFAULT NULL, `Div3LongestGTime` varchar(10) DEFAULT NULL, `Div3WheelsOff` varchar(10) DEFAULT NULL, `Div3TailNum` varchar(10) DEFAULT NULL, `Div4Airport` varchar(10) DEFAULT NULL, `Div4WheelsOn` varchar(10) DEFAULT NULL, `Div4TotalGTime` varchar(10) DEFAULT NULL, `Div4LongestGTime` varchar(10) DEFAULT NULL, `Div4WheelsOff` varchar(10) DEFAULT NULL, `Div4TailNum` varchar(10) DEFAULT NULL, `Div5Airport` varchar(10) DEFAULT NULL, `Div5WheelsOn` varchar(10) DEFAULT NULL, `Div5TotalGTime` varchar(10) DEFAULT NULL, `Div5LongestGTime` varchar(10) DEFAULT NULL, `Div5WheelsOff` varchar(10) DEFAULT NULL, `Div5TailNum` varchar(10) DEFAULT NULL ) ENGINE=BRIGHTHOUSE DEFAULT CHARSET=latin1; |
Last fields starting with “Div*” are not really used. Load procedure: Infobright: the loader that comes with ICE version is very limited and I had to transform files to quote each field. After that load statement is:
|
1 |
mysql -S /tmp/mysql-ib.sock -e "LOAD DATA INFILE '/data/d1/AirData_ontime/${YEAR}_$i.txt.tr' INTO TABLE ontime FIELDS TERMINATED BY ',' ENCLOSED BY '"'" ontime |
The load time for each chunk was about 30s/chunk in initial years and up to 48s/chunk for 2008 year. And total load time is 8836 sec (2.45h). The size of database after load is 1.6G which is impressive and give 1:34 compress ratio. MonetDB: It took some time to figure out how to load text data ( I really wish developers improve documentation), but finally I ended up with next load statement:
|
1 |
/usr/local/monetdb/bin/mclient -lsql --database=ontime -t -s "COPY 700000 records INTO ontime FROM '/data/d1/AirData_ontime/${Y EAR}_$i.txt' USING DELIMITERS ',','n','"' NULL AS '';" |
Load time: 13065 sec ( 3.6h) Database size after load is 65G , which is discouraging. It seems it does not use any compression, and it’s bigger than original data. LucidDB Here it took time to find how to execute command from command line using included
|
1 |
sqlline |
utility, and I did not understand how to do that, so I generated big SQL file which contained load statements. Load of each chunk was significantly slower starting with about 60 sec/chunk for initial year and constantly growing to 200 sec / chunk for 2000 year. On 2004 year (after about 5h of loading) the load failed by some reason and I did not try to repeat, as I would not fit in timeframe I allocated for this benchmark. Maybe I will try sometime again. Query execution So I really have data for InfoBright and MonetDB, let see how fast they are in different queries. First favorite query for any database benchmarker is
|
1 |
SELECT count(*) FROM ontime; |
. Both InforBritgh and MonetDB executes it immediately with result 117023290 rows Now some random queries I tried again both databases: -Q1: Count flights per day from 2000 to 2008 years
|
1 |
SELECT DayOfWeek, count(*) AS c FROM ontime WHERE YearD BETWEEN 2000 AND 2008 GROUP BY DayOfWeek ORDER BY c DESC |
with result: [ 5, 7509643 ] [ 1, 7478969 ] [ 4, 7453687 ] [ 3, 7412939 ] [ 2, 7370368 ] [ 7, 7095198 ] [ 6, 6425690 ] And it took 7.9s for MonetDB and 12.13s for InfoBright. -Q2: Count of flights delayed more than 10min per day of week for 2000-2008 years
|
1 |
SELECT DayOfWeek, count(*) AS c FROM ontime WHERE DepDelay>10 AND YearD BETWEEN 2000 AND 2008 GROUP BY DayOfWeek ORDER BY c DESC |
Result: [ 5, 1816486 ] [ 4, 1665603 ] [ 1, 1582109 ] [ 7, 1555145 ] [ 3, 1431248 ] [ 2, 1348182 ] [ 6, 1202457 ] And 0.9s execution for MonetDB and 6.37s for InfoBright. -Q3: Count of delays per airport for years 2000-2008
|
1 |
SELECT Origin, count(*) AS c FROM ontime WHERE DepDelay>10 AND YearD BETWEEN 2000 AND 2008 GROUP BY Origin ORDER BY c DESC LIMIT 10 |
[ “ORD”, 739286 ] [ “ATL”, 736736 ] [ “DFW”, 516957 ] [ “PHX”, 336360 ] [ “LAX”, 331997 ] [ “LAS”, 307677 ] [ “DEN”, 306594 ] [ “EWR”, 262007 ] [ “IAH”, 255789 ] [ “DTW”, 248005 ] with 1.7s for MonetDB and 7.29s for InfoBright -Q4: Count of delays per Carrier for 2007 year
|
1 |
SELECT carrier, count(*) FROM ontime WHERE DepDelay>10 AND YearD=2007 GROUP BY carrier ORDER BY 2 DESC |
[ “WN”, 296293 ] [ “AA”, 176203 ] … With 0.27s for MonetDB and 0.99sec for InfoBright But it obvious that the more flight carrier has, the more delays, so to be fair, let’s calculate -Q5: Percentage of delays for each carrier for 2007 year. It is a bit more trickier, as for InfoBright and MonetDB you need different query: MonetDB:
|
1 |
WITH t AS (SELECT carrier, count(*) AS c FROM ontime WHERE DepDelay>10 AND YearD=2007 GROUP BY carrier), t2 AS (SELECT carrier, count(*) AS c2 FROM ontime WHERE YearD=2007 GROUP BY carrier) SELECT t.carrier, c, c2, c*1000/c2 as c3 FROM t JOIN t2 ON (t.Carrier=t2.Carrier) ORDER BY c3 DESC |
InfoBright:
|
1 |
SELECT t.carrier, c, c2, c*1000/c2 as c3 FROM (SELECT carrier, count(*) AS c FROM ontime WHERE DepDelay>10 AND Year=2007 GROUP BY carrier) t JOIN (SELECT carrier, count(*) AS c2 FROM ontime WHERE Year=2007 GROUP BY carrier) t2 ON (t.Carrier=t2.Carrier) ORDER BY c3 DESC; |
I am using c*1000/c2 here, because MonetDB seems using integer arithmetic and, with c/c2 I received just 1. So result is: [ “EV”, 101796, 286234, 355 ] [ “US”, 135987, 485447, 280 ] [ “AA”, 176203, 633857, 277 ] [ “MQ”, 145630, 540494, 269 ] [ “AS”, 42830, 160185, 267 ] [ “B6”, 50740, 191450, 265 ] [ “UA”, 128174, 490002, 261 ] … with execution time: 0.5s for MonetDB and 2.92s for InfoBright. Warnings: do not try EXPLAIN this query in InfoBright. MySQL is really stupid here, and EXPLAIN for this query took 6 min! If you wonder about carriers – EV is Atlantic Southeast Airlines and US is US Airways Inc. 35.5% flights of Atlantic Southeast Airlines was delayed on more than 10 mins! -Q6: Let’s try the same query for wide range of years 2000-2008: Result is: [ “EV”, 443798, 1621140, 273 ] [ “AS”, 299282, 1207960, 247 ] [ “B6”, 191250, 787113, 242 ] [ “WN”, 1885942, 7915940, 238 ] [ “FL”, 287815, 1220663, 235 ] … And execution 12.5s MonetDB and 21.83s InfoBright. (AS is Alaska Airlines Inc. and B6 is JetBlue Airways) -Q7: Percent of delayed (more 10mins) flights per year: MonetDB:
|
1 |
with t as (select YEARD,count(*)*1000 as c1 from ontime WHERE DepDelay>10 GROUP BY YearD), t2 as (select YEARD,count(*) as c2 from ontime GROUP BY YEARD) select t.YEARD, c1/c2 FROM t JOIN t2 ON (t.YEARD=t2.YEARD) |
InfoBright:
|
1 |
SELECT t.YEARD, c1/c2 FROM (select YEARD,count(*)*1000 as c1 from ontime WHERE DepDelay>10 GROUP BY YearD) t JOIN (select YEARD,count(*) as c2 from ontime GROUP BY YEARD) t2 ON (t.YEARD=t2.YEARD) |
with result: [ 1988, 166 ] [ 1989, 199 ] [ 1990, 166 ] [ 1991, 147 ] [ 1992, 146 ] [ 1993, 154 ] [ 1994, 165 ] [ 1995, 193 ] [ 1996, 221 ] [ 1997, 191 ] [ 1998, 193 ] [ 1999, 200 ] [ 2000, 231 ] [ 2002, 163 ] [ 2003, 153 ] [ 2004, 192 ] [ 2005, 210 ] [ 2006, 231 ] [ 2007, 245 ] [ 2008, 219 ] And with execution time 27.9s MonetDB and 8.59s InfoBright. It seems MonetDB does not like scanning wide range of rows, the slowness here is similar to Q0. -Q8: As final I tested most popular destination in sense count of direct connected cities for different diapason of years.
|
1 |
SELECT DestCityName, COUNT( DISTINCT OriginCityName) FROM ontime WHERE Year BETWEEN N and M GROUP BY DestCityName ORDER BY 2 DESC LIMIT 10; |
Years, InfoBright, MonetDB 1y, 5.88s, 0.55s 2y, 11.77s, 1.10s 3y, 17.61s, 1.69s 4y, 37.57s, 2.12s 10y, 79.77s, 29.14s UPDATE (5-Oct-2009): James Birchall recommended to use hint COMMENT ‘lookup’ for fields in InfoBright that have less 10.000 distinct values. I tried that, and it affected only these queries. Results with changes: Years, InfoBright, MonetDB 1y, 1.74s, 0.55s 2y, 3.68s, 1.10s 3y, 5.44s, 1.69s 4y, 7.22s, 2.12s 10y, 17.42s, 29.14s -Q9: And prove that MonetDB does not like to scan many records, there is query
|
1 |
select year,count(*) as c1 from ontime group by YEAR |
which shows how many records per years +——+———+ | year | c1 | +——+———+ | 1989 | 5041200 | | 1990 | 5270893 | | 1991 | 5076925 | | 1992 | 5092157 | | 1993 | 5070501 | | 1994 | 5180048 | | 1995 | 5327435 | | 1996 | 5351983 | | 1997 | 5411843 | | 1998 | 5384721 | | 1999 | 5527884 | | 2000 | 5683047 | | 2001 | 5967780 | | 2002 | 5271359 | | 2003 | 6488540 | | 2004 | 7129270 | | 2005 | 7140596 | | 2006 | 7141922 | | 2007 | 7455458 | | 2008 | 7009728 | +——+———+ And execution time: MonetDB: 6.3s and InfoBright: 0.31s To group all results there is graph:
Conclusions:
I am open to run any other queries if you want to compare or get info about air performance.