Air traffic queries in MyISAM and Tokutek (TokuDB)Vadim Tkachenko
Let me explain the reason of choosing these engines. After initial three posts I am often asked “What is baseline ? Can we compare results with standard MySQL engines ?”. So there come MyISAM to consider it as base point to see how column-oriented-analytic engines are better here.
However, take into account, that for MyISAM we need to choose proper indexes to execute queries effectively, and there is pain coming with indexes: – load of data is getting slower; – to design proper indexes is additional research, especially when MySQL optimizer is not smart in picking best one.
The really nice thing about MonetDB, InfoBright, InfiniDB is that they do not need indexes, so you may not worry about maintaining them and picking best one. I am not sure about LucidDB, I was told indexes are needed, but creating new index was really fast even on full database, so I guess, it’s not B-Tree indexes. So this my reflexion on indexes turned me onto TokuDB direction.
What is so special about TokuDB ? There two things: indexes have special structure and are “cheap”, by “cheap” I mean the maintenance cost is constant and independent on datasize. With regular B-Tree indexes cost grows exponentially on datasize (Bradley Kuszmaul from Tokutek will correct me if I am wrong in this statement). Another point with TokuDB, it uses compression, so I expect less size of loaded data and less IO operations during query execution.
So what indexes we need for queries. To recall you details, the schema is available in this post
queries I posted on sheet “Queries” in my summary Spreadsheet.
With Bradley’s help we chose next indexes:
KEY `Year` (`Year`,`Month`),
KEY `Year_2` (`Year`,`DayOfWeek`),
KEY `DayOfWeek` (`DayOfWeek`,`Year`,`DepDelay`),
KEY `DestCityName` (`DestCityName`,`OriginCityName`,`Year`),
KEY `Year_3` (`Year`,`DestCityName`,`OriginCityName`),
KEY `Year_4` (`Year`,`Carrier`,`DepDelay`),
KEY `Origin` (`Origin`,`Year`,`DepDelay`)
And I measured load time for both MyISAM and TokuDB in empty table with created indexes.
Load time for MyISAM: 16608 sec
For TokuDB: 19131 sec
Datasize (including indexes)
I am a bit surprised that TokuDB is slower loading data, but my guess it is related to compression, and I expect with bigger amount of data TokuDB will be faster MyISAM.
Now to queries. Bradley pointed me that query Q5
SELECT t.carrier, c, c2, c*1000/c2 as c3 FROM (SELECT carrier,
count(*) AS c FROM ontime WHERE DepDelay>10 AND Year=2007 GROUP BY
carrier) t JOIN (SELECT carrier, count(*) AS c2 FROM ontime WHERE
Year=2007 GROUP BY carrier) t2 ON (t.Carrier=t2.Carrier) ORDER BY c3
can be rewritten as
SELECT carrier,totalflights,ndelayed,ndelayed*1000/totalflights as c3 FROM (SELECT carrier,count(*) as totalflights,sum(if(depdelay>10,1,0)) as ndelayed from ontime where year=2007 group by carrier) t order by c3 desc;
( I name it as Query Q5i)
The summary table with queries execution time (in sec, less is better):
For reference I used 5.1.36-Tokutek-2.1.0 for both MyISAM and TokuDB tests.
And if you are interested to compare MyISAM with previous engines:
The all results are available in summary Spreadsheet
I especially do not put TokuDB in the same table with analytic oriented databases, to highlight TokuDB is OLTP engine for general purposes.
As you see it is doing better than MyISAM in all queries.