Recently I was doing a little work for a client who has MyISAM tables with many columns (the same one Peter wrote about recently). The client’s performance is suffering in part because of the number of columns, which is over 200. The queries are generally pretty simple (sums of columns), but they’re ad-hoc (can access any columns) and it seems tailor-made for a column-oriented database.
I decided it was time to actually give Infobright a try. They have an open-source community edition, which is crippled but not enough to matter for this test. The “Knowledge Grid” architecture seems ideal for the types of queries the client runs. But hey, why not also try MonetDB, another open-source column-oriented database I’ve been meaning to take a look at?
What follows is not a realistic benchmark, it’s not scientific, it’s just some quick and dirty tinkering. I threw up an Ubuntu 9.04 small server on Amazon. (I used this version because there’s a .deb of MonetDB for it). I created a table with 200 integer columns and loaded it with random numbers between 0 and 10000. Initially I wanted to try with 4 million rows, but I had trouble with MonetDB — there was not enough memory for this. I didn’t do anything fancy with the Amazon server — I didn’t fill up the /mnt disk to claim the bits, for example. I used default tuning, out of the box, for all three databases.
The first thing I tried doing was loading the data with SQL statements. I wanted to see how fast MyISAM vs. MonetDB would interpret really large INSERT statements, the kind produced by mysqldump. But MonetDB choked and told me the number of columns mismatched. I found reference to this on the mailing list, and skipped that. I used LOAD DATA INFILE instead (MonetDB’s version of that is COPY INTO). This is the only way to get data into Infobright, anyway.
I loaded 1 million rows into the table. Here’s a graph of the times (smaller is better):
MyISAM took 88 seconds, MonetDB took 200, and Infobright took 486. Here’s the size of the resulting table on disk (smaller is better):
MyISAM is 787MB, MonetDB is 791MB, and Infobright is 317MB. Next I ran three queries:
select sum(c19), sum(c89), sum(c129) from t;
select sum(c19), sum(c89), sum(c129) from t where c11 > 5;
select sum(c19), sum(c89), sum(c129) from t where c11 < 5;
Graphs of query performance time for all three databases are really not very helpful, because MyISAM is so much slower that you can’t see the graphs for the others. So I’ll give the numbers and then omit MyISAM from the graphs. Here are the numbers for everything I measured:
|size (bytes)Â Â Â||826000000Â Â Â||829946723||332497242|
|load time (seconds)Â Â Â||88Â Â Â||200Â Â Â||486|
|query1 timeÂ Â Â||3.4Â Â Â||0.012Â Â Â||0.0007|
|query2 timeÂ Â Â||3.4Â Â Â||0.15Â Â Â||1.2|
|query3 timeÂ Â Â||2.5Â Â Â||0.076Â Â Â||0.15|
And here is a graph of Infobright duking it out with MonetDB on the three queries I tested (shorter bar is better):
I ran each query a few times, discarded the first run, and averaged the next three together.
A few miscellaneous notes: don’t forget that Infobright is not just a storage engine plugged into MySQL. It’s a complete server with a different optimizer, etc. This point was hammered home during the LOAD DATA INFILE, when I looked to see what was taking so long (I was tempted to use oprofile and see if there are sleep() statements). What did I see in ‘top’ but a program called bhloader. This bhloader program was the only thing doing anything; mysqld wasn’t doing a thing. LOAD DATA INFILE in Infobright isn’t what it seems to be. Otherwise, Infobright behaved about as I expected it to; it seemed pretty normal to a MySQL guy.
MonetDB was a bit different. I had to be a bit resourceful to get everything going. The documentation was for an old version, and was pretty sparse. I had to go to the mailing lists to find the correct COPY syntax — it wasn’t that listed in the online manual. And there were funny things like a “merovingian” process (think “angel”) that had to be started before the server would start, and I had to destroy the demo database and recreate it before I could start it as shown in the tutorials.
MonetDB has some unexpected properties; it is not a regular RDBMS. Still, I’m quite impressed by it in some ways. For example, it seems quite nicely put together, and it’s not at all hard to learn.
It doesn’t really “speak SQL” — it speaks relational algebra, and the SQL is just a front-end to it. You can talk XQuery to it, too. I’m not sure if you can talk dirty to it, but you can sure talk nerdy to it: you can, should you choose to, give it instructions in MonetDB Assembly Language (MAL), the underlying language. An abstracted front-end is a great idea; MySQL abstracts the storage backend, but why not do both? Last I checked, Drizzle is going this direction, hurrah!
EXPLAIN is enlightening and frightening! You get to see the intermediate code from the compiler. The goggles, they do nothing!
From what I was able to learn about MonetDB in an hour, I believe it uses memory-mapped files to hold the data in-memory. If this is true, it explains why I couldn’t load 4 million rows into it (this was a 32-bit Amazon machine).
The SQL implementation is impressive. It’s a really solid subset of SQL:2003, much more than I expected. It even has CTEs, although not recursive ones. (No, there is no REPLACE, and there is no INSERT/ON DUPLICATE KEY UPDATE.) I didn’t try the XQuery interface.
Although I didn’t try it out, there are what looks like pretty useful instrumentation interfaces for profiling, debugging and the like. The query timer is in milliseconds (why doesn’t mysql show query times in microseconds? I had to resort to Perl + Time::HiRes for timing the Infobright queries).
I think it can be quite useful. However, I’m not quite sure it’s useful for “general-purpose” database use — there are a number of limitations (concurrency, for one) and it looks like it’s still fairly experimental.