New OLAP Wikistat benchmark: Introduction and call for feedback

I’ve seen that my posts on the Ontime Air Traffic and Star Schema Benchmark datasets got a lot of interest.

However, those benchmarks by themselves did not cover all the cases I wanted, so I was thinking about a better scenario. The biggest problem is getting a real dataset that is big enough, and I thank Bradley C. Kuszmaul, who pointed me to the statistics on access to Wikipedia pages, and Domas, who made the stats accessible. Link to the archives: or the original Domas’s announcement.

Although the table does not contain much varied information, I think it is good enough to represent cases you can face in a web application (log processing, page visits, clickstream, etc.).

I made some effort to normalize the data into a classic star schema and prepared queries that can be run on the proposed dataset (John Sichi, lead of LucidDB, helped me draft some of the queries).
You can see details on our Percona Wikistat benchmark Wiki.

I have the following goals for the proposed benchmark:

  • Compare engines on OLAP queries for planning, predicting growth, analyzing access patterns to wiki pages, and drawing trends.
  • Compare engines on statistical queries for end users, which can be executed in real time, e.g. how many times a given page was accessed yesterday vs. today.
  • Understand the specific features and characteristics of each engine.
  • Compare throughput on simple queries (queries and scenario still to be drafted).
  • Check the ability to load data and serve queries at the same time (availability during data load; queries and scenario still to be drafted).
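To give an idea of the second goal, a yesterday-vs-today comparison might be expressed like this (a sketch only: the table and column names, such as pagestat and dim_date, are my assumptions here, not the final schema from the benchmark Wiki):

```sql
-- Sketch only: table/column names (pagestat, dim_date, date_id, page_id)
-- are assumptions, not the final schema from the benchmark Wiki.
-- Compare accesses to one page on two consecutive days.
SELECT d.day, SUM(f.page_count) AS hits
FROM pagestat f
JOIN dim_date d ON f.date_id = d.id
WHERE f.page_id = 12345
  AND d.day IN ('2009-07-30', '2009-07-31')
GROUP BY d.day;
```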

So in the proposed schema I have four tables:

(the fact table), and

(the dimension tables).

The dimension tables are supposed to be static and unchanging, and we can vary the data size
by changing the number of months loaded into the fact table (so this is the scale factor).
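To make the layout concrete, a minimal star-schema sketch could look like the following. All names and column types here are illustrative assumptions (only pagestat, project_id, and page_count come up later in the discussion); the real schema is on the benchmark Wiki.

```sql
-- Illustrative star schema; real names/columns are on the benchmark Wiki.
CREATE TABLE dim_date (
  id  INT PRIMARY KEY,
  dt  DATETIME            -- one entry represents 1 hour
);

CREATE TABLE dim_project (
  id   INT PRIMARY KEY,
  name VARCHAR(32)        -- e.g. 'en', 'de', 'commons'
);

CREATE TABLE dim_page (
  id    BIGINT PRIMARY KEY,
  title VARCHAR(255)
);

-- Fact table: one row per (page, hour); grows with each month loaded,
-- which is what varies with the scale factor.
CREATE TABLE pagestat (
  date_id    INT,
  project_id INT,
  page_id    BIGINT,
  page_count INT,          -- number of accesses in that hour
  page_size  INT
);
```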

EER diagram

(made with MySQL Workbench)

In the current dataset, which you can download from an Amazon snapshot (name: “percona-wikistat”, ID: snap-a5f9bacc), we have:

  • Table : 724,550,811 rows; data size: 40,476 MB
  • Table : 9,624 rows; one entry represents 1 hour
  • Table : 2,025 rows
  • Table :
    Data for 2009-06: 3,453,013,109 rows / size 68,352 MB
    Data for 2009-07: 3,442,375,618 rows / size 68,152 MB

So with two months of stats we have about 172 GB of data, with about 7 billion rows in the fact table.

Example of a query (again, the full list is on the Benchmark Wiki):
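A typical trend-style query over this kind of schema might look like the following (purely illustrative; the table and column names are my assumptions, not the Wiki originals):

```sql
-- Illustrative only: total accesses per day for June 2009
-- (table/column names are assumptions, not the Wiki schema).
SELECT DATE(d.dt) AS day, SUM(f.page_count) AS hits
FROM pagestat f
JOIN dim_date d ON f.date_id = d.id
WHERE d.dt >= '2009-06-01' AND d.dt < '2009-07-01'
GROUP BY DATE(d.dt)
ORDER BY day;
```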

I am going to load the data and run the queries against the available engines:

  • MySQL MyISAM / InnoDB (to have reference results)
  • InfoBright
  • InfiniDB
  • MonetDB
  • LucidDB
  • Greenplum

and I will report my results (so stay tuned to MySQLPerformanceBlog ;))

I’d also like to test the ParAccel, Vertica, and Kickfire systems, but I do not have access to them.

I welcome your feedback on the benchmark, and on anything else you would like to see here.


Comments (16)

  • Roland Bouman Reply


    I can’t wait 🙂

    January 29, 2010 at 1:13 am
  • Alexander Reply

    Hi Vadim —

    It should be a very interesting test. Thanks for including Greenplum here.

    In our company we are currently testing star schema performance for MySQL MyISAM, MySQL TokuDB, InfiniDB, Greenplum, and Vertica. It will be interesting to compare our results with yours.


    January 29, 2010 at 4:16 am
  • David Prime Reply

    I’m very interested in the continuous-query abilities of these engines and will look forward to the results as they come in. Thanks for doing the work, Vadim.


    January 29, 2010 at 6:29 am
  • Rob Wultsch Reply

    If you are benching Greenplum, which is a proprietary fork of Postgres, why not also bench Pg for comparison’s sake?

    January 29, 2010 at 2:08 pm
  • Vadim Reply


    Greenplum allows column-oriented tables and can utilize all CPUs when executing a single query.
    Postgres can’t do that, AFAIK.

    January 29, 2010 at 2:32 pm
  • Slapo Reply

    Could you consider adding Firebird 2.5 RC in SuperClassic mode to the comparison?
    There aren’t that many benchmarks of it, and this could be interesting to see.

    One other thing: have you considered using a db design used by a content management system (Drupal, TYPO3) or an e-commerce solution (Opencart, Magento, …)?

    Adding vanilla Postgres would still be interesting just to see how it compares to Greenplum.

    Thanks 🙂

    January 29, 2010 at 3:10 pm
  • Vadim Reply


    I do not think a CMS db design is suitable for OLAP benchmarks.
    Nor do I think the OLTP-oriented Firebird and PostgreSQL will be good for the proposed benchmark.

    I expect that queries will take 5–1000 minutes to execute, and if we do not use a specialized engine,
    it will be just pain. I am going to run on MySQL MyISAM / InnoDB only to have a baseline.

    January 29, 2010 at 4:20 pm
  • Rob Wultsch Reply

    I was suggesting testing Postgres because I imagine it would add very little difficulty over testing Greenplum. I do not think you are incorrect that performance would probably be lacking; I just thought it would be another interesting baseline.

    January 30, 2010 at 1:27 pm
  • Slapo Reply


    I suggested what I wrote before because someone might actually want to do some analytics on a CMS or an e-shop: visits, visitors, countries, extensive product history, etc.
    I see your point that it is perhaps too lengthy to bother with, though.

    January 31, 2010 at 7:53 am
  • Vadim Reply


    There are two factors:
    1) I am not an expert in PostgreSQL.
    2) Results that directly compare PostgreSQL vs. MySQL usually cause holy wars, and whoever presents them is claimed to be unqualified a) in PostgreSQL, b) in MySQL, or c) in both.
    So I do not want to go that way.

    However, if there is an expert who wants to install PostgreSQL, tune it, and run the proposed scenarios,
    I would be happy to even provide access to a box.

    January 31, 2010 at 7:18 pm
  • Jos van Dongen Reply

    Hi Vadim,

    Sounds great! What kind of box will you be using, and how much memory / how many disks will it have? E.g. Infobright compresses data quite aggressively, so your 172GB will probably fit in 64 or even 32GB of memory, and Greenplum likes lots of disks (at least 1 data segment per CPU core, so with a dual quad-core machine you get better results with 8 partitions on separate drives). What you should also try to do is simulate concurrent access by firing multiple concurrent query streams. Some databases are fast when running one query at a time but get crippled when they have to perform multiple tasks simultaneously.

    About the ‘holy wars’ you mention: look at where I ran TPCH using Greenplum and got discussions between PostgreSQL and GP advocates…

    anyway, good luck with the benchmark and if you need help (or access to my machine ;-), just let me know.

    best, Jos

    February 1, 2010 at 12:32 am
  • Rob Wultsch Reply

    I dropped a note to the PG advocacy list in hopes of getting an expert to help with tuning.

    I *wish* I was (and hope to be someday) that expert.

    February 1, 2010 at 1:40 am
  • David Reply

    I’d like to see the performance of the following query:
    SELECT project_id, sum(page_count) sm
    FROM pagestat
    GROUP BY project_id;

    This query is important to me because it looks at ALL the data in the fact table without any joins. This sort of query gives an overview of all the data, which is typical in my field of data mining. I suspect that row-based engines will perform better for this one.

    February 8, 2010 at 8:25 am
  • Vadim Reply


    I added your query, but I’d like to have at least a date restriction,
    as we may have different amounts of data loaded into the fact table, depending on the scale factor.
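    For instance, a date-restricted variant might look like this (the dim_date join and its column names are my assumption about how the date dimension is keyed):

    ```sql
    -- David's query plus a date restriction (dim_date names assumed):
    SELECT f.project_id, SUM(f.page_count) sm
    FROM pagestat f
    JOIN dim_date d ON f.date_id = d.id
    WHERE d.dt >= '2009-06-01' AND d.dt < '2009-07-01'
    GROUP BY f.project_id;
    ```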

    February 8, 2010 at 8:49 pm
  • David Reply


    Thanks for your answer. I see your point, though the problem is that queries without any restrictions are essential for answering the following typical business questions (I do not know whether they can be answered with the data in your example database):
    – How many visits have we had in total so far?
    – How many pages have been created?
    – What is the distribution of visits per page?
    – How has page creation developed over time?

    All these sorts of questions run over all the data. I believe that looking at fractions of the data is more the IT perspective on data.

    February 9, 2010 at 1:42 am
  • Vadim Reply


    Sure, that makes sense.
    It’s just that in the current environment it may be a runaway query,
    which could take days to execute on some engines…

    February 9, 2010 at 8:39 am
