I’ve seen my posts on Ontime Air traffic and Star Schema Benchmark got a lot of interest
(links:
).
However benchmarks by itself did not cover all cases I would want, so I was thinking about better scenario. The biggest problem is to get real big enough dataset, and I thank to Bradley C. Kuszmaul, he pointed me on Wikipedia statistics on access to Wikipedia pages, and thank to Domas, who made stats accessible. Link to the archives: http://dammit.lt/wikistats/archive/ or the original Domas’s announcement .
Although the table does not have very much different information, I think it is good enough to represent cases you can face in Web application ( log processing, page visits, clickstream, etc).
I made some efforts to normalize data to have model in classic star schema and prepared queries that could be run on proposed dataset (John Sichi, lead of LucidDB helped me to draft some queries).
You can see details on our Percona Wikistat benchmark Wiki.
I have next goals with proposed benchmark:
So in proposed schema I have four tables:
|
1 |
pagestat |
(fact table), and
|
1 |
pages, datesinfo, projects |
(dimensions tables).
Dimensions tables are supposed to be static and not changed, and we can change datasize
by varying amount of months loaded into fact table (so this is scale factor).
EER diagram

( made with MySQL Workbench )
In current dataset, which you can download from Amazon snapshot (name: “percona-wikistatâ€, ID:snap-a5f9bacc) we have:
|
1 |
pages |
|
1 |
datesinfo |
|
1 |
projects: |
|
1 |
pagestats |
So with two months of stats we have about 172GB of data with about 7 billion rows in fact table.
Example of query ( again, full list on Benchmark Wiki)
|
1 2 3 4 5 6 7 8 |
SELECT project, sum(page_count) sm FROM pagestat JOIN datesinfo di ON ( di.id=date_id ) JOIN projects p ON (p.id=project_id ) WHERE di.calmonth=7 and di.calyear=2009 GROUP BY project ORDER BY sm DESC LIMIT 20; |
I am going to load data and run queries against available engines:
and I will report my results ( so stay with MySQLPerformanceBlog ;))
I’d like also to test also Paraccel, Vertica and KickFire systems, but I do not have access to.
I welcome your feedback on the benchmark, and what else you would like to see here.