Wikipedia statistics benchmark

Goals

There several goals of benchmark

  • Compare engines in OLAP queries for planning, predicting growth, analyzing access patterns to wiki pages, draw trends.
  • Compare engines in statistical queries for end users, which can be executed in real-time. I.e. How many times that or another page was accessed yesterday vs today.
  • Understand specific features and characteristic of each engine.
  • Compare throughput on simple queries
  • Check ability to load data and serve queries at the same time ( availability during data load )

Introduction

It's important for benchmark to use real data. I am going to take statistics of access to WikiPedia pages http://lists.wikimedia.org/pipermail/wikitech-l/2007-December/035435.html

Format of raw data is

PROJECT PAGE ACCESSED BYTES
...
aa Category:User_aa 1 6313                                                                                               
aa Main_Page 3 16473                                                                                                     
aa Special:RecentChangesLinked/User:Purodha 1 5701                                                                       
aa Special:Statistics 1 661                                                                                              
aa Talk:Main_Page 1 8092  
...

I made some efforts to normalize data, final tables you can see below

:!: Normalized data is available in AWS snapshot (“percona-wikistat”, ID:snap-a5f9bacc)

Tables

  • pages

Dictionary of pages. 724.550.811 rows. data size: 40476M

CREATE TABLE `pages` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `page` varchar(1024) NOT NULL,
  PRIMARY KEY (`id`)
)
  • projects

Projects dictionary, small table. 2025 rows.

CREATE TABLE `projects` (
  `id` smallint unsigned NOT NULL AUTO_INCREMENT,
  `project` varchar(50) NOT NULL,
  PRIMARY KEY (`id`)
)
  • datesinfo

Table with different part of date. 9624 rows.

CREATE TABLE `datesinfo` (
  `id` smallint unsigned NOT NULL AUTO_INCREMENT,
  `datestring` char(11) NOT NULL,
  `caldate` date NOT NULL,
  `calday` int(10) unsigned NOT NULL,
  `calmonth` tinyint(3) unsigned NOT NULL,
  `calyear` smallint(5) unsigned NOT NULL,
  `dayofweek` tinyint(3) unsigned NOT NULL,
  `dayhour` tinyint(3) unsigned NOT NULL,
  PRIMARY KEY (`id`)
)
+------+-------------+------------+--------+----------+---------+-----------+---------+
| id   | datestring  | caldate    | calday | calmonth | calyear | dayofweek | dayhour |
+------+-------------+------------+--------+----------+---------+-----------+---------+
...
| 3625 | 20090601-00 | 2009-06-01 |      1 |        6 |    2009 |         2 |       0 | 
| 3626 | 20090601-01 | 2009-06-01 |      1 |        6 |    2009 |         2 |       1 | 
| 3627 | 20090601-02 | 2009-06-01 |      1 |        6 |    2009 |         2 |       2 | 
| 3628 | 20090601-03 | 2009-06-01 |      1 |        6 |    2009 |         2 |       3 | 
| 3629 | 20090601-04 | 2009-06-01 |      1 |        6 |    2009 |         2 |       4 | 
| 3630 | 20090601-05 | 2009-06-01 |      1 |        6 |    2009 |         2 |       5 | 
| 3631 | 20090601-06 | 2009-06-01 |      1 |        6 |    2009 |         2 |       6 | 
| 3632 | 20090601-07 | 2009-06-01 |      1 |        6 |    2009 |         2 |       7 | 
| 3633 | 20090601-08 | 2009-06-01 |      1 |        6 |    2009 |         2 |       8 | 
| 3634 | 20090601-09 | 2009-06-01 |      1 |        6 |    2009 |         2 |       9 | 
...
+------+-------------+------------+--------+----------+---------+-----------+---------+

  • pagestat

Fact table. page_count is aggregated amount (sum) of page calls per hour.

Raw files per month:

  1. 200906: 68.352M / 3.453.013.109 rows
  2. 200907: 68.152M / 3.442.375.618 rows
  3. 200908: 69063M
  4. 200909: 69480M
  5. 200910: 77.558M / 3.919.634.163 rows
  6. 200911: 74.851M /
  7. 200912: TBD
CREATE TABLE `pagestat` (
  `date_id` smallint unsigned NOT NULL,
  `project_id` smallint unsigned NOT NULL,
  `page_id` int unsigned NOT NULL,
  `page_count` mediumint unsigned NOT NULL,
)

EER Diagram

Queries

Queries under consideration:

  • Q_UNIQ(Year, Month):
  
   SELECT count(distinct page_id) FROM pagestat JOIN datesinfo di ON ( di.id=date_id ) WHERE di.calmonth=6 and di.calyear=2009;
  • Q_DAYSTAT(Year, Month):
   SELECT caldate, sum(page_count) 
   FROM pagestat JOIN datesinfo di ON ( di.id=date_id ) 
   WHERE di.calmonth=7 and di.calyear=2009 
   GROUP BY caldate ORDER BY caldate;
  • Q_TOP_PROJECTS(N,Year,Month):
   SELECT project, sum(page_count) sm 
   FROM pagestat 
     JOIN datesinfo di ON ( di.id=date_id ) 
     JOIN projects p ON  (p.id=project_id ) 
   WHERE di.calmonth=7 and di.calyear=2009 
   GROUP BY project 
   ORDER BY sm DESC 
   LIMIT 20;
   
  • Q_PROJECT_SUM
   SELECT project_id, sum(page_count) sm
   FROM pagestat
   GROUP BY project_id;
  • Q_HOURPROJECTSTAT(project,Year,Month,Day):
   SELECT dayhour, sum(page_count) sm 
   FROM pagestat 
     JOIN datesinfo di ON ( di.id=date_id ) 
     JOIN projects p ON  (p.id=project_id ) 
   WHERE di.caldate='2009-07-21' AND project='es' 
   GROUP BY dayhour;

* Q_NOPROJECTS(Year,Month,Day)

   SELECT project FROM projects LEFT OUTER JOIN pagestat ON (pagestat.project_id = projects.id) WHERE pagestat.project_id IS NULL ORDER BY project; 
   
  • Q_NOPROJECTS_1(Year,Month,Day):
   SELECT project FROM projects 
    LEFT OUTER JOIN 
     (SELECT distinct project_id pid FROM pagestat WHERE date_id IN 
      (SELECT id FROM datesinfo WHERE calmonth=7 and calyear=2009 AND calday=14  )) t1 
    ON (projects.id=pid) 
   WHERE pid IS NULL ORDER BY project;
  • Q_DAYWEEKSTAT(project,Year,Month)
   SELECT dayofweek, sum(page_count) sm 
   FROM pagestat 
     JOIN datesinfo di ON ( di.id=date_id ) 
     JOIN projects p ON  (p.id=project_id ) 
   WHERE di.calmonth=6 and di.calyear=2009 AND project='de' 
   GROUP BY dayofweek;  
  • Q_TOP50(project,Year,Month,Day)
   SELECT page, sum(page_count) sm 
   FROM pagestat 
     JOIN datesinfo di ON ( di.id=date_id ) 
     JOIN projects p ON  (p.id=project_id ) 
     JOIN pages pp ON (pp.id=page_id) 
   WHERE di.calmonth=6 and di.calyear=2009 AND di.calday=18 
     AND project='de' 
   GROUP BY page 
   ORDER BY sm DESC 
   LIMIT 50;
  • Q_MAXPAGE(Year,Month,Day)
   SELECT page_id, max(page_count) mx
   FROM pagestat
     JOIN datesinfo di ON ( di.id=date_id ) 
   WHERE di.calmonth=6 and di.calyear=2009 AND di.calday=18 
   GROUP BY page_id
   ORDER BY mx DESC
   LIMIT 50;
  • Q_10000HITS_1(Year,Month,Day) - found pages with more than 10000 hits per hour
   SELECT page_id, max(page_count) mx
   FROM pagestat
     JOIN datesinfo di ON ( di.id=date_id ) 
   WHERE di.calmonth=7 and di.calyear=2009 AND di.calday=18 
   GROUP BY page_id
   HAVING max(page_count) > 10000
   LIMIT 50;
  
  • Q_10000HITS_2(Year,Month,Day) - found pages with more than 10000 hits per hour
   SELECT page,mx FROM pages 
    JOIN (SELECT page_id, max(page_count) mx  
          FROM pagestat    
          JOIN datesinfo di ON ( di.id=date_id )   
          WHERE di.calmonth=7 and di.calyear=2009 AND di.calday=18   
          GROUP BY page_id  
          HAVING max(page_count) > 10000  LIMIT 50) t 
    ON (t.page_id=pages.id);
 
  • Q_NOACCESS_SECOND_DAY(project_id, Year, Month, Dday): Find pages accessed more 1000 times one day and not accessed another
   SELECT page FROM pages JOIN 
    (SELECT t1.page_id FROM 
      (SELECT p1.page_id,sum(page_count) sm FROM pagestat p1 JOIN datesinfo di1 ON ( di1.id=p1.date_id ) 
       WHERE di1.calmonth=6 and di1.calyear=2009 AND di1.calday=18 AND p1.project_id=193 
       GROUP BY p1.page_id 
       HAVING sum(page_count)>1000)  t1 
      LEFT OUTER JOIN  
      (SELECT p2.page_id FROM pagestat p2 JOIN datesinfo di2 ON ( di2.id=p2.date_id ) 
       WHERE di2.calmonth=6 and di2.calyear=2009 AND di2.calday=19 AND p2.project_id=193) t2 
      ON (t2.page_id=t1.page_id)  WHERE t2.page_id IS NULL) bq 
   ON (pages.id=page_id);
  • Q_PAGE_MONTHSTATS(page,Year, Month)
   SELECT caldate, sum(page_count) FROM pagestat JOIN datesinfo ON ( pagestat.date_id = datesinfo.id)
   JOIN pages ON (pagestat.page_id=pages.id) WHERE page='Yugopolis' AND calmonth=7 and calyear=2009 GROUP by caldate ORDER BY caldate;
  • Q_PROJECT_AVG(Year, Month)
   SELECT project, AVG(page_count) FROM pagestat JOIN datesinfo ON ( pagestat.date_id = datesinfo.id)
   JOIN projects ON (pagestat.project_id=projects.id) WHERE calmonth=7 AND calyear=2009 GROUP by project;
  • Q_PAGES_PROJECT(Year,Month) Found pages that belong to several projects
   SELECT page_id,  count(distinct project_id) FROM pagestat JOIN datesinfo di ON ( di.id=date_id ) 
   WHERE di.calmonth=7 and di.calyear=2009 GROUP BY page_id HAVING count(distinct project_id) > 10 LIMIT 100;
  • Q_HOURS_DROP(Date, Hour, project_id): Find page that one hour was accessed two times more than next hour
   SELECT ps1.page_id, ps1.page_count 
   FROM pagestat ps1 
    JOIN datesinfo di1 ON ( ps1.date_id = di1.id) 
    JOIN pagestat ps2 ON (ps2.page_id=ps1.page_id) 
    JOIN  datesinfo di2 ON ( ps2.date_id = di2.id) 
  WHERE di1.caldate='2009-07-20' AND di1.dayhour=14 
    AND di2.caldate='2009-07-20' AND di2.dayhour=15 
    AND ps1.project_id=193 AND ps2.project_id=193 
    AND ps1.page_count > 2 * ps2.page_count 
    AND ps1.page_count>=1000 LIMIT 100; 
  • Build summary tables with per-day statistics
CREATE TABLE pagestat_daily (
   page_id int NOT NULL,
   date_id_from smallint  NOT NULL,
   date_id_to smallint  NOT NULL,
   project_id smallint  NOT NULL,
   page_count int  NOT NULL );

INSERT INTO pagestat_daily (project_id, page_id, date_id_from, date_id_to, page_count ) 
SELECT project_id, page_id, min(date_id), max(date_id), sum(page_count) 
FROM pagestat 
JOIN datesinfo ON (date_id=id) 
WHERE calmonth=6 AND calyear=2009 GROUP BY project_id, page_id,caldate;
  • Q_DATE_RANGE - date range query on summary table
   SELECT project, sum(page_count) sm 
   FROM pagestat_daily 
    JOIN datesinfo dstart ON ( date_id_from=dstart.id) 
    JOIN datesinfo dend ON (date_id_to=dend.id) 
    JOIN projects ON (project_id=projects.id) 
   WHERE dstart.caldate>='2009-06-15' AND dend.caldate <= '2009-06-20' 
   GROUP BY project HAVING sum(page_count) > 1000 
   ORDER BY sm DESC; 
  • Q_PAGE_DATE_RANGE - date range on summary table per page
   SELECT page, sm FROM pages, 
   ( SELECT page_id, sum(page_count) sm 
     FROM pagestat_daily 
      JOIN datesinfo dstart ON ( date_id_from=dstart.id) 
      JOIN datesinfo dend ON (date_id_to=dend.id) 
    WHERE dstart.caldate>='2009-06-15' AND dend.caldate <= '2009-06-20' 
   GROUP BY page_id ) dt 
   WHERE pages.id=dt.page_id ORDER BY sm DESC LIMIT 100;
  • Q_HIT_SPIKE - show max per hour spikes
   SELECT ps.page_id, MAX(ps.page_count/pd.page_count) AS max_daily_spikiness
   FROM pagestat ps
   INNER JOIN pagestat_daily pd ON ps.page_id=pd.page_id
     AND ps.date_id BETWEEN pd.date_id_from AND pd.date_id_to
   GROUP BY ps.page_id
   ORDER BY max_daily_spikiness DESC
   LIMIT 10;
  • QC_POINT_FACT Retrieve specific page on specific date from fact table
   SELECT * FROM pagestat ps1 
   JOIN datesinfo di1 ON ( ps1.date_id = di1.id) 
   WHERE page_id=64193807 AND project_id=193 
   AND di1.calday=20 AND di1.calmonth=7 
   AND di1.calyear=2009 AND di1.dayhour=15;
  • QC_POINT_PAGE(page_id)
 SELECT * FROM pages WHERE id=253608709;

* Retrieve stats for page / project for last hour

Engines

Engines in consideration:

MonetDB

Tables

CREATE TABLE pages (  id int NOT NULL , page varchar(1024) NOT NULL )
COPY 730000000 records INTO pages FROM '/data/vol2/wikistat/in/pages.txt';

Notes

Install pcre-6.6 with UTF-8 support ( http://www.fredsantos.net/index.php?option=com_content&view=article&id=114:unicode-support-on-centos-52-with-php-and-pcre&catid=36:linux&Itemid=77 )

rpm -ivh http://mirrors.kernel.org/centos/5.4/os/SRPMS/pcre-6.6-2.el5_1.7.src.rpm
change to %configure --enable-utf8 --enable-unicode-properties in pcre.spec

or

rpm -ivh http://gaarai.com/wp-content/uploads/2009/01/pcre-66-27x86_64.rpm

Greenplum

/etc/sysctl.conf
kernel.shmmax=500000000
kernel.shmmni=4096
kernel.shmall=4000000000
kernel.sem=250 64000 100 512
net.ipv4.tcp_tw_recycle=1

Tables

CREATE TABLE pages (  id int NOT NULL PRIMARY KEY, page varchar(1024) NOT NULL );
CREATE TABLE pagestat (
   date_id smallint  NOT NULL,
   project_id smallint  NOT NULL,
   page_id int NOT NULL,
   page_count int  NOT NULL )
WITH (appendonly=true, orientation=column, compresslevel=1, compresstype=QuickLZ);


gpfdist -d /mnt/data/ -p 8081 -l /home/gpadmin/log

CREATE EXTERNAL TABLE ext_pages ( id int, 
   page varchar(1024) ) 
   LOCATION ('gpfdist://localhost:8081/pages.txt')
 FORMAT 'TEXT' (DELIMITER '|');

CREATE EXTERNAL TABLE ext_pagestat (   date_id smallint , project_id smallint, page_id int ,page_count int )
LOCATION ('gpfdist://localhost:8081/200906.out') 
FORMAT 'TEXT' (DELIMITER '|');


Please note that this is NOT a place to ask questions or report bugs. This comment system is only intended for users to share tips and documentation additions concerning particular document.
Please report bugs to https://bugs.launchpad.net/percona-project/+filebug and ask general questions in maillist Percona-discussions

Discussion

Martin Kersten, 2010/01/03 03:50

The description is still rather sketchy and it lacks a clear objective for this benchmark. Are you intended to 'replace' weblog analyzers with a datawarehouse solution? Or do you want to support OLAP over aggregated weblog information?

For a weblog OLAP benchmark it lacks the properties people are often interested in. Ebay told me once they have about 15000 different attributes appearing in their log, all coming from different web apps. They are used for targeted OLAP.

This already gives an interesting design challenge. Do we need 15000 columns? or do we need a single property table (propname,propval). (I would go for 15000 columns)

Moreover, designing a benchmark in this arena should bring a USP to make impact. Any database system is likely to loose against filebase web-analysers supported in a map-reduce setting.

One area not addressed at all (except for TPH-D) is to recognize that a database is used in query sessions. This means that not a small set of individual queries are relevant for performance, but how the session evolves.

You can look at it as follows. Queries are characterised by templates, covering a number of tables and their join relationships. On top of this, there are scalars to reduce the attributes, and order-by constraints to make the result accessible. A query template can be summarized by QT( Ri, Sj, Jk, Ol) for Ri relations, Sj select arguments, Jk joins and Ol orderby descriptions.

Sessions evolve not randomly. People start with a template filled in (or generated through a front-end) and then based on the respond may slightly re-phrase it to zoom in on to their area of interest. This zoom behavior can affect R,S,J, and O.

Database administrators try to capture this behavior in terms of workloads and call upon the help of design wizards to provide the proper indices or tweak result caches.

A novelty in the DW space would be to come up with a benchmark where you can easily generate such session 'query trajectories', a small sequence of related queries. Systems are likely to differ if you confront them with such trajectories.

Vadim Tkachenko, 2010/01/03 11:15

Martin,

I added goal section, and you are right it's still draft.

As for representation, it is sure very far from eBay in both attributes and size dimensions, but working with many customer systems that stores user activity on site, it think my proposal should be good enough. Having three dimensions (time, projects, pages) I do not cover all possible attributes, but it is balance between complexity and making benchmark easy to understand.

I am going to test one of map-reduce solutions also, but to be objective, it has very low acceptance from end-users yet, and many prefer familiar SQL solution to map-reduce.

Alexander, 2010/03/02 18:53

Hi Vadim,

Could you please provide SQL script for Luciddb like you did for InfoBright, InfiniDB and Greenplum?

Would be really nice.

Enter your comment (wiki syntax is allowed):
CKFQM
 
benchmark/wikistat/start.txt · Last modified: 2010/03/21 06:56 (external edit)
 
Except where otherwise noted, content on this wiki is licensed under the following license:GNU Free Documentation License 1.2
Recent changes RSS feed Donate Powered by PHP Valid XHTML 1.0 Valid CSS Driven by DokuWiki