Star Schema Benchmark is variation of TPC-H with star-schema database design.
In this run I use scale factor 1000, which gives about 610GB of raw text files with data
| InfoBright | InfiniDB | LuciDB | MonetDB |
|---|---|---|---|
| total: 862m59.538s ( 51779 sec ) | 400m10.814s (24010 sec) | 140736 sec (w/o indexes) | 710m5.268sec ( 42608 sec) (w/o indexes) |
| lineitem 851m47.869s | 706m15.405s | ||
| customers 8m18.600s | 2m54.698s | ||
| supplier 2m39.031s | 0m46.885s | ||
| part 0m13.841s | 0m8.178s | ||
| dates 0m0.197s | 0m0.102s |
| Original | InfoBright | InfiniDB | LucidDB | MonetDB |
|---|---|---|---|---|
| 610G | 112G | 626G | 127G (w/o indexes) | 650G |
| Query | InfoBright | InfiniDB | LucidDB | MonetDB |
|---|---|---|---|---|
| Q1.1 | 48 min 21.67 sec (2901.67 sec) | 24 min 26.05 sec (1466.05 sec) | 3503.792 sec | 6130.704 sec |
| Q1.2 | 44 min 55.37 sec (2695.37 sec) | 24 min 25.83 sec (1465.83 sec) | 2889.903 sec | |
| Q1.3 | 45 min 53.49 sec (2753.49 sec) | 24 min 27.25 sec (1467.25 sec) | 2763.464 sec | |
| Q2.1 | 1 hour 54 min 27.74 sec (6867.74) | 19 min 44.35 sec (1184.35 sec) | 9694.534 sec | |
| Q2.2 | 1 hour 13 min 33.15 sec (4413.15) | 19 min 49.56 sec (1189.56 sec) | 9399.965 sec | |
| Q2.3 | 1 hour 8 min 23.41 sec (4103.41) | 19 min 52.27 sec (1192.25 sec) | 8875.349 sec | |
| Q3.1 | NA | 19 min 11.23 sec (1151.23 sec) | 16376.93 sec | |
| Q3.2 | 3 hours 30 min 17.64 sec (12617.64 sec) | 19 min 28.55 sec (1168.55 sec) | 5560.977 sec | |
| Q3.3 | 2 hours 58 min 18.87 sec (10698.87 sec) | 19 min 58.29 sec (1198.29 sec) | 2517.621 sec | |
| Q3.4 | 1 hour 41 min 41.29 sec (6101.29 sec) | 12 min 57.96 sec (777.96 sec) | 686.202 sec | |
| Q4.1 | 8 hours 53 min 52.55 sec (32032.55 sec) | 32 min 57.49 sec (1977.49 sec ) | 19843.213 sec | |
| Q4.2 | 5 hours 38 min 7.60 sec / 5 hours 36 min 35.69 sec (20195.69 sec) | 33 min 35.45 sec (2015.45 sec) | 15292.648 sec | |
| Q4.3 | 12 hours 58 min 4.27 sec (46684.27 sec) | 33 min 47.32 sec (2027.32 sec) | 7241.791 sec |
InfoBright version is 3.2.0-ICE
InfoBright query Q3.1 finished in 36 hours with error, that query can not be executed by InfoBright optimizer.
InfiniDB version 0.9.5-alpha
InfiniDB vmstat 5
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------ r b swpd free buff cache si so bi bo in cs us sy id wa st 3 13 4980 112264 202040 3699992 0 0 128614 2 1324 19552 9 1 56 34 0 0 13 4980 112148 202048 3699988 0 0 130253 4 1330 20318 8 1 53 37 0 0 13 4980 112172 202048 3699996 0 0 128614 0 1344 20604 9 1 51 39 0 1 15 4980 111968 202056 3699996 0 0 126157 2 1322 19618 8 1 60 31 0 1 15 4980 111820 202064 3699996 0 0 125338 2 1338 17428 8 1 67 24 0 0 12 4980 111892 202072 3699996 0 0 131072 2 1326 22453 9 1 55 34 0 0 15 4980 111892 202080 3699992 0 0 132710 2 1345 18963 9 1 59 31 0 0 14 4980 111784 202088 3699996 0 0 133530 2 1329 19589 9 1 65 25 0 1 13 4980 111868 202088 3699996 0 0 124518 0 1338 20013 8 1 58 33 0
LucidDB version is 0.9.2 LucidDB notes
MonetDB version is Nov 2009
Problem with MonetDB is that needed indexes require a lot of memory, much more than available on box. MonetDB team suggest to setup swap partition 128GB, but I can't do that in current server disk partitions layout.
2x QuadCore CPU
vendor_id : GenuineIntel cpu family : 6 model : 23 model name : Intel(R) Xeon(R) CPU E5405 @ 2.00GHz stepping : 6 cpu MHz : 2000.042 cache size : 6144 KB
Software RAID 0 over two disks
Linux 2.6.18-164.el5 #1 SMP Thu Sep 3 03:28:30 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux CentOS release 5.4 (Final)
select
sum(lo_extendedprice*lo_discount) as revenue
from
lineorder, dates
where
lo_orderdate = d_datekey
and d_year = 1993
and lo_discount between 1 and 3
and lo_quantity < 25;
select
sum(lo_extendedprice*lo_discount) as revenue
from
lineorder, dates
where
lo_orderdate = d_datekey
and d_yearmonthnum = 199401
and lo_discount between 4 and 6
and lo_quantity between 26 and 35;
select
sum(lo_extendedprice*lo_discount) as revenue
from
lineorder, dates
where
lo_orderdate = d_datekey
and d_weeknuminyear = 6
and d_year = 1994
and lo_discount between 5 and 7
and lo_quantity between 26 and 35;
select
sum(lo_revenue), d_year, p_brand
from
lineorder, dates, part, supplier
where
lo_orderdate = d_datekey
and lo_partkey = p_partkey
and lo_suppkey = s_suppkey
and p_category = 'MFGR#12'
and s_region = 'AMERICA'
group by
d_year, p_brand
order by
d_year, p_brand;
select
sum(lo_revenue), d_year, p_brand
from
lineorder, dates, part, supplier
where
lo_orderdate = d_datekey
and lo_partkey = p_partkey
and lo_suppkey = s_suppkey
and p_brand between 'MFGR#2221' and 'MFGR#2228'
and s_region = 'ASIA'
group by
d_year, p_brand
order by
d_year, p_brand;
select
sum(lo_revenue), d_year, p_brand
from
lineorder, dates, part, supplier
where
lo_orderdate = d_datekey
and lo_partkey = p_partkey
and lo_suppkey = s_suppkey
and p_brand = 'MFGR#2239'
and s_region = 'EUROPE'
group by
d_year, p_brand
order by
d_year, p_brand;
select
c_nation, s_nation, d_year,
sum(lo_revenue) as revenue
from
customer, lineorder, supplier, dates
where
lo_custkey = c_custkey
and lo_suppkey = s_suppkey
and lo_orderdate = d_datekey
and c_region = 'ASIA'
and s_region = 'ASIA'
and d_year >= 1992 and d_year <= 1997
group by
c_nation, s_nation, d_year
order by
d_year asc, revenue desc;
select
c_city, s_city, d_year, sum(lo_revenue) as revenue
from
customer, lineorder, supplier, dates
where
lo_custkey = c_custkey
and lo_suppkey = s_suppkey
and lo_orderdate = d_datekey
and c_nation = 'UNITED STATES'
and s_nation = 'UNITED STATES'
and d_year >= 1992 and d_year <= 1997
group by
c_city, s_city, d_year
order by
d_year asc, revenue desc;
select
c_city, s_city, d_year, sum(lo_revenue) as revenue
from
customer, lineorder, supplier, dates
where
lo_custkey = c_custkey
and lo_suppkey = s_suppkey
and lo_orderdate = d_datekey
and (c_city='UNITED KI1' or c_city='UNITED KI5')
and (s_city='UNITED KI1' or s_city='UNITED KI5')
and d_year >= 1992 and d_year <= 1997
group by
c_city, s_city, d_year
order by
d_year asc, revenue desc;
select
c_city, s_city, d_year, sum(lo_revenue) as revenue
from
customer, lineorder, supplier, dates
where
lo_custkey = c_custkey
and lo_suppkey = s_suppkey
and lo_orderdate = d_datekey
and (c_city='UNITED KI1' or c_city='UNITED KI5')
and (s_city='UNITED KI1' or s_city='UNITED KI5')
and d_yearmonth = 'Dec1997'
group by
c_city, s_city, d_year
order by
d_year asc, revenue desc;
select
d_year, c_nation,
sum(lo_revenue - lo_supplycost) as profit
from
DATES, CUSTOMER, SUPPLIER, PART, LINEORDER
where
lo_custkey = c_custkey
and lo_suppkey = s_suppkey
and lo_partkey = p_partkey
and lo_orderdate = d_datekey
and c_region = 'AMERICA'
and s_region = 'AMERICA'
and (p_mfgr = 'MFGR#1' or p_mfgr = 'MFGR#2')
group by
d_year, c_nation
order by
d_year, c_nation;
select
d_year, s_nation, p_category,
sum(lo_revenue - lo_supplycost) as profit
from
DATES, CUSTOMER, SUPPLIER, PART, LINEORDER
where
lo_custkey = c_custkey
and lo_suppkey = s_suppkey
and lo_partkey = p_partkey
and lo_orderdate = d_datekey
and c_region = 'AMERICA'
and s_region = 'AMERICA'
and (d_year = 1997 or d_year = 1998)
and (p_mfgr = 'MFGR#1' or p_mfgr = 'MFGR#2')
group by
d_year, s_nation, p_category
order by
d_year, s_nation, p_category;
select
d_year, s_city, p_brand,
sum(lo_revenue - lo_supplycost) as profit
from
DATES, CUSTOMER, SUPPLIER, PART, LINEORDER
where
lo_custkey = c_custkey
and lo_suppkey = s_suppkey
and lo_partkey = p_partkey
and lo_orderdate = d_datekey
and s_nation = 'UNITED STATES'
and (d_year = 1997 or d_year = 1998)
and p_category = 'MFGR#14'
group by
d_year, s_city, p_brand
order by
d_year, s_city, p_brand;
Please report bugs to https://bugs.launchpad.net/percona-project/+filebug and ask general questions in maillist Percona-discussions
Discussion
I'm kind of new to the whole column oriented database world. But wouldn't some of these benchmarks benefit from a de-normalization of the star schema approach? Column oriented databases are usually great for analyzing large tables with many columns where few columns are used in the queries. From very unscientific and random tests I've done on performance on queries in these types of systems, a denomalized data model is much faster for query times even if it slows down the data load and (possibly) increase the total data size. Thoughts?
I agree with you - I usually work with normalized tables for the sake of size and modelling standards. It also helps on DB models for OOP. Still, a de-normalized table is normally what we get on our heads when we start creating the joins and everything else we'll need to make the db tables useful. It does remind me of the DBase III and DBase IV tables/databases used in the past, where we'd store every single value without thinking of data redundancy. It did work well. So, if you're going to look for small subsets of columns' values and you have medium to large data in your tables, using such an engine can make evrything work faster - if you avoid the normalization for the sake of performance, a column-oriented database engine can benefit a lot. I think this would be mostly useful for BI and Cubes, since we'll have a 'everyone against everyone' crossing.