This is part two in my six-part series on business intelligence, with a focus on OLAP analysis.
What is a data warehouse?
It turns out that this question is a little more difficult to answer than it probably should be. This is because data warehousing has become an overloaded term that includes BI tools (OLAP/data mining), data extraction and transformation tools (ETL), and schema management tools.
To me, a data warehouse is “a relational database schema which stores historical data and metadata from an operational system or systems, in such a way as to facilitate the reporting and analysis of the data, aggregated to various levels.” This definition is a consolidation of various definitions that I have encountered.
There are a few key points here. First, data warehouses rarely contain information that exists nowhere else in an organization. The goal of data warehousing is to collect and keep a historical record of the information from another system. This might be an ERP application, the logs from a web application, data from manufacturing systems or even data from radio telescopes. This data is extracted from the source system(s), cleaned up, and inserted into the data warehouse with ETL tools. This process is usually called “conforming” the source data into the warehouse schema.
Another important aspect of the definition is aggregation. A data warehouse is usually used to summarize data over years, quarters, months, or other time dimension attributes. This aids in identifying historical trends and making predictions about future trends. Data is often aggregated in many different ways, and the results may be stored in aggregate tables so that they can be accessed quickly. This is particularly important as fact tables reach into the billions of rows and hundreds of gigabytes of information accumulate. Accessing this data outside of summarized form often takes a very long time.
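To make the idea of an aggregate table concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names (sales_fact, sales_by_month), the columns and the sample rows are all invented for illustration; a real warehouse would run the same kind of statement from an ETL job against its own schema.

```python
import sqlite3

# In-memory database so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical detail-level fact table (names are assumptions).
    CREATE TABLE sales_fact (
        sale_date TEXT,      -- ISO date such as '2008-01-15'
        item_id   INTEGER,
        amount    REAL
    );
    -- Hypothetical aggregate table holding one row per month.
    CREATE TABLE sales_by_month (
        sale_month   TEXT PRIMARY KEY,   -- 'YYYY-MM'
        total_amount REAL,
        sale_count   INTEGER
    );
""")

conn.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?)",
    [
        ("2008-01-15", 1, 10.0),
        ("2008-01-20", 2, 5.0),
        ("2008-02-03", 1, 7.5),
    ],
)

# Rebuild the aggregate from the detail rows. This is the step that has
# to be repeated periodically as new facts arrive.
conn.execute("DELETE FROM sales_by_month")
conn.execute("""
    INSERT INTO sales_by_month
    SELECT substr(sale_date, 1, 7), SUM(amount), COUNT(*)
    FROM sales_fact
    GROUP BY substr(sale_date, 1, 7)
""")

# Reports read the small aggregate table instead of scanning the facts.
monthly = conn.execute(
    "SELECT sale_month, total_amount, sale_count "
    "FROM sales_by_month ORDER BY sale_month"
).fetchall()
```

With three fact rows the saving is obviously nil, but the same pattern lets a report read a few dozen monthly rows rather than billions of detail rows.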
Is there a particular schema design which lends itself to this historical analysis?
There are two main methodologies for designing database schemata for data warehouse applications. These two methodologies approach the problem of storing data in very different ways.
The first methodology was popularized by Bill Inmon, who is considered by many to be the “father” of the data warehouse, or at least the first DW “evangelist” if you will. This approach focuses on the normalization of data. Highly normalized schemata are created and maintained by ETL jobs. Creating and maintaining these jobs is often one of the biggest parts of designing and running a data warehouse. A particular combination of ETL jobs which consists of one or more data transformations is usually called a “flow”. An example ETL flow might combine item and category information into a single dimension, while also maintaining the historical information about when each item was in each category.
These types of warehouses are almost always “insert only”. Data is very rarely updated or deleted in these databases and they are expected to grow to very large sizes, usually into the terabyte range, but sometimes even into petabytes. Aggregations are the exception to this rule, as they must be updated periodically to reflect the additions of data to the source tables. The goal of this methodology is the 100% accurate description of historical data from the operational system in a normalized manner which ensures that it can be updated quickly. It is accepted that analysis of the data will be more complex in this form, but that this complexity is an acceptable trade-off for historical accuracy. This is often described as the “top-down” approach.
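The example flow described above can be sketched in a few lines of plain Python. This is only an illustration of the transformation step, not a real ETL tool: the source data, the field names and the build_item_dimension function are all hypothetical.

```python
from datetime import date

# Hypothetical source data, invented for this sketch.
items = {1: "Widget", 2: "Gadget"}
categories = {10: "Tools", 20: "Toys"}
# (item_id, category_id, valid_from, valid_to); valid_to None = still current
item_category_history = [
    (1, 10, date(2007, 1, 1), date(2007, 6, 30)),
    (1, 20, date(2007, 7, 1), None),
    (2, 20, date(2007, 3, 1), None),
]


def build_item_dimension(items, categories, history):
    """Combine item and category data into one dimension, with one row
    per period during which an item belonged to a category."""
    # Order by item and start date so surrogate keys follow history.
    ordered = sorted(history, key=lambda h: (h[0], h[2]))
    dimension = []
    for key, (item_id, category_id, valid_from, valid_to) in enumerate(
        ordered, start=1
    ):
        dimension.append({
            "item_key": key,            # surrogate key for the dimension row
            "item_id": item_id,         # key from the source system
            "item_name": items[item_id],
            "category_name": categories[category_id],
            "valid_from": valid_from,
            "valid_to": valid_to,
        })
    return dimension


item_dimension = build_item_dimension(items, categories, item_category_history)
```

Here the Widget ends up with two rows, one for each category it has belonged to, so a fact recorded in early 2007 can still be reported under the category that applied at the time.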
What is a data mart?
The second approach, popularized by Ralph Kimball, holds that partial de-normalization of the data is beneficial. The goal of this approach is usually multi-dimensional (OLAP) analysis, as it is very hard to create a dimensional model from a highly normalized database schema. It is particularly difficult to build such a model that scales as the volume in the warehouse increases. For this reason OLAP analysis is usually performed on a star schema, which partially denormalizes the data. A star schema about a particular subject matter, such as sales, is usually referred to as a “data mart”. Maybe this is because they provide one-stop shopping for all the information about the particular subject matter. That is pretty much what I imagine when I hear the phrase.
Data marts tend to be updated frequently, at least once per day. As I mentioned in my previous post, a star schema consists of a central table called the fact table and additional dimension tables which contain information about the facts, such as lists of customers or products. Because of the partially denormalized nature of a star schema, the dimension tables in a data mart may be updated. In fact, there is a term for such a dimension: a “slowly changing dimension” or SCD. The fact table is usually only inserted into, but older data may be purged out of it. Sometimes the fact table will be aggregated from source data. A website which sells banner ads might roll up all the events for a particular ad to the day level, instead of storing detailed information about every impression and click for the ad.
A normalized data warehouse schema might contain tables called items, categories and item_category. These three tables allow a user to determine which items belong to which categories, but this structure creates a large number of joins when many dimensions are involved. A data mart would collapse all of this information into an item dimension which would include the category information in the same row as the item information. It would be possible to create two different dimensions, product and category, but performance tends to decrease as the number of dimensions increases.
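As a rough sketch of what updating a slowly changing dimension can look like, the snippet below shows the two most common strategies, usually called Type 1 (overwrite the old value) and Type 2 (keep the old row and add a new version). Those labels, the customer dimension and every field name here are my own choices for illustration, not something prescribed by a particular tool.

```python
from datetime import date


def apply_type1(dimension, customer_id, new_city):
    """Type 1: overwrite the attribute in place; the old value is lost."""
    for row in dimension:
        if row["customer_id"] == customer_id:
            row["city"] = new_city


def apply_type2(dimension, customer_id, new_city, change_date):
    """Type 2: close the current row and append a new version of it."""
    next_key = max((r["customer_key"] for r in dimension), default=0) + 1
    for row in dimension:
        if row["customer_id"] == customer_id and row["valid_to"] is None:
            row["valid_to"] = change_date  # old version stops being current
    dimension.append({
        "customer_key": next_key,          # new surrogate key
        "customer_id": customer_id,        # same source-system key
        "city": new_city,
        "valid_from": change_date,
        "valid_to": None,                  # None marks the current version
    })


# Hypothetical customer dimension with a single current row.
customer_dim = [{
    "customer_key": 1,
    "customer_id": 100,
    "city": "Boston",
    "valid_from": date(2007, 1, 1),
    "valid_to": None,
}]

# The customer moves: keep the Boston row for history, add a Denver row.
apply_type2(customer_dim, 100, "Denver", date(2008, 3, 1))
```

With the Type 2 approach, facts loaded before the move keep pointing at the Boston row through its surrogate key, while new facts point at the Denver row.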
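The items/categories/item_category example can be tried out with Python's built-in sqlite3 module. This is only a sketch: the column names and sample rows are assumptions, as is the item_dim table name, and it assumes each item belongs to exactly one category (an item in several categories would produce several rows in the collapsed dimension).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized tables; column names are assumptions for this sketch.
    CREATE TABLE items (item_id INTEGER PRIMARY KEY, item_name TEXT);
    CREATE TABLE categories (category_id INTEGER PRIMARY KEY, category_name TEXT);
    CREATE TABLE item_category (item_id INTEGER, category_id INTEGER);

    INSERT INTO items VALUES (1, 'Widget'), (2, 'Gadget');
    INSERT INTO categories VALUES (10, 'Tools'), (20, 'Toys');
    INSERT INTO item_category VALUES (1, 10), (2, 20);

    -- Collapse the three tables into a single denormalized item dimension.
    CREATE TABLE item_dim AS
    SELECT i.item_id, i.item_name, c.category_name
    FROM items i
    JOIN item_category ic ON ic.item_id = i.item_id
    JOIN categories c ON c.category_id = ic.category_id;
""")

# Normalized form: two joins just to find each item's category.
normalized_rows = conn.execute("""
    SELECT i.item_name, c.category_name
    FROM items i
    JOIN item_category ic ON ic.item_id = i.item_id
    JOIN categories c ON c.category_id = ic.category_id
    ORDER BY i.item_id
""").fetchall()

# Data mart form: the category sits in the same row as the item.
star_rows = conn.execute(
    "SELECT item_name, category_name FROM item_dim ORDER BY item_id"
).fetchall()
```

Both queries return the same rows. The difference is that a fact table joined to item_dim needs one join to reach the category, where the normalized form needs three.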
The difference illustrated
In this mock ERD you can see three schemata representing sales orders. The star schema is very denormalized, having only four tables which represent the subject. The data warehouse schema, on the other hand, is very normalized and requires tens of tables to represent the same subject. The snowflake schema is a compromise between the two extremes.
Is one better than the other?
In the game of data warehousing, a combination of these methods is of course allowed. A company might take the top-down approach where they maintain a large historical data warehouse, but they also build data marts for OLAP analysis from the warehouse data. A different approach is to build a relational warehouse from multiple data marts, or the so-called bottom-up approach to data warehousing.
There is also a cousin of the star schema in which the dimensions are normalized. This type of schema is usually called a snowflake schema. The three item/category/item_category tables in the warehouse schema example would be considered a snowflake. A dimension table (item) must be joined to additional tables (item_category, category) to find the category. Snowflake schemas are not as popular as star schemas because they tend not to perform as well, particularly as the volume of data in the database increases.
So what is the big deal?
From an OLAP performance standpoint, many databases will perform better on a star schema than on a snowflake or fully normalized schema at data warehouse volumes. This is in large part because commercial database software supports hash joins, bitmap indexes, table partitioning, parallel query execution, clustered tables and materialized views. These features make working with a star schema much easier than it may be on MySQL, but it is definitely possible to use MySQL as long as the right tools and techniques are used. There are ways to add some of these features to MySQL as well, but that is a topic for a later post.
In the next post I’ll talk more about Mondrian and about MDX, the multi-dimensional query language. Mondrian turns MDX into SQL, so we’ll also look at the kinds of queries which are generated by OLAP analysis.