September 23, 2014

Data mart or data warehouse?

This is part two in my six part series on business intelligence, with a focus on OLAP analysis.

  1. Part 1 – Intro to OLAP
  2. Identifying the differences between a data warehouse and a data mart. (this post)
  3. Introduction to MDX and the kind of SQL which a ROLAP tool must generate to answer those queries.
  4. Performance challenges with larger databases, and some ways to help performance using aggregation.
  5. Using materialized views to automate that aggregation process.
  6. Comparing the performance of OLAP with and without aggregation over multiple MySQL storage engines at various data scales.

What is a data warehouse?
It turns out that this question is a little more difficult to answer than it probably should be. This is because data warehousing has become an overloaded term that includes BI tools (OLAP/data mining), data extraction and transformation tools (ETL), and schema management tools.

To me, the definition of data warehouse is “A relational database schema which stores historical data and metadata from an operational system or systems, in such a way as to facilitate the reporting and analysis of the data, aggregated to various levels.” This definition is a consolidation of various definitions that I have encountered.

There are a few key points here. First, data warehouses rarely contain information that exists no where else in an organization. The goal of data warehousing is to collect and make a historical record of the information from another system. This might be an ERP application, the logs from a web application, data from manufacturing systems or even data from radio telescopes. This data is extracted from the source system(s) and then cleaned up and inserted into the data warehouse with ETL tools. This process is usually called “conforming” the source data into the warehouse schema. Another important aspect of the definition is aggregation. A data warehouse is usually used to summarize data over years, months, quarters, or other time dimension attributes. This aids in identifying historical trends and making predictions about future trends. Data is often aggregated in many different ways. Aggregated data may be stored in aggregated tables so that it can be accessed quickly. This is particularly important as fact tables reach into the billions of rows and hundreds of gigabytes of information is accumulated. Accessing this data outside of summarized form often takes a very long time.

Is there a particular schema design which lends itself to this historical analysis?
There are two main methodologies which are practiced when it comes to designing database schemata for database warehouse applications. These two methodologies approach the problem of storing data in very different ways.

The first methodology was popularized by Bill Inmon, who is considered by many to be the “father” of the data warehouse, or at least the first dw “evangelist” if you will. This approach focuses on the normalization of data. Highly normalized schema are created and maintained by ETL jobs. Creating and maintaining these jobs is often one of the biggest parts of designing and running a data warehouse. A particular combination of ETL jobs which consist of one or more data transformations is usually called a “flow”. An example ETL flow might combine data from item and category information into a single dimension, while also maintaining the historical information about when each item was in each category.

These types of warehouses are almost always “insert only”. Data is very likely never updated or deleted in these databases and they are expected to grow to very large sizes, usually into the terabyte range, but sometimes even into petabytes. Aggregations are the exception to this rule, as they must be updated periodically to reflect the additions of data to the source tables. The goal of this methodology is the 100% accurate description of historical data from the operational system in a normalized manner which ensures that it is able to be updated quickly. It is accepted that analysis of the data will be more complex in this form, but that this complexity is an acceptable trade off for historical accuracy. This is often described as the “top-down” approach.

What is a data mart?
The second approach, popularized by Ralph Kimball holds that partial de-normalization of the data is beneficial. The goal of a this approach is usually multi-dimensional (OLAP) analysis as it is very hard to create a dimensional model from a highly normalized database schema. It is particularly difficult to build such a model that scales as the volume in the warehouse increases. For this reason OLAP analysis usually is performed on a star schema which partially denormalizes the data. A star schema about a particular subject matter, such as sales, is usually referred to as a “data mart”. Maybe this is because they provide one stop shopping for all the information about the particular subject matter. That is pretty much what I imagine when I hear the phrase.

Data marts tend to be updated frequently, at least once per day. As I mentioned in my previous post, a star schema consists of a central table called the fact table and additional dimension tables which contain information about the facts, such as lists of customers or products. Because of the partially denormalized nature of a star schema, the dimension tables in a data mart may be updated. In fact, there is a term for such a dimension – A “slowly changing dimension” or SCD. The fact table is usually only inserted to, but older data may be purged out of it. Sometimes the fact table will be aggregated from source data. A website which sells banner ads might roll up all the events for a particular ad to the day level, instead of storing detailed information about every impression and click for the ad.

A normalized data warehouse schema might contain tables called items, categories and item_category. These three tables allow a user to determine which items belong to which categories, but this structure creates a large number of joins when many dimensions are involved. A data mart would collapse all of this information into an item dimension which would include the category information in the same row as the item information. It would be possible create two different dimensions, product and category, but performance tends to decrease as the number of dimensions increases.

The difference illustrated
In this mock ERD diagram you can three schemata representing sales orders. The star schema is very denormalized, having only four tables which represent the subject. The data warehouse schema, on the other hand, is very normalized and requires tens of tables to represent the same subject. The snowflake schema is a compromise between the two extremes.

Mock ERD Diagram

Typical star, snowflake and data warehouse schemata.

Is one better than the other?
In the game of data warehousing, a combination of these methods is of course allowed. A company might take the top-down approach where they maintain a large historical data warehouse, but they also build data marts for OLAP analysis from the warehouse data. A different approach is to build a relational warehouse from multiple data marts, or the so-called bottom-up approach to data warehousing.

There is also a cousin of the star schema in which the dimensions are normalized. This type of schema is usually called a snowflake schema. The three table item/category/item_category tables in the warehouse schema example would be considered a snowflake. A dimension table (item) must be joined to additional tables (item_category,category) to find the category. These are not as popular as star schemas because they tend to not perform as well as a star schema, particularly as the volume of data in the database increases.

So what is the big deal?
From a OLAP performance standpoint, many databases will perform better on a star schema than on a snowflake or fully normalized schema at data warehouse volumes. This is in large part because commercial database software supports hash joins, bitmap indexes, table partitioning, parallel query execution, clustered tables and materialized views. These features make working with a star schema much easier than it may be on MySQL, but it is definitely possibly to use MySQL as long as the right tools and techniques are used. There are ways in which can add some of these features to MySQL as well, but that is a topic for a later post.

In the next post I’ll talk more about Mondrian and about MDX, the multi-dimensional query language. Mondrian turns MDX into SQL, so we’ll also look at the kinds of queries which are generated by OLAP analysis.

About Justin Swanhart

Justin is a Principal Support Engineer on the support team. In the past, he was a trainer at Percona and a consultant. Justin also created and maintains Shard-Query, a middleware tool for sharding and parallel query execution and Flexviews, a tool for materialized views for MySQL. Prior to working at Percona Justin consulted for Proven Scaling, was a backend engineer at Yahoo! and a database administrator at Smule and Gazillion games.

Comments

  1. Nice writeup, and a great series – keep it up!

    two things I noticed:

    #1: “The fact table is usually only inserted to, but older data may be purged out of it.”
    I’d say this is true in general, but there is one particular case of fact table that is frequently updated – it is the type of fact table that Ralph Kimball calls the accumulating snapshot. In this kind of fact table, one row represents one single business process, and as the process develops in time and acquires a new state (out of a set of pre-defined states) the row is updated to store all data relevant to that particular state.

    A simple example can be set up for the sakila sample database: the rental process has a least two distinct states, the rental and the return. So an accumulating snapshot would at least include a link to the date dimension for the rental data, and one for the return date. The former would be filled in when the fact row is created, the latter would be updated as soon as return occurs. A useful metric to record would be the rental duration, which would be updated also at the time of the return.

    #2: “Ralph Kimball holds that partial de-normalization of the data is beneficial.”
    Yes, but I’d say it’s more specific than that – it’s not just denormalized, the main point of Kimballs method is to use a dimensional model, with a single central fact table (which is typically normalized) to store the metrics of interest and link them to the dimension tables to record context in which they were measured.

    From this perspective the snowflake and the star are both examples of a dimension model, it’s just that in a star the dimensions are implemented using a single (denormalized) dimension table, whereas in the snowflake the dimension tables are normalized (to some extent), typically having a single dimension table for each level for the hierarchies along which the metrics are to be rolled up.

  2. Justin Swanhart says:

    Thanks Roland! Good points.

    The accumulating snapshot is a snapshot (aka materialized view or summary table). It might be built from two tables containing rental and return information. Since the fact table itself is a summary, it is excepted from the “insert only” rule. The same thing is true for fact tables that are aggregated to a particular grain. These tables are often inserted into with “ON DUPLICATE KEY UPDATE” and the measures are adjusted appropriately.

  3. Hi Justin!

    I don’t think the “snapshot” in “accumulating snapshot” is the one you’re thinking of – I mean what is discussed in this article:

    http://www.rkimball.com/html/designtipsPDF/DesignTips2002/KimballDT37ModelingPipeline.pdf

    So, just a normal fact table, no aggregation or materialized views going on.

  4. Justin Swanhart says:

    Thanks, I understand what you mean now.

  5. James says:

    These articles have been really interesting/useful. Will the rest of the six post series appear on the blog at some stage?

  6. Justin Swanhart says:

    Hi James,

    I simply haven’t had the free time lately to work on the next post.

    I am, however, going to be flying next week. That will likely give me some time to work on this.

    Stay tuned. As long as all the stars line up properly the next post will be out sometime in the next two weeks. :)

  7. James says:

    hurray! Thanks for the reply

  8. faruk says:

    I have written a short paper about this subject, so anyone is welcome to read!
    Check the link: http://faruk.ba/site/?p=87

  9. Per says:

    “A normalized data warehouse schema might contain tables called items, categories and item_category. These three tables allow a user to determine which items belong to which categories, but this structure creates a large number of joins when many dimensions are involved. A data mart would collapse all of this information into an item dimension which would include the category information in the same row as the item information.”

    I don’t get it. This schema suggests that there is a many-to-many relationship between items ans categories. (Those are considered dimensions, not facts, right?) But how can the items table row have all its categories in a single column?

    (Yes, I have exactly this problem in my data mart with SCD, and it requires some brutal joins).

    Your web page is otherwise excellent. First-class style; Clear, concise and complete. You should publish a book on the subject!

  10. Per,

    indeed, solving many to many relationships in a star schema is a challenge. If you want to google for it, look for “multi-valued dimensions”
    There are solutions though, and there isn’t one ‘right’ answer – it depends on the requirements.

    In some cases, it’s acceptable to create a multivalued member in the dimension table: say, a list of categories.
    A slightly more structured solution is to create a separate flag column for each category (and yes, dimension table will need to be altered whenever a new category is added)
    Another solution is to use a bridge table with an allocation factor. In this case the categories would be a separate dimension and have an intersection table with the fact table – the “bridge table”. In this table, you’d store a factor that expresses the partial contribution of the dimension entry to the fact entry. I can’t think of a good example for this approach in the product/category example, but I have set up an example in the “Pentaho Solutions” book that uses this approach to have an “actor” dimension table for film customer orders.

    The actor/film customer order example works like this: For each actor that stars in a film this bridge table contains an actor_id, and a film_id and a factor that is 1/#number of actors in the film. Sometimes this is called a weight and it serves to model the relative contribution of each actor. This way, if you do a query like “What is the order value for films starring a particular actor”, you can multiply the metric value from the fact table with the weight and still get a result that makes sense – kinda. You really need the weight for this type of query if you want to calculate the value of multiple actors: for example, if you’re asking about the value of all customer orders for films starring Robert de Niro or Al Pacino, you want to prevent counting the films starring both Robert de Niro and Al Pacino twice.

    All these approaches are explained here too: http://www.pythian.com/news/364/implementing-many-to-many-relationships-in-data-warehousing/
    Or if you’re interested, pick up a copy of “Pentaho Solutions” – apart from being an all-round pentaho starter’s guide it also explains these basic data warehousing techniques, and illustrates them with examples.

  11. faruk says:

    I have written a short paper about this subject, so anyone is welcome to read!
    Check the link: http://faruk.ba/?p=87

  12. Juan says:

    Hi congrats for the article. It’s very impressive. I’d like to know when you will be finishing the others topics from the list.

    I’m learning the OLAP/OLTP/Cubes concepts and i need some guide.

    thanks

  13. Hi Juan,

    not trying to hijack the thread, but I co-authored a book on BI and data warehousing which is, even if I do say so myself, a pretty good mix between theory and hands-on. It’s mainly about Pentaho, but it contains an extensive example case to build a (kimball-style) data warehouse using MySQL. You can find the book here on amazon:

    http://www.amazon.com/Pentaho-Solutions-Business-Intelligence-Warehousing/dp/0470484322

    You can get a sample chapter, toc and index here:

    http://www.wiley.com/WileyCDA/WileyTitle/productCd-0470484322.html

  14. Art says:

    Great article. Did you do a part 3-6? I did not see them in your posts. I would appreciate an email if you did. Thanks.

  15. Great and very interesting blog. I think it’s also an informative. Thanks for sharing.

    Please visit our website: http://www.letusbeyourwarehouse.com./

  16. Ritwick says:

    Thanks for the article as i was looking for a difference between datamart and data warehouse for my engineering assignment. Nicely written and easy to understand. Thanks.

  17. David says:

    So far this series is off to a great start.

    Did you ever get a chance to write up the other parts to this series?

  18. Andre says:

    The first two entries in this series had really been fantastic, its a shame that it seems to have been abandoned. Justin, if you do decide to complete the series, I know it will be greatly appreciated.

  19. Elliot says:

    This series has been really helpful. I know its really old that this point but I was really looking forward to the 5th post in the series.

  20. Lars Nymand says:

    It is so sad that this series stopped. The articles was great and easy reading. I hope you find the time to finish it.

Speak Your Mind

*