This is the first of a series of posts about business intelligence tools, particularly OLAP (or online analytical processing) tools using MySQL and other free open source software. OLAP tools are a part of the larger topic of business intelligence, a topic that has not had a lot of coverage on MPB. Because of this, I am going to start out talking about these topics in general, rather than getting right to gritty details of their performance.
I plan on covering the following topics:
- Introduction to OLAP and business intelligence. (this post)
- Identifying the differences between a data warehouse, and a data mart.
- Introduction to MDX queries and the kind of SQL which a ROLAP tool must generate to answer those queries.
- Performance challenges with larger databases, and some ways to help performance using aggregation.
- Using materialized views to automate that aggregation process.
- Comparing the performance of OLAP with and without aggregation over multiple MySQL storage engines at various data scales.
What is BI?
Chances are that you have heard the term business intelligence. Business intelligence (or BI) is a term which encompasses many different tools and methods for analyzing data, usually presenting it in a way that is easily consumed by upper management. This analysis is often used to determine how effectively the business has been at meeting certain performance goals, and to forecast how they will do in the future. To put it another way the tools are designed to provide insight about the business process, hence the name. Probably the most popular BI activity for web sites is click analysis.
As far as BI is concerned, this series of posts focuses on OLAP analysis and in a lesser sense, on data warehousing. Data warehouses often provide the information upon which OLAP analysis is performed, but more on this in post #2.
OLAP? What is that?
OLAP is an acronym which stands for online analytical processing. OLAP analysis, which is really just another name for multidimensional analysis, consists of displaying summary aggregations of the data broken down into different groups. A typical OLAP analysis might show “sale total, by year, by sales rep, by product category”. OLAP analysis is usually used for reporting on current data, looking at historical trends and trying to make predictions about future trends.
Multidimensional analysis is a form of statistical analysis. In multidimensional analysis samples representing a particular measure are compared or broken down into different dimensions. For example, in a sales analysis, the “sale amount” is a measure. Measures are always aggregated values. That is, total sales might be expressed as SUM(sale_amt). This is because the SUM of the individual sales will be grouped along different dimensions, such as by year or by product. I’m getting a little ahead of myself. Before we talk about measures and dimensions, we should talk about the two ways in which this information can be stored.
There are two main ways to store multidimensional data for OLAP analysis
OLAP servers typically come in two basic flavors. Some servers have specialized data stores which store data in a form which is highly effective for multidimensional analysis. These servers are termed MOLAP and they tend to have exceptional performance due to their specialized data store. Almost all MOLAP solutions pre-compute many (or even all) of the possible answers to multi-dimensional queries. Palo is an example of an open source version of this technology. ESSbase is an example of closed source product. MOLAP servers often feature extensive compression of data which can improve performance. Loading data into a MOLAP server usually takes a very long time because many of the answers in the cube must be calculated. The extra time spent during the load is usually called “processing” time.
A relational OLAP (or ROLAP) server uses data stored in an RDBMS. These systems trade the performance of a multidimensional store for the convenience of an RDBMS. These servers almost always query over a database which is structured as a STAR or snowflake type schema. To go back to the sales analysis example above, in a STAR schema the facts about the sales would be stored in the fact table, and the list of customers and products would be stored in separate dimension tables. Some ROLAP servers support the aggregation of data into additional tables, and can use the tables automatically. These servers can approach the performance of MOLAP with the convenience of ROLAP, but there are still challenges with this approach. The biggest challenges are the amount of time that it takes to keep the tables updated and in the complexity of the many scripts or jobs which might be necessary to keep the tables in sync. Part five of my series will introduce materialized views which attempt to address these challenges in a manageable way.
What makes a ROLAP so great?
An OLAP server usually returns information to the user as a ‘pivot table‘ or ‘pivot report’. While you could create such a report in a spreadsheet, the ROLAP tool is designed to deal with millions or even billions of rows of data, much more than a spreadsheet can usually handle. MOLAP servers usually require that all, or almost all of the data must fit it memory. Another difference is the ease by which this analysis is constructed. You don’t necessarily have to write queries or drag and drop a report together in order to analyze multidimensional data using an OLAP tool.
ROLAP tools use star schema
As I said before, a sale amount would be considered a measure, and it would usually be aggregated with SUM. The other information about the sale, such as the product, when it was sold and to whom it was sold would be defined in dimension tables. The fact table contains columns which are joined to the dimension tables, such as product_id and customer_id. These are often defined as foreign keys from the fact table to the dimension tables.
A note about degenerate dimensions:
Any values in the fact table that don’t join to dimensions are either considered degenerate dimensions or measures. In the example below the status of the order is a degenerate dimension. A degenerate dimension is stored as an ENUM in many cases. In the example below that there is no actual dimension table which includes the two different order statuses. Such a dimension would add an extra join, which is expensive. Any yes/no field and/or fields with a very low cardinality (such as gender or order status) will probably be stored in the fact table instead of in a dedicated dimension. In the “pivot data” example above, all the dimensions are degenerate: gender, region, style, date.
Often a dimension will include redundant information to make reporting easier, a process called “denormalization”. Hierarchical information may be stored in a single dimension. For example, a dimension for products may include both the category AND a sub-category. A time dimension includes year, month and quarter. You can create multiple different hierarchies from a single dimension. This allows ‘drill down’ into the dimension. By default the data would be summarized by year, but you can drill down to quarter or month level aggregation.
The screenshots here in the jPivot (an OLAP cube browser) documentation can give you a better idea about the display of data. The examples break down sales by product, by category, and by region.
The information is presented in such a fashion that it can be “drilled into” and “filtered on” to provide an easy to use interface to the underlying data. Graphical display of the data as pie, line or bar charts is possible.
Focusing on ROLAP.
This is the MySQL performance blog, and as such an in depth discussion of MOLAP technology is not particularly warranted here. Our discussion will focus on Mondrian. Mondrian is an open source ROLAP server featuring an in-memory OLAP cache. Mondrian is part of the Pentaho open source business intelligence suite. Mondrian is also used by other projects such as Wabit and Jaspersoft. If you are using open source BI then you are probably already using Mondrian. Closed source ROLAP servers include Microstrategy, Microsoft Analysis Services and Oracle BI.
Mondrian speaks MDX, olap4j and XML for analysis. This means that there is a very high chance that your existing BI tools (if you have them) will work with it. MDX is a query language that looks similar to SQL but is actually very different. Olap4j is an OLAP interface for java applications. XML for analysis (XMLA) is an industry standard analytical interface originally created by Microsoft, SAS and Hyperion.
Next we’ll talk about the difference between data marts and data warehouses. The former are usually used for OLAP analysis, but they can be fundamentally related to a warehouse.