Flexviews – part 3 – improving query performance using materialized views

Combating “data drift”

In my first post in this series, I described materialized views (MVs). An MV is essentially a cached result set at one point in time. The contents of the MV will become incorrect (out of sync) when the underlying data changes. This loss of synchronization is sometimes called drift. This is conceptually similar to a replication slave that is behind. Until it catches up, the view of the data on the slave is “behind” the changes on the master. An important difference is that each MV could have drifted by a different length of time.

A view which has drifted out of sync must be refreshed. Since an MV drifts over time from the “base tables” (those tables on which the view was built) there must be a process to bring them up-to-date. Flexviews includes two different refresh methods. Each method is named after the way in which the contents of the view are updated. The two methods are complete refresh and incremental refresh.   The view type/refresh method are interchangeable since a refresh method is selected at the time of view creation and may not be changed. If I talk about a ‘complete MV’, I mean a ‘MV for which the complete refresh method was selected at creation’.  Also, when I say ‘view’ I mean MV.

I will fully describe each method below, but the important thing to know right now is that there are two significant differences between the methods.

  • A view which uses the complete refresh method must be rebuilt from scratch each time the view is refreshed. This is similar to storing the results of a query in Memcache. When the cache “expires”, the contents must be fully recalculated..
  • An incrementally refreshable view (that is, one which uses the incremental method) can be updated much more efficiently, by examining a history of the rows which have changed since the view was either created, or last refeshed.

The incremental method is somewhat conceptually similar to using mysql binary logs for point-in-time recovery after a restoring a backup. Since the backup is a essentially a ‘snapshot’ of the data at the time of the backup, upon restoration the backup may be ‘out of date’ since changes to the database may have happened since the backup was taken. Replaying the binary logs brings the database up-to-date by replaying the changes.

An MV is also like a snapshot. The incremental refresh method uses the changelogs collected by FlexCDC to update the snapshot to reflect the changes that have happened in the database, instead of replaying binary logs directly.

Selecting a refresh method

The refresh method you select is determined by the following two major factors:

  1. The cost (in terms of execution time) of the query on which the view is based.
  2. The SQL features used in the query on which the view is based.

The cost of the query

The complete method completely rebuilds an MV from scratch at each refresh. This means that the minimum amount of drift for this type of MV is the amount of time it takes to execute the query on which the view is based.  An incrementally refreshable view, on the other hand, can usually be refreshed very quickly, depending on the amount of data which has changed since the last refresh. This is because the incremental refresh method must examine only those rows related to the ones which changed since the last time the view was refreshed.

The incremental refresh method can be many orders of magnitude faster than the original query execution time. It may be possible, therefore, to refresh an MV based on a query that takes 45 minutes to execute with a greater frequency than 45 minutes, even as frequently as every minute, perhaps.

The refresh method you select effectively controls the minimum amount of time it takes to refresh the view, and therefore the minimum amount of drift it will encounter. Keep in mind that you can build complete refresh views on top of incrementally refreshable view, which can give you a lot of flexibility to get the results you want, but more quickly. More on this later.

SQL features used in the query

Incremental refresh doesn’t support all of the SQL features available to SELECT statements.

Refresh type Aggregation Outer join All SQL functions Built using SQL Requires FlexCDC
COMPLETE Y Y Y Y N
INCREMENTAL Y* N N** N*** Y

*All aggregate functions supported, except GROUP_CONCAT and AVG(distinct).
**non-deterministic functions like RAND(),NOW(), etc, are not supported.
***There is a script to convert SQL to the Flexviews API (see convert.php below)

Refresh methods

The incremental refresh method

The incremental refresh method uses table changelogs which are created by FlexCDC. The refresh algorithm computes changes in multiple transactions. After each transaction it must wait for FlexCDC to process changes from that transaction. In practice, this means that it takes a minimum of a few seconds to incrementally refresh a view, even when the number of changes is very small.

The Flexviews SQL_API

Unlike views which use the complete refresh method, incrementally refreshable views are not built directly from SQL. Instead, Flexviews includes a MySQL stored procedure API, called the SQL_API which is used to define the view. These stored procedures maintain the Flexviews data dictionary. The data dictionary is used by the incremental refresh algorithm to build and maintain the view. You should always use the stored procedures to modify the dictionary. Do not modify it directly.

convert.php
You are probably unfamiliar with the SQL_API. In order to make working with the Flexviews easier, it includes a script called ‘convert.php’. This script makes it easy to create incrementally refreshable views from SQL statements. It reads one or more “CREATE TABLE db.schema … AS SELECT” and/or “INSERT INTO db.schema … AS SELECT” statements from standard input, and outputs the SQL_API statements representing the original SQL statements. Each MV name will be taken from the table name specified in each statement. You should try each statement for correctness before you attempt to convert it. You can easily check that it parses by running the SELECT portion, adding LIMIT 0 to the SELECT part of the query.

convert.php
This script takes as one argument the default database. Notice in the following example, the schema is not specified.

The MV will be placed into whatever schema is specified as the first argument on the command line:

The script (convert.php) supports basic queries which use select/group by/join/where. You may not use sub-queries or any non-deterministic functions like NOW() or RAND(). HAVING clauses, ORDER BY clauses, etc, are not supported. You can work around many of these limitations. I will discuss one of the workarounds below, in the ‘complete refresh’ section. I’ll discuss others in future blog posts.

Enable the view to use it

The SQL_API call flexviews.enable() is used to actually build the contents of the view, making it available for querying:

Data dictionary

Here is a quick example of the list of tables used by the above view, as stored in the data dictionary:

Using the dictionary

The following SQL is dynamically generated by the flexviews.get_sql() function call. This function reads the meta-data stored in the Flexviews data dictionary. It returns the SQL which represents the data stored in the view. This is much more convenient than trying to read from the dictionary directly to determine the contents of the view.

The complete refresh method

This method is actually quite simple in its operation. It will completely rebuild a view based on the SQL which defines it, each time the view is refreshed.

This is not much different from building a table with CREATE TABLE .. AS SELECT. In fact, this is actually part of what the COMPLETE refresh method does during its operation. It also takes care of atomically* replacing the new contents of the view with the old contents (using RENAME TABLE) which means that the view remains available for querying, even during refresh.

A complete refresh type view can use non-deterministic functions like NOW(). In fact any valid SELECT statement can be materialized. The definition of the MV is stored internally, and represents the actual SQL that defines the view. A simple API call is made to associate the SQL definition with the view, but the definition itself is a SQL statement, similar to a regular view. If you choose to use only the complete refresh method, then FlexCDC is not required to use Flexviews.

In the following example, the total sales amounts are calculated (for all time) from a monthly summary. The monthly summary is actually another MV. This saves a significant amount of time. If this MV did not access the other MV, then the SQL necessary to compute this list would take over 40 minutes to execute, because that is how long it takes to create the monthly summary from scratch.

Here is an example of a complete refresh MV which computes the total sales for all customers. It computes this from the dashboard_top_customers MV, ordering them in descending order: