Using Flexviews – part two, change data capture

In my previous post I introduced materialized view concepts. This post begins with an introduction to change data capture technology and describes some of the ways in which it can be leveraged for your benefit. This is followed by a description of FlexCDC, the change data capture tool included with Flexviews. It continues with an overview of how to install and run FlexCDC, and concludes with a demonstration of the utility.

As a reminder, the first post covered the following topics:

  1. What is a materialized view(MV)?
  2. It explained that an MV can pre-compute joins and may aggregate and summarize data.
  3. Using the aggregated data can significantly improve query response times compared to accessing the non-aggregated data.
  4. Keeping MVs up-to-date (refreshing) is usually expensive.
  5. A change data capture tool can be used to implement an efficient way of refreshing them.

 

What is Change Data Capture (CDC)?

As the name implies, CDC software captures the changes made to database rows and makes those changes available in a convenient form which can be accessed by other programs. CDC applications exist for many commercial databases but until recently this type of software was not available for MySQL.

Change Data Capture can be used to:

  1. Monitor a database table, or tables for changes.
  2. Improve ETL performance by identifying the data which has changed.
  3. Maintain materialized views with Flexviews (the primary purpose of FlexCDC).
  4. Feed search engines like Sphinx or Solr only the rows that change.
  5. Feed third party replication systems.
  6. Provide data to “external triggers” such as Gearman jobs.

CDC tools usually operate in one of the following ways:

  1. Timestamps (usually more than one) to identify rows that have changed
  2. Triggers to capture changes synchronously
  3. Database log reading to capture the changes asynchronously

The first method has serious drawbacks, such that it can’t identify deleted rows and MySQL timestamps may not be flexible enough.

The trigger method has a lot of problems. Triggers add a significant overhead. When the structure of a table is changed, the triggers must be changed. The work in the trigger is immediate and affects every transaction. Finally, MySQL has limited trigger support, some of which is the cause of the aforementioned problems. The biggest problem, at least from standpoint of how Flexviews works, is that triggers can not, under normal conditions, detect the commit order of transactions. This above all makes triggers an unacceptable CDC method.

This leaves the third method, log based capture as the best option because it imposes less overhead than triggers and change data capture may be done remotely and asynchronously.

 

Binary log based CDC

The CDC tool included with Flexviews is called FlexCDC. It seemed like an appropriate name. The Binary Log is the MySQL log which records changes to tables in the database. FlexCDC reads the binary log to determine what rows have changed. Because of this, FlexCDC requires that you use row-based binary logs (RBR). If you don’t have MySQL 5.1 or aren’t using RBR, then it is possible to set up a dedicated MySQL slave which has log_slave_updates=1 and binlog_format=row to process SBR changes from a MySQL master. I’ll talk more about that in another blog post.

FlexCDC does not implement a full binary log parser. Instead, it invokes the ‘mysqlbinlog’ utility and it processes the predictable output of this program. mysqlbinlog will always be able to read the binary logs of the version of mysql it ships with (and previous versions) so there is no worry about binary log format changes. FlexCDC is written in PHP so it is portable.

 

Setting up FlexCDC

FlexCDC has some basic requirements:

  • MySQL 5.1+
  • row based logging (binlog_format=1)
  • unique server_id in the my.cnf
  • log_slave_updates=1, if this is a MySQL slave
  • transaction_isolation=READ-COMMITTED

You can get FlexCDC directly out of the Flexviews SVN. I suggest that you just grab all of Flexviews:

Next you have to customize the example ini file. FlexCDC is located in the flexviews/consumer/ subdirectory.

Create the settings file:

Change to the flexviews/consumer directory and copy the consumer.ini.example file to consumer.ini and edit it, making appropriate changes. The file is well commented. The example settings file should work for most MySQL installations which allow connections as root with no password from localhost. It is possible to read from and/or write to remote servers, but this example uses the local machine which is the usual configuration for Flexviews since it requires local access to the tables and the changelogs in order to maintain materialized views. Most database servers have some spare CPU for binary log parsing.

Run the setup script:

This will create the metadata tables and capture the initial binary log position.

If the setup detects any problems (such as binary logging not being enabled) it will exit with an error. It will exit with a message “setup completed” otherwise.

 

Verify installation

The binary log stores it progress in a metadata table:

If you select from that table you won’t see anything changing, even if you are writing into your database. This isn’t anything to worry about, since the background process isn’t running yet.

Starting the background process:

FlexCDC includes a consumer_safe.sh script that will start up a copy of FlexCDC and restart it if it exits with error. You can shut down FlexCDC by sending it a HUP signal. The script will drop a .pid file so you know what process to HUP.

 

Adding a changelog to a table

FlexCDC copies the contents of database rows which change into special tables called changelogs. Each changelog is located in the flexviews database and is named $SCHEMA_$TABLE where $SCHEMA is the schema in which the source table is located and $TABLE the name of the source table. If that is confusing it should be clear in a moment.

Lets create a table, insert some rows, add a change log, delete rows and then insert some more of them:

Even though FlexCDC is running in the background, it didn’t capture any changes from that insert. We need to add the table to the list of tables to changelog. There is a utility included with FlexCDC called ‘add_table.php’. This script automates the process of adding a table to the list of tables to changelog. It does this by adding an entry to the flexviews.mvlogs metadata table, and it creates the changelog table itself.

Note that you can enable auto_changelog=true in the config file to automatically record changes for any table, starting from the first time a change is seen for that table. This is generally only useful if you have a small number of tables, and you want to track changes on all of them.

You may have also noted that I did not include –ini=consumer.ini. This is because this is the default config filename to search for. I included it in the earlier examples for illustration purposes.

 

Examine the changes

Now that the changelog has been added, any changes to test.demo will automatically be captured.

Insert data in one transaction (two rows):

And delete data in a second transaction:

The changelog is flexviews.test_demo. This because the source table is test.demo.

As you can see, there are three rows in the changelog, each representing one of the changes we made.

You will notice that the source table only has two columns, but the changelog contains five. All change logs contain three additional metadata columns: dml_type, uow_id and fv$server_id. These columns represent the type of change (insert is 1, delete -1), the transaction order and the source of the changes, respectively.

Finally, note that the two insertions happened inside of the same transaction, and that the insertions happened before the deletion. Though they are none shown here, updates would be represented by a deletion followed by an insertion.

Share this post

Comments (56)

  • aivan

    Hi jus