November 20, 2014

Trawling the binlog with FlexCDC and new FlexCDC plugins for MySQL

Swanhart-Tools includes FlexCDC, a change data capture tool for MySQL. FlexCDC follows a server’s binary log and usually writes “changelogs” that track the changes to tables in the database. I say usually because the latest version of Swanhart-Tools (only in github for now) supports FlexCDC plugins, which allow you to send the updates to a remote data source, or to any other place of your liking.  You can find out more about FlexCDC basics in a previous blog post.

Please note that FlexCDC still needs to have source and destination instances defined in the configuration, even when using plugins.  This is because the FlexCDC state (how much into which binary log has FlexCDC progressed, and what tables are being captured) is stored in the “dest”.  Normally when using a plugin, the source and destination instances will be the same. FlexCDC will create a ‘flexviews’ database with a number of state tables in the destination instance.  This also means you still have to use the create_mvlog.php add_table.php or Flexview’s create_mvlog(…) procedure to mark which tables to capture!  See the previous blog post about FlexCDC.

When you create the mvlog, there will still be a changelog table created in the dest, just like when not using a plugin. This is because the INFORMATION_SCHEMA is used to get column datatypes and additional information (such as if an int is signed or unsigned) and this lookup is done against the table in the dest. The reason this is needed, is because mysqlbinlog, the utility used to scrape the binlog, produces strange output for large signed integers (it provides the signed and unsigned version), thus FlexCDC must figure out the right one to choose from the actual DDL of the changelog table. FlexCDC can’t look at the DDL of the source table though, because the consumer may be behind, and the current structure may not match the structure of the rows in the log.

The new plugin system allows you to do a lot of nifty things like:

  • Replicate to external databases
  • Publish changes to a message queue (this is like Facebook’s Wormhole)
  • Keep a remote cache server in sync
  • and more…

The latest version of Swanhart-Tools includes an Example plugin (in flexviews/consumer/include/example_plugin.php) that simply prints the events that come through it, not logging them into the changelog table at all. There is an example of the output at the end of the post.

The example plugin looks like this:

Important Note: You must define all seven of these functions in your plugin, even if you do not plan to have actions for each of the callbacks – just leave the function body empty to do no action (the call is simply a noop that case.) Note that the plugin functions must be declared STATIC.  This is due to the way that FlexCDC calls the functions.

Transaction state callbacks
There are three callback functions which notify the plugin of changes in transaction state. Before I go into what they do, I want to note the $trx_id and $gsn parameters which are present in every callback. Each transaction is assigned a monotonically increasing transaction identifier. This sequence uniquely identifies each transaction that FlexCDC processes. In addition, each row change is assigned a unique sequence number which FlexCDC calls the Generic Sequence Number (or GSN).

As you can see, the start_trx(…) callback (called when a transaction starts) is passed both the new transaction number and also the highest GSN used in the previous transaction. This is called the GSN high water mark (GSNHWM). At transaction commit, the commit_trx(…) callback is called and the transaction id and the last GSN assigned in the transaction are passed into the callback. This same value will appear as the GSNHWM in the next start_trx(…) callback. Finally, at rollback any sequence numbers assigned in that transaction will be re-used, so no GSN is passed to the rollback callback, but a transaction id is, which lets you determine exactly which transaction is rolling back.

Row change callbacks

Each of the four row change callback functions capture a particular change to the data set. Each of the functions take five parameters. The first ($row) is an array which contains the row being acted upon. The second ($db) is the schema which contains the row. The third ($table) is the table that contains the row. Each callback also receives the transaction identifier, and of course, each row change is assigned a unique GSN.

For example:
An update will fire both update_before(…) and update_after(…) callbacks with the row images before and after the change, respectively. There is an example of this at the end of the post.

Configuring FlexCDC to use a plugin
FlexCDC uses a configuration file called consumer.ini by default.  To the [flexcdc] section add:
plugin=plugin_file.php

The plugin must be in the FlexCDC include/ directory.  You will find example_plugin.php in this directory, to serve as an example.

How it works
Flexviews uses mysqlbinlog to decode the binary log from the source server. It uses the –decode-rows=ROWS option to decode RBR into a format which can be parsed by an external utility. FlexCDC collects information about each transaction and the row changes that happen in the database (which means it requires ROW based binary logging to be used.)  When a plugin is defined the normal actions used by FlexCDC are overridden with the callback functions.

Here is the output from the example plugin, for an update that affected 3 rows (update test.t3 set c1 = c1 – 1):

One thing you should notice, is that FlexCDC provides column names for the data coming from the binary log. This is because the log table exists in the dest instance and FlexCDC can get the list of column names from there. When you use other CDC tools, like the C binlog API, you don’t get column names.

About Justin Swanhart

Justin is a Principal Support Engineer on the support team. In the past, he was a trainer at Percona and a consultant. Justin also created and maintains Shard-Query, a middleware tool for sharding and parallel query execution and Flexviews, a tool for materialized views for MySQL. Prior to working at Percona Justin consulted for Proven Scaling, was a backend engineer at Yahoo! and a database administrator at Smule and Gazillion games.

Comments

  1. For those interested, I revised the plugin interface slightly.

    Now more than one plugin is supported.

    plugin=file1.php,file2.php,file3.php

    You can now name the class for your plugin whatever you like and specify in what order plugins are called:
    plugin_order=Class1,Class2,Class3

    I added plugin_init() and plugin_deinit() functions

    I also added an instance of FlexCDC to each function call. That way you can access any public data on the FlexCDC object, including useful helper functions.

    Finally, an Interface is provided to base your class on.

Speak Your Mind

*