Enabling event-driven analytics with custom Tungsten Replicator filters and RabbitMQ

MySQL Case Studies
16 April 1:50PM - 2:40PM @ Ballroom E

Experience level: 
Intermediate
Duration: 
50 minutes conference

Technologies Covered:
* MySQL
* Tungsten Replicator (https://code.google.com/p/tungsten-replicator/)
* RabbitMQ (http://www.rabbitmq.com)

Takeaways:
* Get the right data to the right people at the right time.
* Custom filters for Tungsten Replicator enable lateral thinking with respect to data movement and use.

At Smartsheet, nightly ETL jobs have driven a variety of internal workflows (sales, marketing, operations, and business intelligence) for years. As we have grown, the number of workflows has increased and the volume of data they consume has more or less exploded. At the same time, our tolerance for latency is shrinking. The business's idea of "the right time" to respond to events doesn't always align with the ETL delivery schedule of "tomorrow morning."

To address these challenges -- rapid growth and a shrinking tolerance for latency -- we have begun rolling out a new system built around custom filters for Tungsten Replicator and RabbitMQ. This new system lets us respond to selected events in near real-time and grow by scaling horizontally.

The new architecture uses Tungsten Replicator with a "direct" pipeline for transaction-at-a-time access to the binlogs on the master. A Tungsten Replicator filter can examine and manipulate the set of rows changed in each transaction. We developed a custom filter for Tungsten that uses rules to define which row changes should be published to RabbitMQ. Each rule specifies row-matching criteria and the RabbitMQ "routing key" to use when publishing. Row matching is based on schema name, table name, and the type of change (INSERT, UPDATE, or DELETE). If a rule does not specify a routing key, a default of "schema name.table name.change type" is used (e.g., "ourdb.users.INSERT"). The message published is a JSON blob containing the primary key (columns and values) of the matched row. Because we are not using Tungsten Replicator to do actual replication, we implemented a second filter that simply discards every event it receives.
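The rule matching and default routing key described above can be illustrated with a small sketch. It is written in Python purely for illustration and is not the Smartsheet filter, which runs inside Tungsten Replicator. The `Rule` class, the `messages_for_row` helper, the use of `None` as a wildcard, the choice to publish once per matching rule, and the `sheets` table are all assumptions made for this example.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule shape; None means "match anything" (an assumption)."""
    schema: Optional[str] = None
    table: Optional[str] = None
    change_type: Optional[str] = None  # "INSERT", "UPDATE", or "DELETE"
    routing_key: Optional[str] = None  # None -> fall back to the default key

    def matches(self, schema: str, table: str, change_type: str) -> bool:
        return (
            self.schema in (None, schema)
            and self.table in (None, table)
            and self.change_type in (None, change_type)
        )


def default_routing_key(schema: str, table: str, change_type: str) -> str:
    # Default described in the abstract: "schema name.table name.change type".
    return f"{schema}.{table}.{change_type}"


def messages_for_row(rules, schema, table, change_type, primary_key):
    """Yield (routing_key, json_body) for every rule matching one row change."""
    # The body carries only the primary key columns and values.
    body = json.dumps(primary_key, sort_keys=True)
    for rule in rules:
        if rule.matches(schema, table, change_type):
            key = rule.routing_key or default_routing_key(schema, table, change_type)
            yield key, body


rules = [
    Rule(schema="ourdb", table="users", change_type="INSERT"),
    Rule(schema="ourdb", table="sheets", routing_key="sheets.changed"),
]
for key, body in messages_for_row(rules, "ourdb", "users", "INSERT", {"id": 42}):
    print(key, body)  # prints: ourdb.users.INSERT {"id": 42}
```

In a real deployment the yielded pair would be handed to a RabbitMQ client for publishing; that step is left out here so the sketch stays self-contained.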
A collection of analytic applications subscribes to the messages for the row changes each one cares about. This event-driven approach lets these applications deliver near real-time business intelligence, kick off workflows, and update various secondary and tertiary data stores with minimal latency. The event-driven model also scales analytic workloads horizontally in a natural way, enabling us to keep pace with ever-growing data volume and new business intelligence demands. Finally, RabbitMQ supports clients in a variety of languages, which gives analytic tool developers substantial flexibility.
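On the subscribing side, dotted routing keys such as "ourdb.users.INSERT" lend themselves to pattern-based subscriptions. The abstract does not say which exchange type is used, so the sketch below assumes a RabbitMQ topic exchange, where `*` in a binding pattern matches exactly one dot-separated word and `#` matches zero or more. It simulates the matching in plain Python with no broker; the consumer names and the `deliver` helper are hypothetical.

```python
import json


def topic_matches(pattern: str, routing_key: str) -> bool:
    """Topic-exchange style match: '*' is one word, '#' is zero or more words."""
    def match(p, k):
        if not p:
            return not k
        head, rest = p[0], p[1:]
        if head == "#":
            # Let '#' absorb 0..len(k) words and try the rest of the pattern.
            return any(match(rest, k[i:]) for i in range(len(k) + 1))
        if not k:
            return False
        return (head == "*" or head == k[0]) and match(rest, k[1:])

    return match(pattern.split("."), routing_key.split("."))


def deliver(bindings, routing_key, body):
    """Return (consumer, decoded primary key) for every matching binding."""
    primary_key = json.loads(body)
    return [
        (consumer, primary_key)
        for consumer, pattern in bindings
        if topic_matches(pattern, routing_key)
    ]


# Hypothetical consumers, each bound with a different level of selectivity.
bindings = [
    ("signup-dashboard", "ourdb.users.INSERT"),
    ("user-audit", "ourdb.users.*"),
    ("warehouse-sync", "ourdb.#"),
]

for consumer, pk in deliver(bindings, "ourdb.users.DELETE", '{"id": 42}'):
    print(consumer, pk)
```

With these bindings, a DELETE on `ourdb.users` reaches the audit and warehouse consumers but not the signup dashboard. Because each message carries only the primary key, a consumer that needs the full row would look it up itself.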


Speakers

Principal Systems Engineer, Smartsheet
Scott “cut his teeth” running a team in operations at EarthLink Network in 1995, just as the tech boom was ramping up.  From there, he moved on to a pair of small security startups (Cylant Technology and then Cylant) that developed kernel-level behavior-analytic security solutions for Linux.  First at Cylant and then at BAE Systems, he worked on DARPA's first three major defensive cyber-security programs.  After BAE Systems, he went to The PTR Group where most of his work focused on the intersection of cyber-security and embedded systems.  While at The PTR Group, he also helped develop and teach immersive, hands-on classes in IPv6 networking and Linux virtualization.  Now at Smartsheet, Mr. Wimer focuses on tackling scalability challenges and other “hard problems.”
Director of Infrastructure, Smartsheet.com