A Billion Messages a Day - Yelp's Real-time Data Pipeline
Yelp moved quickly to build out a comprehensive service-oriented architecture (SOA), and before long had over 100 data-owning production services. Distributing data across an organization creates a number of issues, particularly the cost of joining disparate data sources, which dramatically increases the complexity of bulk data applications. Straightforward solutions like bulk data APIs and shared data snapshots have significant drawbacks. Yelp's Data Pipeline makes it easier for these services to communicate with each other, provides a framework for real-time data processing, and facilitates high-performance bulk data applications - making large SOAs easier to work with. The Data Pipeline provides a series of guarantees that make it easy to create universal data producers and consumers that can be mashed up into interesting real-time data flows. In one of Yelp's more interesting applications, we created a tool that connects to our MySQL binary replication logs and publishes row-level changes into the pipeline, where they flow to a variety of targets, including Salesforce and Amazon Redshift, and are indexed for search.
Software Engineer, Yelp
Justin Cunningham is the technical lead for the Business Analytics and Metrics team at Yelp, principally working on scaling Yelp's data infrastructure to support over 100 million monthly unique visitors. Before Yelp, Justin worked at several small startups that he founded.