Percona Live: Data Performance Conference 2016

April 18-21, 2016

Santa Clara, California

Operational Buddhism: Building Reliable Services From Unreliable Components

 20 April 11:10 AM - 12:00 PM @ Ballroom E
50 minutes conference
Operations and Management
High Availability


The rise of utility computing has revolutionized the way organizations think about infrastructure and back-end serving systems compared to the "olden days" of dedicated physical data centers. However, in the final analysis, success is still driven by meeting your SLAs. If services are up and sufficiently performant, you win. If not, you lose.
In the traditional data center environment, fighting the uptime battle was typically driven by a philosophy I call "Operational Materialism" (OM). The primary goal of OM is preventing failures at the infrastructure layer. Mechanisms for doing so are plentiful and well understood, and many of them boil down to simply spending enough money to have at least N+1 of anything that might fail and cause significant downtime. Redundant power supplies, NIC bonding, replicated SANs, and hot-standby servers are some of the common artifacts of an OM world.
In the cloud, however, Operational Materialism cannot succeed. Although the typical cloud provider tends to be reliable as a whole, there is no guarantee that any individual virtual instance will not randomly or intermittently drop off the network or be terminated outright. Yet we still need to keep our services up and running and meet our SLAs, so we need a different mindset that accounts for the fundamentally opaque and ephemeral nature of the public cloud.
In this talk, I will present an alternative to OM, a worldview that I refer to as "Operational Buddhism" (OB). Like traditional Buddhism, OB has Four Noble Truths:
1. Cloud-based servers can fail at any time for any reason.
2. Trying to prevent this server failure is an endless source of suffering for DBAs and SREs alike.
3. Accepting the impermanence of individual servers, we can focus on designing systems that are failure-resilient, rather than failure-resistant.
4. We can escape the cycle of suffering and create a better experience for our customers, users, and colleagues.
To illustrate these concepts with concrete examples, I will discuss how configuration management, automation, and service discovery help us to practice Operational Buddhism at Pinterest for both stateful (MySQL, HBase) and stateless (web) services. Moreover, as our path is not the only road to infrastructure enlightenment, I'll also talk about some of the roads not taken, including the debate over Infrastructure-as-a-Service (IaaS) vs. Platform-as-a-Service (PaaS). Only minimal prior knowledge of MySQL and HBase will be assumed; basic familiarity with some of the different offerings from Amazon Web Services (RDS, EC2, S3) will also be helpful.


Ernie Souhrada

Database Engineer and Bit Wrangler, Pinterest


Ernie is a database engineer on the SRE team at Pinterest, where his current focus is on improving the performance and operational efficiency of a petabyte-scale hybrid deployment of MySQL, HBase, and Redis. Ernie has worked in almost every aspect of information technology, from network engineering and software development to systems administration and information security. Ernie's current areas of interest include artificial intelligence, data analytics, and neuroscience. He holds a BS in mathematics and a BA in political science from Arizona State University.
