At Yelp, Cassandra, our NoSQL database of choice, has been deployed on AWS compute (EC2) and Auto Scaling Groups (ASGs), backed by block storage (EBS). This deployment model has been robust over the years, though it presents its own set of challenges. To make our Cassandra deployment more resilient and to reduce the engineering toil that comes with our constantly growing infrastructure, we are abstracting Cassandra deployments further away from EC2 by running them on Kubernetes and orchestrating them with our Cassandra Operator. We are also leveraging Yelp’s PaaSTA for consistent abstractions and for features such as fleet autoscaling with Clusterman and Spot fleets, both of which are useful for an efficient datastore deployment.
In this talk, we delve into the architecture of our Cassandra Operator and the multi-region, multi-AZ clusters it manages, along with the strategies we have in place for safe rollouts and zero-downtime migration. We will also discuss the challenges we faced along the way and the design tradeoffs we made. Finally, we will share our plans for the future.