ChaosMesh to Create Chaos in Kubernetes

In my talk at Percona Live (download the presentation), I spoke about how we can use Percona Kubernetes Operators to deploy our own Database-as-a-Service, based on fully open source components and independent of any particular cloud provider.

Today I want to mention an important tool that I use to test our Operators: ChaosMesh, which is part of the CNCF and recently reached GA with its 1.0 release.

ChaosMesh runs chaos engineering experiments in Kubernetes deployments, which lets you test how resilient a deployment is against different kinds of failures.

Obviously, this tool is important for Kubernetes database deployments, and I believe it can also be very useful for testing your application deployment to understand how the application will perform and handle different failures.

ChaosMesh allows you to emulate:

  • Pod Failure: kill a pod or inject errors into a pod (a sample experiment manifest follows this list)
  • Network Failure: network partitioning, network delays, network corruption
  • IO Failure: IO delays and IO errors
  • Stress emulation: stress memory and CPU usage
  • Kernel Failure: return errors on system calls
  • Time skew: emulate time drift on pods
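
Experiments are described as Kubernetes custom resources and applied with kubectl. As a minimal sketch, a pod-kill experiment could look like the following; the name, label selector, and schedule are illustrative rather than taken from a real deployment:

    apiVersion: chaos-mesh.org/v1alpha1
    kind: PodChaos
    metadata:
      name: pod-kill-example
    spec:
      action: pod-kill             # "pod-failure" makes the pod error out instead
      mode: one                    # pick one random pod among the matches
      selector:
        labelSelectors:
          "app.kubernetes.io/instance": "cluster2"   # assumed label
      scheduler:
        cron: "@every 60s"         # repeat the kill every minute

Applied with kubectl apply -f, this would kill one random pod matching the label every minute.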

For our Percona Kubernetes Operators, I found Network Failure especially interesting, as clusters that rely on network communication should provide enough resiliency against network issues.

Let’s review an example of how we can emulate a network failure on one of the pods. Assume we have cluster2 running:
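
For instance, the pods can be listed with kubectl; the label selector here is an assumption based on the labels the Percona operator typically applies:

    kubectl get pods -l app.kubernetes.io/instance=cluster2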

And we will isolate cluster2-pxc-1 from the rest of the cluster by using the following Chaos Experiment:
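
As a sketch, assuming the cluster runs in a namespace named pxc and that we want to cut cluster2-pxc-1 off from its two peers (the experiment name and the namespace are assumptions), the manifest could look like this:

    apiVersion: chaos-mesh.org/v1alpha1
    kind: NetworkChaos
    metadata:
      name: pxc-network-partition
    spec:
      action: partition            # drop traffic entirely rather than delay it
      mode: one
      selector:
        pods:
          pxc:                     # namespace -> pod names (assumed namespace)
            - cluster2-pxc-1
      direction: both              # block traffic in both directions
      target:
        mode: all
        selector:
          pods:
            pxc:
              - cluster2-pxc-0
              - cluster2-pxc-2
      duration: "3s"               # keep the partition up for three seconds
      scheduler:
        cron: "@every 1000s"       # long interval, effectively a one-shot run

Applying it with kubectl apply -f starts the experiment.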

This will isolate the pod cluster2-pxc-1 for three seconds. Let’s see what happens with the workload we directed at the cluster2-pxc-0 node (the output is from a sysbench-tpcc benchmark):
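
The workload itself could be generated with the sysbench-tpcc scripts along these lines; the host, credentials, and scale values below are placeholders rather than the exact benchmark settings:

    ./tpcc.lua --mysql-host=cluster2-pxc-0.cluster2-pxc \
        --mysql-user=root --mysql-password="$PXC_ROOT_PASSWORD" \
        --mysql-db=sbtest --threads=16 --tables=10 --scale=10 \
        --time=300 --report-interval=1 run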

And the log from the cluster2-pxc-1 pod:
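
The log can be pulled with kubectl logs; the container name pxc is an assumption based on the operator’s pod layout:

    kubectl logs cluster2-pxc-1 -c pxc --tail=100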

We can see that the node lost communication for three seconds and then recovered.

There is a variable, evs.suspect_timeout, with a default of five seconds, which defines how long the nodes will wait before forming a new quorum without the affected node. So let’s see what will happen if we isolate cluster2-pxc-1 for nine seconds:
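
One way to do that is to reuse the same manifest with only the duration raised, for example with a merge patch against the experiment sketched above:

    kubectl patch networkchaos pxc-network-partition --type=merge \
        -p '{"spec":{"duration":"9s"}}'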