In my talk at Percona Live (download the presentation), I spoke about how we can use Percona Kubernetes Operators to deploy our own Database-as-a-Service, based on fully open source components and independent of any particular cloud provider.
Today I want to mention an important tool that I use to test our Operators: ChaosMesh, which is part of the CNCF and recently reached GA with version 1.0.
ChaosMesh deploys chaos engineering experiments in Kubernetes, which lets you test how resilient a deployment is against different kinds of failures.
Obviously, this tool is important for Kubernetes database deployments, and I believe it can also be very useful for testing your application deployment, to understand how the application will perform and handle different failures.
ChaosMesh allows you to emulate:
- Pod Failure: kill a pod or force an error on a pod (see the sketch after this list)
- Network Failure: network partitioning, network delays, network corruption
- IO Failure: IO delays and IO errors
- Stress emulation: stress memory and CPU usage
- Kernel Failure: return errors on system calls
- Time skew: emulate time drift on pods
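As a quick illustration of the first item, here is a minimal sketch of a PodChaos experiment that periodically kills one pod. This is not from the original post: the name pxc-pod-kill, the target pod, and the schedule are illustrative, and the layout follows the same v1alpha1 API used in the NetworkChaos example later in this post.

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pxc-pod-kill          # illustrative name, not from the original post
spec:
  action: pod-kill            # kill the selected pod; its controller will recreate it
  mode: one                   # pick one pod out of those matched by the selector
  selector:
    pods:
      pxc:                    # namespace of the target pods
        - cluster2-pxc-2
  scheduler:
    cron: "@every 600s"       # repeat the experiment every 600 seconds
```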
For our Percona Kubernetes Operators, I found Network Failure especially interesting, as clusters that rely on network communication should provide enough resiliency against network issues.
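Before running any experiments, ChaosMesh itself has to be installed in the cluster. A minimal sketch, assuming you follow the project's quick start for the 1.0 release (the script URL, version, and the chaos-testing namespace are as documented at the time of writing; check the ChaosMesh docs for the current procedure):

```shell
# install ChaosMesh 1.0 into the cluster (quick-start script from the ChaosMesh docs)
curl -sSL https://mirrors.chaos-mesh.org/v1.0.0/install.sh | bash

# verify that the ChaosMesh components are up (chaos-testing is the default namespace)
kubectl get pods -n chaos-testing
```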
Let’s review an example of how we can emulate a network failure on one of the pods. Assume we have cluster2 running:
```
kubectl get pods
NAME                 READY   STATUS    RESTARTS   AGE
cluster2-haproxy-0   2/2     Running   1          12d
cluster2-haproxy-1   2/2     Running   2          12d
cluster2-haproxy-2   2/2     Running   2          12d
cluster2-pxc-0       1/1     Running   0          12d
cluster2-pxc-1       1/1     Running   0          12d
cluster2-pxc-2       1/1     Running   0          12d
```
And we will isolate cluster2-pxc-1 from the rest of the cluster by using the following Chaos Experiment:
```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: pxc-network-delay
spec:
  action: partition # the specific chaos action to inject
  mode: one # the mode to run chaos action; supported modes are one/all/fixed/fixed-percent/random-max-percent
  selector: # pods where to inject chaos actions
    pods:
      pxc: # namespace of the target pods
        - cluster2-pxc-1
  direction: to
  target:
    selector:
      pods:
        pxc: # namespace of the target pods
          - cluster2-pxc-0
    mode: one
  duration: "3s"
  scheduler: # scheduler rules for the running time of the chaos experiments about pods.
    cron: "@every 1000s"
---
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: pxc-network-delay2
spec:
  action: partition # the specific chaos action to inject
  mode: one # the mode to run chaos action; supported modes are one/all/fixed/fixed-percent/random-max-percent
  selector: # pods where to inject chaos actions
    pods:
      pxc: # namespace of the target pods
        - cluster2-pxc-1
  direction: to
  target:
    selector:
      pods:
        pxc: # namespace of the target pods
          - cluster2-pxc-2
    mode: one
  duration: "3s"
  scheduler: # scheduler rules for the running time of the chaos experiments about pods.
    cron: "@every 1000s"
```
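To start the experiment, apply the manifest like any other Kubernetes resource. The file name below is just an example; you can inspect or stop the experiment with the usual kubectl verbs against the NetworkChaos objects that ChaosMesh's CRDs define:

```shell
# start the chaos experiment (file name is illustrative)
kubectl apply -f pxc-network-partition.yaml

# check the status of one of the experiment objects
kubectl describe networkchaos pxc-network-delay

# remove the experiment when done
kubectl delete -f pxc-network-partition.yaml
```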
This will isolate the pod cluster2-pxc-1 for three seconds. Let's see what happens with the workload we directed at the cluster2-pxc-0 node (the output is from the sysbench-tpcc benchmark):
```
1041,56,1232.46,36566.42,16717.16,17383.33,2465.93,90.78,4.99,0.00
1042,56,1305.42,35841.03,16295.74,16934.44,2610.84,71.83,6.01,0.00
1043,56,1084.73,30647.99,14056.49,14422.06,2169.45,68.05,5.99,0.00
1044,56,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
1045,56,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
1046,56,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
1047,56,129.00,4219.97,1926.99,2034.98,258.00,4683.57,0.00,0.00
1048,56,939.41,25800.68,11706.55,12215.31,1878.82,960.30,2.00,0.00
1049,56,1182.09,34390.72,15708.49,16318.05,2364.18,66.84,4.00,0.00
```
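For context, a workload like this can be driven with sysbench-tpcc pointed at the cluster2-pxc-0 node. A rough sketch of such an invocation is below; the connection details, database name, thread count, and scale are placeholders, not the exact values used for the run above:

```shell
# example only: drive a sysbench-tpcc workload against cluster2-pxc-0
# (host, credentials, database name, and scale are placeholders)
./tpcc.lua --db-driver=mysql \
  --mysql-host=cluster2-pxc-0.cluster2-pxc --mysql-user=root --mysql-password=secret \
  --mysql-db=sbt \
  --threads=56 --tables=10 --scale=100 \
  --time=3600 --report-interval=1 run
```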
And the log from the cluster2-pxc-1 pod:
```
2020-11-05T17:36:27.962719Z 0 [Warning] WSREP: Failed to report last committed 133737, -110 (Connection timed out)
2020-11-05T17:36:29.962975Z 0 [Warning] WSREP: Failed to report last committed 133888, -110 (Connection timed out)
2020-11-05T17:36:30.243902Z 0 [Note] WSREP: (11fdd640, 'ssl://0.0.0.0:4567') turning message relay requesting on, nonlive peers: ssl://192.168.66.9:4567 ssl://192.168.71.201:4567
2020-11-05T17:36:31.161485Z 0 [Note] WSREP: SSL handshake successful, remote endpoint ssl://192.168.66.9:34760 local endpoint ssl://192.168.61.137:4567 cipher: ECDHE-RSA-AES256-GCM-SHA384 compression: none
2020-11-05T17:36:31.162325Z 0 [Note] WSREP: (11fdd640, 'ssl://0.0.0.0:4567') connection established to 0008bac8 ssl://192.168.66.9:4567
2020-11-05T17:36:31.162694Z 0 [Note] WSREP: (11fdd640, 'ssl://0.0.0.0:4567') reconnecting to 448e265d (ssl://192.168.71.201:4567), attempt 0
2020-11-05T17:36:31.174019Z 0 [Note] WSREP: SSL handshake successful, remote endpoint ssl://192.168.71.201:4567 local endpoint ssl://192.168.61.137:47252 cipher: ECDHE-RSA-AES256-GCM-SHA384 compression: none
2020-11-05T17:36:31.176521Z 0 [Note] WSREP: SSL handshake successful, remote endpoint ssl://192.168.71.201:56892 local endpoint ssl://192.168.61.137:4567 cipher: ECDHE-RSA-AES256-GCM-SHA384 compression: none
2020-11-05T17:36:31.177086Z 0 [Note] WSREP: (11fdd640, 'ssl://0.0.0.0:4567') connection established to 448e265d ssl://192.168.71.201:4567
2020-11-05T17:36:31.177289Z 0 [Note] WSREP: (11fdd640, 'ssl://0.0.0.0:4567') connection established to 448e265d ssl://192.168.71.201:4567
2020-11-05T17:36:34.244970Z 0 [Note] WSREP: (11fdd640, 'ssl://0.0.0.0:4567') turning message relay requesting off
```
We can see that the node lost communication for three seconds and then recovered.
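If you want to double-check that the node is back in the cluster after the experiment, one option is to query the Galera status variables from inside one of the pods. A minimal sketch, assuming you have the MySQL root credentials (the password below is a placeholder):

```shell
# placeholder credentials: check the Galera cluster size from inside a pod
kubectl exec -it cluster2-pxc-0 -- mysql -uroot -proot_password \
  -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';"
```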
There is a variable, evs.suspect_timeout, with a default of five seconds, which defines how long the nodes will wait before forming a new quorum without the affected node. So let's see what happens if we isolate cluster2-pxc-1 for nine seconds: