September 2, 2014

Before every release: A glimpse into Percona XtraDB Cluster CI testing

I spoke last month at linux.conf.au 2014 in Perth, Australia, and one of my sessions focused on the “Continuous Integration (CI) testing of Percona XtraDB Cluster (PXC)” at the Developer, Testing, Release and CI miniconf.

Here is the video of the presentation:

Here is the presentation itself:

Below is a rough transcript of the talk:

This talk covered the continuous integration testing of Galera-based clusters, specifically Percona XtraDB Cluster (PXC), which is based on Galera. Because of the nature of the cluster, the existing testing procedures of MySQL cannot fully test it; newer methodologies are required, and are used, to uncover bugs.

The QA automation of PXC primarily involves:

a) Jenkins

  • Jenkins primarily handles the triggering of jobs, starting from a VCS (bzr) check-in to the build clone, and culminating in tests and RPM/DEB builds.
  • In some cases a manual trigger is used; in others, SCM polling is used.
  • Build blocking is also used to enforce implicit dependencies between jobs, for instance when the Galera library needs to be embedded.
  • Parameterized triggers are used to cut down on slow VCS clones and to pass parameters to subsequent jobs. Build plumbing and fork/join of jobs are also used.

b) Sysbench

  • Here it is used for both benchmarking and testing. Requests are dispatched simultaneously to multiple nodes to uncover latent bugs in synchronous replication, especially with transactions (rollbacks and commits) and with conflicts. This also helps with the instrumentation of latency.
  • A history of measurements from previous jobs is maintained for time-series graphing of results. This helps in identifying performance regressions.
  • The MTR test suite is re-used for creating instances.
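
As an illustration of how a history of measurements can flag a performance regression, here is a minimal sketch. It is not the logic used in the PXC jobs: the function name, the three-sigma threshold, the 1% noise floor and the sample throughput numbers are all assumptions made for the example.

```python
from statistics import mean, stdev


def is_regression(history, latest, threshold_sigma=3.0, min_runs=5):
    """Return True if `latest` is far below the mean of `history`.

    `history` holds throughput values (e.g. transactions/sec) from
    previous jobs. All thresholds are illustrative assumptions.
    """
    if len(history) < min_runs:
        return False  # not enough data to judge
    mu = mean(history)
    sigma = stdev(history)
    # Use at least 1% of the mean as the noise floor so that a very
    # stable history does not turn tiny fluctuations into alarms.
    floor = max(sigma, 0.01 * mu)
    return latest < mu - threshold_sigma * floor


# Hypothetical throughput numbers from six earlier runs.
history = [1510.0, 1495.0, 1502.0, 1488.0, 1507.0, 1499.0]
print(is_regression(history, 1496.0))  # within normal variation
print(is_regression(history, 1200.0))  # well below the history
```

A check like this only looks at one metric; in practice latency percentiles would likely need the comparison reversed (higher is worse).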

c) Random Query Generator (RQG)

  • This has again proved valuable in PXC QA. Combinations testing, in particular, is used to test different combinations of options, some of which may not come up in general testing but may be used in production by someone.
  • As with sysbench, this also stresses multiple nodes at the same time, but to a much higher degree. A slightly modified RQG, ported from the MariaDB RQG Galera extension (https://mariadb.com/kb/en/rqg-extensions-for-mariadb-features/), is being used. Various kinds of statements and transactions are tested and, most importantly, because they run concurrently, bugs surface much more easily. Several MDL- and DDL-related bugs (with TOI) have been found and fixed with this method.
  • With combinations testing, since the number of combinations can get astronomically large, pruning of infeasible combinations is also done.
  • It has also been extended to collect backtraces when the server is hard deadlocked (when the deadlock reporter also fails). This has been quite valuable for bugs where obtaining backtraces has been vital.
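
The idea of combinations testing with pruning can be sketched as follows. This is not RQG's own combinations mechanism, just a small standalone illustration: the option grid and the two feasibility rules are assumptions chosen for the example, not the ones used in the PXC jobs.

```python
from itertools import product

# Illustrative option grid (values picked for the example only).
OPTIONS = {
    "binlog_format": ["ROW", "STATEMENT"],
    "wsrep_slave_threads": [1, 8],
    "innodb_autoinc_lock_mode": [1, 2],
}


def feasible(combo):
    """Assumed pruning rules, for illustration only."""
    # Rule 1: only test row-based replication.
    if combo["binlog_format"] != "ROW":
        return False
    # Rule 2: with parallel applier threads, require interleaved
    # auto-increment locking.
    if combo["wsrep_slave_threads"] > 1 and combo["innodb_autoinc_lock_mode"] != 2:
        return False
    return True


def combinations(options, keep=feasible):
    """Yield every option combination that passes the `keep` filter."""
    names = list(options)
    for values in product(*(options[name] for name in names)):
        combo = dict(zip(names, values))
        if keep(combo):
            yield combo


for combo in combinations(OPTIONS):
    print(" ".join(f"--{name}={value}" for name, value in combo.items()))
```

Here the full grid has 8 combinations and the filter keeps 3. With a realistic number of options the full product grows multiplicatively, which is why pruning early matters.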

d) SST Testing

  • SST stands for State Snapshot Transfer. This is more of an end-to-end test: it starts a node, loads it with data, starts another node that joins via SST from the first node, and then makes sure the data is consistent (by checksumming). This is done with several different combinations of configurations, which tests the SST mechanism itself while also testing the server with these combinations. So, success of these tests indicates that a cluster will start and work as intended (thus, no blue smoke!).
  • This re-uses the PXB test suite with Xtrabackup.
  • It also serves to test PXC on different platforms (13×2 so far).
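
The consistency check at the end of the SST test can be pictured with the sketch below. The talk does not say which checksumming method is used, so this is only an assumption-laden illustration: rows are plain Python tuples standing in for data fetched from each node, and the helper names are hypothetical.

```python
import hashlib


def table_checksum(rows):
    """Order-independent checksum of a collection of rows.

    Rows are serialized with repr() and sorted first, so two nodes
    that return the same rows in a different order still match.
    """
    digest = hashlib.sha256()
    for line in sorted(repr(row) for row in rows):
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def nodes_consistent(rows_by_node):
    """Return (all_equal, checksums) for a {node_name: rows} mapping."""
    checksums = {node: table_checksum(rows) for node, rows in rows_by_node.items()}
    return len(set(checksums.values())) == 1, checksums


# Hypothetical data: the donor, a joiner that received everything,
# and a joiner that is missing one row after the transfer.
donor = [(1, "a"), (2, "b"), (3, "c")]
joiner_ok = [(3, "c"), (1, "a"), (2, "b")]
joiner_bad = [(1, "a"), (2, "b")]

print(nodes_consistent({"node1": donor, "node2": joiner_ok})[0])
print(nodes_consistent({"node1": donor, "node2": joiner_bad})[0])
```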

e) Replication Testing

  • This was written to test upgrades between major versions, 5.5 and 5.6
  • Intended to test rolling upgrade
  • Re-uses earlier test components – MTR, sysbench, SST – since it involves starting two nodes, upgrading one of them, and running a replication stream between them
  • Overlaps with other tests in coverage

f) Other techniques are also used, such as the use of lock_wait_timeout (which defaults to one year) to catch MDL bugs, and the use of release and debug builds differently in tests, since a bug may manifest differently in each (for instance, an assertion/crash in a debug build may be a server hang in a release build).

g) In future, we intend to have:

  • Testing at a higher level with Chef, Puppet etc., intending to test packaging
  • Also, to test distro idiosyncrasies
  • Automated handling of test results with extraction and analysis of logs and backtraces
  • Also, currently, we don’t test for externalities like network jitter (something that can be simulated with netem). Doing this requires moving from (or cloning) the node-per-process model to a node-per-slave (Jenkins slave) model. This can, for instance, help with debugging issues associated with the EVS (extended virtual synchrony) layer of Galera.
    • Incidentally, this was also one of the questions after the talk, since a few other *aaS providers tend to bring up separate Jenkins slaves for testing, where they test for features associated with virtualization, for instance (as in the case of OpenStack).
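
For readers unfamiliar with netem, here is a rough sketch of how a test harness might build the tc command that adds delay and jitter to an interface. This is not something the PXC jobs do today. The helper names, the interface name and the delay values are assumptions, and the helper only prints the command unless dry_run is disabled (running it for real needs root).

```python
import subprocess


def netem_command(interface, delay_ms, jitter_ms, loss_pct=None):
    """Build the argv for a `tc ... netem` rule adding delay with jitter."""
    cmd = ["tc", "qdisc", "add", "dev", interface, "root", "netem",
           "delay", f"{delay_ms}ms", f"{jitter_ms}ms"]
    if loss_pct is not None:
        # Optionally drop a percentage of packets as well.
        cmd += ["loss", f"{loss_pct}%"]
    return cmd


def apply_netem(interface, delay_ms, jitter_ms, loss_pct=None, dry_run=True):
    """Print the command (dry run) or execute it (requires root)."""
    cmd = netem_command(interface, delay_ms, jitter_ms, loss_pct)
    if dry_run:
        print(" ".join(cmd))
    else:
        subprocess.run(cmd, check=True)
    return cmd


# Hypothetical values: 100 ms delay with 20 ms of jitter on eth0.
apply_netem("eth0", 100, 20)
```

Because the rule is attached to a network interface, it affects everything on that host, which is one reason a node-per-slave layout fits this kind of test better than several nodes sharing one machine.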

To conclude, as you may notice, there is a certain degree of overlap between these tests. This is intentional: if one type of test misses a bug, another catches it, which makes hard-to-catch bugs easier to detect.

About Raghavendra

I am Raghavendra Prabhu and I am currently working at Percona Inc. as Product Lead of Percona XtraDB Cluster (PXC). You can visit my personal blog at http://blog.wnohang.net.

Comments

  1. I’ve used cgroups to create a very slow PXC node like this:
    mkdir /sys/fs/cgroup/cpu/slowdown
    cd /sys/fs/cgroup/cpu/slowdown
    echo 1000 > cpu.cfs_quota_us
    for task in $(ls /proc/$(pgrep -x mysqld)/task);
    do echo $task > tasks; done
