November 27, 2014

Percona Testing: InnoDB crash / recovery tests available

Not everyone knows this, but there are precious few InnoDB crash recovery tests available.

Some folks have noticed this and asked for something to be done about it, but unfortunately, no tests have been created for the main MySQL branch.

The MySQL at Facebook branch has a number of tests that are quite interesting.  They basically create a master-slave pair, subject the master to a transactional load, crash the master, restart it, then ensure the master and slave are in sync (once the test load is stopped).

The team at Percona has been known to tinker with InnoDB now and again, and was also very interested in having some tests to ensure things were working as expected.  To that end, I have created the innodbCrash suite for kewpie.
These tests follow the Facebook guys’ lead for test load + validation, but we use the random query generator for generating our transactions.  This is for two reasons:

  1. My attempts at copying the Facebook multi-threaded Python load-generation code seemed very, very slow
  2. I just like the randgen (it is fun!)

The tests require a debug build as they make use of the DBUG points noted in the crash test bug:

  • DBUG_EXECUTE_IF("crash_commit_before", DBUG_SUICIDE(););
  • DBUG_EXECUTE_IF("crash_commit_after_prepare", DBUG_SUICIDE(););
  • DBUG_EXECUTE_IF("crash_commit_after_log", DBUG_SUICIDE(););
  • DBUG_EXECUTE_IF("crash_commit_before_unlog", DBUG_SUICIDE(););
  • DBUG_EXECUTE_IF("crash_commit_after", DBUG_SUICIDE(););
  • DBUG_EXECUTE_IF("half_binlogged_transaction", DBUG_SUICIDE(););
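
For illustration, here is a minimal Python sketch of how a load generator might occasionally arm one of these points from a client session. It is not the randgen grammar (randgen is Perl); the function names and the 1% default probability are assumptions made up for this example, and it assumes every listed keyword can be armed through the debug variable in the same way.

```python
import random

# Crash points from the list above; each is a DBUG keyword in a debug build.
CRASH_POINTS = [
    "crash_commit_before",
    "crash_commit_after_prepare",
    "crash_commit_after_log",
    "crash_commit_before_unlog",
    "crash_commit_after",
    "half_binlogged_transaction",
]


def arm_statement(point):
    """Build the SQL that arms one crash point for the current session."""
    if point not in CRASH_POINTS:
        raise ValueError("unknown crash point: %s" % point)
    return "SET SESSION debug='d,%s'" % point


def maybe_arm(rng, probability=0.01):
    """Return an arming statement with the given probability, else None.

    The 1% default is an illustrative assumption, not a value from the tests.
    """
    if rng.random() < probability:
        return arm_statement(rng.choice(CRASH_POINTS))
    return None
```

A load loop would call maybe_arm(rng) before each transaction and send the returned statement, if any, ahead of its regular DML.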

What happens is that the randgen grammars have a small chance of hitting a rule that will issue a SET SESSION debug="d,crash_commit_*".  The test will then continue to issue DML until the server hits that debug crash point.  We can ensure that the randgen actually did encounter a server crash by examining the contents of kewpie/workdir/bot0/log/randgen.out.
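
If you would rather not eyeball the log, a check along these lines could be scripted. This is only a sketch: it looks for the standard MySQL client messages for errors 2013 and 2006, and whether randgen.out contains exactly these strings is an assumption. The function names are hypothetical.

```python
import re

# Client-side error texts reported when the server dies mid-session
# (MySQL client errors 2013 and 2006). That randgen.out records them in
# this exact form is an assumption.
CRASH_PATTERNS = [
    re.compile(r"Lost connection to MySQL server"),
    re.compile(r"MySQL server has gone away"),
]


def find_crash_lines(lines):
    """Return (line_number, text) pairs for lines that look like a crash."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(p.search(line) for p in CRASH_PATTERNS):
            hits.append((number, line.rstrip("\n")))
    return hits


def log_shows_crash(path):
    """True if the log file at `path` contains at least one crash-like line."""
    with open(path, errors="replace") as handle:
        return bool(find_crash_lines(handle))
```

For example, log_shows_crash("kewpie/workdir/bot0/log/randgen.out") would be the call for the log mentioned above.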

 

I also took pains to ensure that the randgen workload was having an effect on the test tables between crash/recovery/validate cycles.  To examine this for yourself during a run, you can try the new option --test-debug.

I created this option because many tests have points where we'd like to see what is going on when things fail (or when we are changing things and want to make sure they still work), but where that output would be annoying in general test runs.  For example, we can view the comparisons of master-slave checksums after the master has crashed and restarted.
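
The comparison itself is simple. As a minimal sketch (not kewpie's actual code), assume each server's checksums have already been collected into a {table_name: checksum} mapping, for instance by running CHECKSUM TABLE on the master and on the slave; the function name and the table names in the usage note are made up.

```python
def diff_checksums(master, slave):
    """Compare two {table_name: checksum} mappings.

    Returns a sorted list of (table, master_value, slave_value) for every
    table whose checksum differs or that exists on only one server.
    An empty list means the two servers agree.
    """
    mismatches = []
    for table in sorted(set(master) | set(slave)):
        master_value = master.get(table)
        slave_value = slave.get(table)
        if master_value != slave_value:
            mismatches.append((table, master_value, slave_value))
    return mismatches
```

Calling diff_checksums({"test.t1": 10}, {"test.t1": 11}) would report test.t1 as out of sync, and a test would fail whenever the returned list is non-empty.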

Finally, one of the most useful features of the randgen is its ability to use --seed values to change the data and/or queries for a test run.  Seeds make runs deterministic: each randgen run with a given seed should produce the same queries and data (there are unit tests to make sure this doesn't break, IIRC).  In the past, I have created separate variants of tests that take a --seed=time argument to give wider coverage.  However, this approach gets tiresome when one has many test cases (more to maintain!).
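
To show the idea rather than the randgen's implementation, here is a small Python sketch of seed-driven generation. It uses Python's random module as a stand-in for the randgen; the function names, the toy statement templates, and the table t1 are all assumptions for the example.

```python
import random
import time


def resolve_seed(value, now=time.time):
    """Turn a seed option into an integer; the word 'time' means use the clock."""
    if value == "time":
        return int(now())
    return int(value)


def generate_queries(seed, count=5):
    """Produce a reproducible list of toy statements from a seed."""
    rng = random.Random(seed)  # private generator, so global state is untouched
    templates = [
        "INSERT INTO t1 (a) VALUES (%d)",
        "UPDATE t1 SET a = %d WHERE id = 1",
        "DELETE FROM t1 WHERE a = %d",
    ]
    return [rng.choice(templates) % rng.randint(0, 1000) for _ in range(count)]
```

Two calls to generate_queries(1) return identical lists, which is what makes a failure reproducible, while resolve_seed("time") yields a fresh seed on each run for wider coverage. Logging the resolved seed is what lets a time-seeded failure be replayed later.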

My solution to this problem was the addition of another option, --randgen-seed.  By default this is 1, but one can also say --randgen-seed=time.  For those test cases that are written to take advantage of this (like our shiny new InnoDB crash tests!), it is very easy to shuffle data and queries for all tests:

./kewpie.py --suite=innodbCrash --basedir=/path/to/basedir --force --randgen-seed=time --repeat=3

So far, my tests have been passing without incident, but I also haven’t dug into partitions or blobs >: )

If you’d like to try these out yourself, they are in lp:kewpie (I changed the name of dbqp).  You’ll need the DBD::mysql Perl module for the randgen and MySQLdb for kewpie, and a debug build of the server, but other than that, they are ready to run out of the box.  The Percona development team will be adding Jenkins runs for these tests soon.

 
