October 24, 2014

Fishing with dynamite, brought to you by the randgen and dbqp

I tend to speak highly of the random query generator as a testing tool and thought I would share a story that shows how it can really shine. At our recent dev team meeting, we spent approximately 30 minutes of hack time to produce test cases for three rather hard-to-duplicate bugs. Of course, I would also like to think that the way we have packaged our randgen tests into unittest format for dbqp played some small part, but I might be mildly biased.

The best description of the randgen’s power comes courtesy of Andrew Hutchings – “fishing with dynamite”. This is a very apt metaphor for how the tool works – it can be quite effective for stressing a server and finding bugs, but it can also be quite messy, possibly even fatal if one is careless ; ) However, I am not writing this to share any horror stories, but glorious tales of bug hunting!

The randgen uses yacc-style grammar files that define a realm of possible queries (provided you did it right…the zen of grammar writing is a topic for another day). Doing this allows us to produce high volumes of queries that are hopefully interesting (see previous comment about grammar-writing-zen).
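To make the idea concrete, here is a toy sketch of what "a grammar defines a realm of possible queries" means. It is not the randgen itself (which is written in Perl and uses its own grammar file format); the rule names, table names, and column names below are all made up for illustration. Each rule lists its alternatives, and a query is produced by picking one alternative at random and expanding it recursively.

```python
import random

# Hypothetical toy grammar: rule name -> list of alternatives.
# Each alternative is a list of tokens; a token that is also a rule
# name gets expanded, anything else is emitted literally.
GRAMMAR = {
    "query": [["select"], ["update"]],
    "select": [["SELECT", "column", "FROM", "table", "WHERE", "column", "<", "value"]],
    "update": [["UPDATE", "table", "SET", "column", "=", "value"]],
    "table": [["t1"], ["t2"]],
    "column": [["col_int"], ["col_int_key"]],
    "value": [["1"], ["5"], ["100"]],
}


def expand(symbol, rng):
    """Recursively expand a grammar symbol into query text."""
    if symbol not in GRAMMAR:
        return symbol  # literal token
    alternative = rng.choice(GRAMMAR[symbol])
    return " ".join(expand(token, rng) for token in alternative)


def generate_queries(count, seed=1):
    """Produce `count` random queries; the same seed gives the same queries."""
    rng = random.Random(seed)
    return [expand("query", rng) for _ in range(count)]


for query in generate_queries(3):
    print(query)
```

Seeding the generator matters in practice: once a run shakes out a bug, you want to be able to replay the same stream of queries.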

It takes a certain amount of care to produce a grammar that is useful and interesting, but the gamble is that this effort will produce more interesting effects on the database than the hand-written queries that could be produced in similar time. This is especially useful when you aren’t quite sure where a problem is and are just trying to see what shakes out under a certain type of stress. Another win is that a well-crafted grammar can be used for a variety of scenarios. The transactional grammars that were originally written for testing Drizzle’s replication system have been reused many times (including for two of these bugs!).

This brings us to our first bug:
mysql process crashes after setting innodb_dict_size

The basics of this were that the server was crashing under load when innodb_dict_size_limit was set to a smaller value. In order to simulate the situation, Stewart suggested we use a transactional load against a large number of tables. We were able to make this happen in 4 easy steps:
1) Create a test case module that we can execute. All of the randgen test cases are structured similarly, so all we had to do was copy an existing test case and tweak our server options and randgen command line as needed.

2) Make an altered copy of the general percona.zz gendata file. This file is used by the randgen to determine the number, composition, and population of any test tables we want to use and generate them for us. As the original reporter indicated they had a fair number of tables:

The value in the ‘rows’ section tells the data generator to produce 50 tables, with sizes from 1 row to 50 rows.

3) Specify the server options. We wanted the server to hit similar limits as the original bug reporter, but we were working on a smaller scale.
To make this happen, we set the following options in the test case:

Granted, these are insanely small values, but this is a test and we’re trying to do horrible things to the server ; )

4) Set up our test_* method in our testcase class. This is all we need to specify in our test case:

The test is simply to ensure that the server remains up and running under a basic transactional load.

From there, we only need to use the following command to execute the test:
./dbqp.py --default-server-type=mysql --basedir=/path/to/Percona-Server --suite=randgen_basic innodbDictSizeLimit_test
This enabled us to reproduce the crash within 5 seconds.

The reason I think this is interesting is that we were unable to duplicate this bug otherwise. The combination of the randgen’s power and dbqp’s organization helped us knock this out with about 15 minutes of tinkering.
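For readers who have not seen this style of test, the shape of step 4 is roughly "run a load, then assert it finished cleanly". The following is only a minimal, self-contained sketch of that pattern using Python's standard unittest module. The helper run_load, the class name, and the stand-in command are hypothetical and are not dbqp's actual API; a real test case would launch the randgen against the test server instead of the placeholder command.

```python
import subprocess
import sys
import unittest


def run_load(command):
    """Run a load-generating command and return (exit_code, combined_output).

    Hypothetical helper for illustration, not a dbqp function.
    """
    proc = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return proc.returncode, proc.stdout


class ServerSurvivesLoadTest(unittest.TestCase):
    # Stand-in command so the sketch runs anywhere. In a real test this
    # would be the randgen command line for the transactional load.
    load_command = [sys.executable, "-c", "print('load finished')"]

    def test_server_survives_load(self):
        retcode, output = run_load(self.load_command)
        # A non-zero exit code means the load failed, for example because
        # the server went away; include the output to aid debugging.
        self.assertEqual(retcode, 0, output)
```

Saved to a file, this runs with python -m unittest. Passing the captured output as the assertion message means a failure report carries the load generator's own complaints along with it.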

Once we had a bead on this bug, we went on to try a couple of other bugs:

Crash when query_cache_strip_comments enabled

For this one, we only modified the grammar file to include this as a possible WHERE clause for SELECT queries:

The test value was taken from the original bug report.
Similar creation of a test case file + modifications resulted in another easily reproduced crash.
I will admit that there may be other ways to go about hitting that particular bug, but we *were* practicing with new tools and playing with dynamite can be quite exhilarating ; )

parallel option breaks backups and restores

For this bug, we needed to ensure that the server used --innodb_file_per_table and that we used Xtrabackup’s --parallel option. I also wanted to create multiple schemas, and we did so via a little randgen / python magic:

This gave us 7 schemas, all with 100 tables per schema (with rows 1-100). From here we take a backup with --parallel=50 and then try to restore it. These are basically the same steps we use in our basic_test from the xtrabackup suite. We just copied and modified the test case to suit our needs for this bug. With this setup, we see a crash / failure during the prepare phase of the backup. Interestingly, this only happens with this number of tables, schemas, and --parallel threads.
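The multi-schema setup is just a loop: run the data generator once per schema with the same table spec. Here is a rough sketch of that idea, not the code we actually used. The schema names, the spec path, and the connection defaults are placeholders, and the gendata.pl --spec / --dsn options are written from memory of the randgen command line, so treat them as assumptions and check them against your randgen checkout.

```python
# Hypothetical schema names; the post only says there were 7 schemas.
SCHEMAS = ["schema_%d" % i for i in range(1, 8)]

# One table per entry, sized from 1 row to 100 rows (100 tables per schema).
ROWS_PER_TABLE = list(range(1, 101))


def build_gendata_commands(schemas, spec_path, host="127.0.0.1", port=3306, user="root"):
    """Build one data-generation command line per schema.

    The gendata.pl flags used here are assumptions about the randgen CLI.
    """
    commands = []
    for schema in schemas:
        dsn = "dbi:mysql:host=%s:port=%d:user=%s:database=%s" % (host, port, user, schema)
        commands.append("./gendata.pl --spec=%s --dsn=%s" % (spec_path, dsn))
    return commands


if __name__ == "__main__":
    for command in build_gendata_commands(SCHEMAS, "path/to/spec.zz"):
        print(command)
    print("total tables:", len(SCHEMAS) * len(ROWS_PER_TABLE))
```

The sketch only prints the commands; each schema would need to exist on the server before its command is run.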

Not too shabby for about 30 minutes of hacking + explaining things, if I do say so myself. One of the biggest difficulties in fixing bugs comes from being able to recreate them reliably and easily. Between the randgen’s brutal ability to produce test data and queries and dbqp’s efficient test organization, we are now able to quickly produce complicated test scenarios and reproduce more bugs so our amazing dev team can fix them into oblivion : )
