How Percona Support handles bugs
Peter Zaitsev
One of the great values of a Percona Support contract is that we provide bug fixes for covered software, not just advice on how to use it. This is the capability most customers are least likely to have in-house, because it requires a team with code knowledge plus build and test infrastructure – something only a few companies can afford to invest in.
There is a lot of misunderstanding about bugs. What is a bug? What is a feature? What is a repeatable bug? How will Percona troubleshoot the bug? In this post I will answer some of these questions.
Bugs vs. Features ⎼ Many people have a hard time telling a bug from a feature, that is, recognizing when software was designed to work in a certain way that happens to be unwelcome. There is a gray area here, but you should expect that some of the things you consider bugs will be classified as requests to change behavior (features) and handled as such.
Unfixable Bugs ⎼ There are some behaviors that any sane person would call a bug, but which arise from design limitations or oversights that are impossible to fix in the current GA version without introducing changes that would destabilize it. Such bugs will need to be fixed in the next major GA release, or sometimes even further in the future. Some bugs are not bugs at all but rather the result of design tradeoffs. These can’t be “fixed” unless different design tradeoffs are chosen.
Workaround ⎼ There are going to be unexpected behaviors, unfixable bugs and bugs that take a while to fix, so your first practical response to running into a bug is often to find a workaround that does not expose it. The Percona Support team will help find a workaround that causes minimal impact to your business, but be prepared: changes to the application, deployed version, schema or configuration will often be required.
Emergencies ⎼ When you have an emergency, our focus is on restoring the system to working order. In a complex system a bug fix often cannot be delivered in a short period of time, which typically means finding a workaround.
Bug Turnaround ⎼ It is not possible to guarantee the turnaround on a bug fix, as all bugs are different. Some bugs are rather trivial, and we might be able to provide a hotfix 24 hours after we have a repeatable test case. In other cases the bug might be complicated and take weeks of engineering to fix, or might even be impossible to fix in the current GA version.
Verified Bug Fixes ⎼ When you submit a bug, we have to verify whether it actually is a bug. In many cases it might be intended behavior; in others, a user mistake. It is also possible that the behavior happened once and can’t be repeated. Having a repeatable test case that reveals the bug is the best way to get it fixed quickly. You might be able to create a repeatable test case yourself, or our support team might be able to help you create one.
Sporadic Bugs ⎼ These are very hard bugs that happen sporadically over a period of time. For example, you might have a system crash once every 3 months with no way to repeat it. The cause of such bugs can be very complicated; for example, a buffer overrun in one piece of code can cause corruption and a crash in another place hours later. There are a number of diagnostic tools for such bugs, but generally they take quite a while to resolve. In addition, without a repeatable test case, it is often impossible to verify that the proposed fix actually resolves the bug.
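To see why a fix for a sporadic bug is so hard to verify, here is a rough back-of-the-envelope sketch in Python. It assumes, purely for illustration, that crashes arrive independently at a constant average rate (a Poisson process); real crashes may not behave this way. The function names and the 95% confidence threshold are hypothetical choices made for this example, not anything Percona prescribes.

```python
import math


def prob_no_crash(months_observed: float, mean_months_between_crashes: float) -> float:
    """Chance of seeing zero crashes by luck alone if the bug is NOT fixed.

    Assumes crashes follow a Poisson process (an illustrative assumption).
    """
    return math.exp(-months_observed / mean_months_between_crashes)


def quiet_months_needed(mean_months_between_crashes: float, confidence: float = 0.95) -> float:
    """Crash-free months required before a quiet period is unlikely to be luck."""
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return -mean_months_between_crashes * math.log(1.0 - confidence)


if __name__ == "__main__":
    # The example from the text: roughly one crash every 3 months.
    # Chance of a quiet 3 months even though the bug is still present:
    print(f"{prob_no_crash(3, 3):.2f}")
    # Crash-free months needed before we are 95% confident the fix worked:
    print(f"{quiet_months_needed(3):.1f}")
```

Under these assumptions, three crash-free months would happen by chance about 37% of the time even without a fix, and it takes roughly nine crash-free months to reach 95% confidence. A repeatable test case collapses that wait to a single run.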
Environmental Bugs ⎼ Some bugs are caused by what can be called your environment: hardware bugs or incompatibilities, a build that is not quite compatible with your version of the operating system, operating system bugs, etc. In some cases we can point very clearly to the environment problem. In others we can only suspect that the environment is the issue, and we may ask you to check whether the bug also happens in another environment, such as different hardware or a different OS installation.
Hot Fixes ⎼ As our default policy we fix bugs in the next release of our software so the fix can go through the full QA cycle, be properly documented, etc. If you have implemented a workaround and can wait until the next release, this is the best choice. If not, with the Percona Enterprise or Premiere Support contract, we can provide you with a hotfix: a special build of the version of the software you’re running with only the bug fix of interest applied. Hotfixes are especially helpful if you’re not looking to do a full software upgrade spanning several revisions but want to validate the fix with the minimum possible changes. A hotfix might also differ from the final bug fix that goes into the GA release. With hotfixes, our goal is to provide you with a working solution faster. Afterward we may optimize or re-architect the code, come up with better option names, etc.
Bug Diagnostics ⎼ Depending on the nature of the bug there are multiple tools that our support team will use for diagnostics – finding a way to fix the bug. To set expectations correctly, it can be a very involved process, where you might need to provide a lot of information or try things on your system, such as:
- Test case. If you have a test case that the Percona team can repeat to trigger the bug, the diagnostic work on the customer side is done and internal debugging starts. Getting to that point might not be easy.
- If we have a crash that we can’t repeat on our systems, we will often ask you to enable core file generation or run the program under a debugger so we can get more information when the crash happens.
- If the problem is related to performance, you should be ready to gather both MySQL information such as EXPLAIN, status counters, information from performance schema, etc., and system-level information such as pt-pmp output, pt-stalk, oprofile, perf, etc.
- If the problem is a “deadlock,” we often need information from gdb about the full state of the system. Information from processlist, performance_schema, SHOW ENGINE INNODB STATUS can also be helpful.
- It is very helpful to have a test system on which you can repeat the problem in your own environment and experiment without impacting production. This is not possible in all cases.
- Sometimes, for hard-to-repeat bugs, we will need to run a special diagnostics build that provides us with additional debug information. In other cases, we might need to run a debug build or do a run under valgrind or other software designed to catch bugs. These often have a large performance impact, so it is good to see how your workload can be scaled down to make this feasible.
- Depending on your environment we might need to log in to troubleshoot your bug, or we might request that you upload the data needed to repeat the bug in our lab (assuming it is not too sensitive). In cases where direct login is not possible, we can help you arrive at a repeatable test case via phone, chat, or email. Using screen sharing can also be very helpful.
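The information requests above can be kept as a simple checklist so you know what to have ready before opening a ticket. The sketch below is only an illustration: the grouping keys ("crash", "performance", "deadlock") and the function name `checklist` are hypothetical labels chosen for this example, and the items are just the ones named in the list above, not an official or complete Percona procedure.

```python
from typing import Dict, List

# Items are taken from the list above. The grouping keys are illustrative
# labels for this sketch, not official Percona problem categories.
DIAGNOSTICS: Dict[str, List[str]] = {
    "crash": [
        "core file from the crash",
        "output from running the program under a debugger",
    ],
    "performance": [
        "EXPLAIN output for the affected queries",
        "status counters",
        "performance schema information",
        "pt-pmp output",
        "pt-stalk output",
        "oprofile or perf profiles",
    ],
    "deadlock": [
        "gdb output showing the full state of the system",
        "processlist",
        "performance_schema information",
        "SHOW ENGINE INNODB STATUS output",
    ],
}


def checklist(problem_type: str) -> List[str]:
    """Return a copy of the items to gather for the given kind of problem."""
    key = problem_type.strip().lower()
    if key not in DIAGNOSTICS:
        known = ", ".join(sorted(DIAGNOSTICS))
        raise ValueError(f"unknown problem type {problem_type!r}; expected one of: {known}")
    return list(DIAGNOSTICS[key])


if __name__ == "__main__":
    for item in checklist("deadlock"):
        print(f"[ ] {item}")
```

Treat this as a starting point only; the support engineer handling your case will tell you what is actually needed.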
Bugs and Non-Percona Software ⎼ Percona Support covers some software not produced by Percona. For open source software that is not exempt from bug fix support, we will provide a custom build with the bug fix and also submit the suggested fix to the software maintainer for possible inclusion in the next release. For example, if we find a bug in the MySQL Community Edition, we will pass our suggested fix to the MySQL Engineering team at Oracle. For software that is not open source, such as Amazon RDS, we can help with creating and submitting a repeatable test case and with finding a workaround, but we can’t provide a fix as we do not have access to the source code.
In Conclusion ⎼ When I think about software bugs, I find some good parallels with human “bugs” (diseases). Some issues are trivial to diagnose and the fix is obvious. Others might be very hard to diagnose; I guess many of us have been in a situation where we visit doctor after doctor and describe our symptoms, and they run tests but still can’t figure out what’s wrong. Even once a diagnosis is made, it is not a given that a “fix” is available or feasible, and while a complete solution is preferred, sometimes we have to settle for “managing” the disease, which is our parallel to implementing changes and settling for a workaround. So, just like human doctors, we can’t guarantee that we will get to the root of every problem, or that if we do, we will be able to fix every one of them. However, as with having good doctors, having us on your team will maximize the chance of a successful resolution.