TAM Enterprise Experiences – Troubleshooting Methodology

One of the great things about being a Technical Account Manager (TAM, for short) at Percona is having the opportunity to see a wide variety of issues across many clients, and having the space to identify the common threads that bind many of them. The challenges vary as widely as the troubleshooting methods themselves, but being able to step back from a technical issue and look at the whole picture from the outside has proven incredibly useful. No matter how complex the problem, troubleshooting is often best served by going back to the simplest of questions. A broader, more holistic view makes it easier to home in on the actual problem, and in many cases you may discover that the wrong questions were being asked, at least initially.

In this TAM Enterprise Experiences blog post, I’d like to outline a few suggestions that simplify troubleshooting and triage by applying a “big picture” mentality when working through a given issue.

What’s Your Problem?

The first question to ask is: what, specifically, is the problem being addressed? Often this isn’t as obvious as you might first think. Sometimes the way the problem was described is misleading, so it helps to verify exactly what problem is being solved.

For instance, a common theme may be for an application team to report to the database team that the database has become slow and sluggish, impacting the overall user experience. This seems straightforward, and at first it seems the problem has already been identified. However, has the database actually slowed down based on historical metrics, or has application traffic increased to the point where the existing architecture can no longer keep up?

Both of these options could potentially have the same symptoms while having drastically different causes and solutions. Accurately defining the problem will ensure you don’t end up on a wild goose chase and can get straight to work on addressing the actual issue.
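The distinction above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the shape of the samples (queries per second paired with average latency), and the 20% tolerance are all hypothetical, and the numbers would come from whatever historical monitoring you already have.

```python
from statistics import mean


def classify_slowdown(baseline, current, tolerance=0.2):
    """Compare current samples against a historical baseline.

    Both arguments are lists of (queries_per_second, avg_latency_ms)
    tuples. The sample shape and the tolerance are illustrative
    assumptions, not values taken from any particular tool.
    """
    base_qps = mean(q for q, _ in baseline)
    base_lat = mean(l for _, l in baseline)
    cur_qps = mean(q for q, _ in current)
    cur_lat = mean(l for _, l in current)

    # A metric counts as "up" only if it exceeds the baseline by more
    # than the tolerance, so normal jitter is not flagged.
    qps_up = cur_qps > base_qps * (1 + tolerance)
    lat_up = cur_lat > base_lat * (1 + tolerance)

    if lat_up and qps_up:
        return "latency rose with traffic"
    if lat_up:
        return "latency rose at similar traffic"
    return "no clear regression"
```

The first result points toward capacity and architecture, the second toward something that changed inside the database itself. A real comparison would also want like-for-like time windows, such as the same hour on the same weekday.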

First Time?

Once the problem has been defined, the next question you should ask is, “has this happened before?” As is often the case in this industry, many problems are recurring, so knowing if you have faced this issue in the past and how it was dealt with then is critical information to have.

Answering this question often depends on how well institutional knowledge is documented within your organization, so make sure that resolved issues are clearly written up for future recurrences. This is often done within a ticketing system (JIRA, ServiceNow, Zendesk), and while that does work, I would still recommend writing more formal internal knowledge-base articles for any major issues or changes, and making sure that these articles are kept up to date and searchable.

It Was Fine Yesterday

When was the last time the system was behaving normally? If this is an ongoing issue, knowing when the problem started is critical in determining what changed that could have caused it. If you’re unsure exactly when the issue happened, how can you determine if a recent code release might be causing it? How would you determine if any schema or user changes happened immediately beforehand? Without precise timing, how would you determine if traffic patterns during the event are consistent with historical norms? Knowing when an issue occurred lets you examine the puzzle pieces on either side to get a better picture of what was happening at or near that time.

As you might imagine, answering this question accurately is made much simpler with good historical monitoring in place. With a tool such as Percona Monitoring and Management (PMM), it is often a simple matter to review the metrics and see visually the exact moment when things went wrong.
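For readers without a graph to look at, the same idea can be sketched in code. The example below is a minimal illustration, assuming the metric has already been exported as a list of (timestamp, value) samples. The function name, the baseline size, and the thresholds are hypothetical choices, and nothing here uses a PMM API.

```python
from statistics import mean, stdev


def find_onset(samples, baseline_size=5, threshold_sigmas=3.0, sustained=3):
    """Return the timestamp where a metric first leaves its normal range.

    samples is a list of (timestamp, value) tuples in time order. The
    first baseline_size samples are assumed to represent normal
    behavior. A deviation must last for `sustained` consecutive samples
    so that a single spike is not mistaken for the start of an incident.
    """
    baseline = [value for _, value in samples[:baseline_size]]
    limit = mean(baseline) + threshold_sigmas * stdev(baseline)

    run_start = None
    run_length = 0
    for ts, value in samples[baseline_size:]:
        if value > limit:
            if run_length == 0:
                run_start = ts  # remember where this run began
            run_length += 1
            if run_length >= sustained:
                return run_start
        else:
            run_length = 0  # deviation did not persist, start over
    return None
```

Requiring a sustained deviation is a deliberate trade-off: it reports the onset a few samples late in exchange for ignoring one-off blips.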

The Times, They Are a-Changing

Now that you know when the issue occurred, you should next ask, “what changed prior to this happening?” A very common occurrence is for a production code change or some other system or OS change to negatively impact the database, especially if robust change and usage testing wasn’t done beforehand. Databases don’t exist in a vacuum, so problems rarely appear out of thin air. In nearly all cases, you’ll find a change was made that either directly caused the issue, or at least contributed to it in some way, possibly escalating a less impactful problem to priority status.

Knowing what has changed will also be critical to assist in finding the root cause of an issue after it has been triaged, as often the change itself will be the root cause. If it is not the cause directly, knowing what changed will at least ensure that you aren’t looking in the wrong places and put you on track to quickly and confidently documenting the cause.
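Once the onset time is known, lining it up against a change log is mostly mechanical. The sketch below assumes a hypothetical change log held as (timestamp, description) pairs, for example exported from a deployment tool or ticketing system. The function name and the 24-hour lookback are illustrative assumptions.

```python
from datetime import datetime, timedelta


def changes_before(changes, incident_start, lookback=timedelta(hours=24)):
    """List changes made in the window leading up to an incident.

    changes is an iterable of (datetime, description) tuples. The result
    is ordered most recent first, since the change closest to the onset
    is usually the first one worth examining.
    """
    window_start = incident_start - lookback
    recent = [
        (when, what)
        for when, what in changes
        if window_start <= when <= incident_start
    ]
    return sorted(recent, key=lambda change: change[0], reverse=True)
```

A change that falls outside the window is not ruled out, since some changes only bite under later load, but the window gives a sensible place to start.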

May I Have Another?

The next question to tackle is whether the issue is repeatable. You now know what the issue is, what has been done previously, when the problem actually happened, and what changes were made prior to the issue. Can you identify the exact circumstances that led up to the issue, then document and test that scenario so that a reproducible test case is possible? This matters because with a reproducible test case you can collect metrics while the issue is actually occurring, whereas for a one-time issue, gathering relevant metrics will be difficult if not impossible.

This is also critical if you determine you need to pull in additional assistance for a given problem. Perhaps you have vendor support, or you engage a third-party support partner for your database environment (go Percona!). In these cases, being able to tell the support partner how to reproduce the issue allows them to begin troubleshooting directly, and avoids the back-and-forth walls of text that come from trying to describe the issue in a ticket without a reproducible test case.

Wrapping Up

While it may seem counterintuitive, it is very helpful to step back and ask these simple questions before you begin triaging an issue, regardless of its complexity. Perhaps you already know the answers to all of these questions, in which case you’ve confirmed your suspicions before proceeding, with no harm done. In many cases, however, stepping back to ask these simple questions can reframe an entire conversation, highlighting what is actually relevant and what is just background noise, and saving both time and effort.

