One of our customers recently reported that MongoDB’s diagnostic metrics—collected via FTDC (Full-Time Diagnostic Data Capture)—had stopped updating. As a result, no metrics were being collected, either through diagnostic data files or the getDiagnosticData command. Key metrics such as uptime were no longer progressing. While the cluster remained fully operational, the lack of observability introduced a risk for both incident response and performance troubleshooting.

What followed was a detailed, multi-layer investigation led by Percona Support. This post walks through the full diagnostic process, the reasoning behind each step, and how a workaround was developed alongside the customer’s OS team.

Step 0: Context – A hidden, intermittent problem

This issue surfaced intermittently across multiple environments running Oracle Enterprise Linux (OEL) 7, 8, and 9, with MongoDB versions 6 and 7. There was no consistent trigger, clear pattern, or immediate way to reproduce the problem.

We enabled detailed logging by raising the server's log verbosity, along these lines:
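    # raise overall verbosity and, specifically, the FTDC log component
    # (levels and exact invocation are illustrative; db.setLogLevel() works too)
    mongosh --eval 'db.adminCommand({ setParameter: 1, logComponentVerbosity: { verbosity: 2, ftdc: { verbosity: 5 } } })'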

But even with increased verbosity, the MongoDB logs showed nothing unusual. Similarly, OS-level logs—including messages, syslog, and dmesg—were completely clean. No I/O errors, no warnings, no tracebacks.

This made the issue especially challenging to diagnose. With no visible errors, no reproducible test case, and no red flags in standard logs, we had to look deeper—into thread states, kernel behavior, and low-level system interactions that wouldn’t normally show up in application or sysadmin logs.

Step 1: Identifying the frozen thread

The first sign something was wrong came from the metrics—FTDC wasn’t updating. To begin investigating, we needed to locate:

  • The MongoDB process ID (PID)
  • The thread ID (TID) associated with the FTDC thread

We started with a simple process lookup, something like:
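    # locate the main mongod process ID (any equivalent lookup works)
    pgrep -x mongod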

This gave us the main PID of the running MongoDB process. Example output:
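    923911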

With PID 923911 identified, we then listed all threads under that process to find the one labeled ftdc:
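    # one way to list every thread (LWP) of the process, with its state and kernel wait channel
    ps -T -p 923911 -o pid,spid,comm,stat,wchan:32

    # output trimmed to the thread of interest
        PID    SPID COMMAND         STAT WCHAN
     923911  923934 ftdc            DLl  autofs_mount_wait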

Here, we saw:

  • 923911 is the MongoDB process.
  • 923934 is the thread named ftdc, stuck in state DLl (D = uninterruptible sleep, L = pages locked in memory, l = multi-threaded).
  • WCHAN showed it was waiting on autofs_mount_wait, indicating an issue at the filesystem layer.

Next, we verified that this thread was indeed stuck by inspecting its full stat line:
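    # the thread's stat line in procfs; the third field is the state
    cat /proc/923911/task/923934/stat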

The D state confirms that the thread is in uninterruptible sleep, typically caused by an I/O or filesystem operation blocked inside the kernel.

To dig deeper, we extracted the kernel call stack for the FTDC thread:
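    # kernel call stack of the blocked thread (requires root)
    cat /proc/923911/task/923934/stack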

This trace confirmed that the thread was blocked within statfs() while attempting to access an automounted filesystem—specifically /misc. We verified that /misc was indeed an autofs mount point by inspecting the active mounts using the mount -l command.
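For example:

    # any filter over the mount list will do; the /misc trigger mount shows up with type autofs
    mount -l | grep autofs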

The statfs() system call retrieves statistics about a mounted filesystem, such as total and available space, block size, and the number of free inodes. For example, when MongoDB’s FTDC collects metrics, it might internally trigger statfs() on paths like /misc to gather storage data—even if those paths aren’t directly used by the database.
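This also makes the hang easy to reproduce from a shell: any tool that ends up calling statfs() against the affected mount should block in exactly the same way. For example:

    # both of these issue statfs()/statvfs() under the hood and will hang if the automount is wedged
    stat -f /misc
    df /misc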

The combination of thread state, wait channel, and stack trace pointed us to a deeper issue in the Linux autofs subsystem. From here, we moved into root cause analysis.

Step 2: Why this affects FTDC

MongoDB’s FTDC thread collects filesystem metrics using statfs() across all mounted paths, not just ones used by the database.

The diagnostic output showed entries for /misc right alongside the filesystems MongoDB actually uses; the live sample can be pulled at any time for inspection:
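    # the most recent FTDC sample, straight from the server (the output is large)
    mongosh --quiet --eval 'db.adminCommand({ getDiagnosticData: 1 })'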

Key insight:

  • /misc is a NAS mount containing binaries—not a MongoDB resource.
  • However, FTDC scans it like any other mount.
  • If /misc becomes unresponsive, the ftdc thread may block as a result. In this case, the issue was triggered by upgrading or downgrading the autofs package during OS patching.

Autofs cannot safely restart when active mounts are present, as it depends on open file descriptors to perform control operations. These descriptors may no longer be available after a restart. Further details are documented in the Linux kernel source:

https://github.com/torvalds/linux/blob/c64d3dc69f38a08d082813f1c0425d7a108ef950/Documentation/filesystems/autofs-mount-control.rst

This explains why a database thread got stuck waiting on a filesystem entirely unrelated to its configuration or operation.

Step 3: Trying to fix – What did not work

An early recommendation was to try something along these lines:
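    # mount anything listed in /etc/fstab that is not yet mounted
    mount -a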

This attempts to mount any pending filesystems, with the hope that it would “unstick” the blocked automount.

Result: It didn’t help. Once a thread is in D state due to a stuck syscall, it cannot recover unless the underlying system call completes—which didn’t happen here.


Step 4: Trying to fix – What actually worked, OS-level intervention

After deeper testing, the customer’s OS SME developed a working process to prevent the issue during system patching.

In outline, the sequence that worked looks like this, run before the autofs package is touched during patching:
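    # shown for /misc; repeat for any other active automounts
    umount -l /misc             # lazy unmount: detach the path without waiting on open file handles
    systemctl restart autofs    # restart the automounter so the path remounts cleanly on next access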

Why this worked:

  • Lazy unmounts (umount -l) remove the mount point from the namespace without waiting on in-use file handles
  • Restarting autofs ensured clean remounting afterward
  • This kept the FTDC thread from touching stale or unresponsive mount points during OS patching

Step 5: Risk mitigation strategy

Since the core issue is systemic, the safest operational pattern is to disable FTDC before performing any autofs-related work:
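    # one way to toggle collection at runtime: turn FTDC off for the duration of the autofs work...
    mongosh --eval 'db.adminCommand({ setParameter: 1, diagnosticDataCollectionEnabled: false })'

    # ...and back on once the mounts are healthy again
    mongosh --eval 'db.adminCommand({ setParameter: 1, diagnosticDataCollectionEnabled: true })'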

Important: If the FTDC thread is already stuck, this won’t help. A MongoDB restart is required to clear the uninterruptible sleep.

Step 6: Feature requests and next steps

We escalated the issue upstream and submitted the following:

  • SERVER-103431 – Allow disabling individual FTDC metric groups like filesystem stats
  • SERVER-103432 – Add watchdog logic to restart FTDC when it becomes unresponsive
  • PSMDB-1649 – Percona tracking to improve diagnostics and prevention

Final thoughts

This MongoDB FTDC freeze wasn’t a bug—it was a kernel-level autofs issue manifesting through system-level telemetry. The freeze didn’t affect user queries, but it degraded monitoring and made troubleshooting harder.

What made a difference:

  • Clear, deep diagnostics across layers
  • Collaboration between Percona and the customer’s OS team
  • Willingness to test and validate real solutions

And if a thread goes into D state and doesn’t recover—you’re not alone. These issues often stem from outside the database: a stalled NFS mount, an autofs bug, or a system call like statfs() getting stuck on an unresponsive path. They’re subtle, hard to trace, and easy to misdiagnose—but the impact is real: blocked threads, frozen metrics, or degraded performance.

In this case, the Percona Support team worked across layers—MongoDB, OS, and kernel—to pinpoint the root cause and guide a reliable fix. This kind of cross-stack problem-solving is something we deal with regularly, and each case like this helps us support future customers faster and more effectively. If you’re seeing similar symptoms, we can help you track them down.


As your data grows, MongoDB’s hidden costs and limitations—whether Community Edition, Enterprise Advanced, or Atlas—can hold you back. Percona for MongoDB provides an enterprise-grade software and services alternative that is free from high costs, lock-in, or deployment limits.

 

7 Reasons to Switch from MongoDB to Percona for MongoDB
