One of our customers recently reported that MongoDB’s diagnostic metrics—collected via FTDC (Full-Time Diagnostic Data Capture)—had stopped updating. As a result, no metrics were being collected, either through diagnostic data files or the getDiagnosticData
command. Key metrics such as uptime were no longer progressing. While the cluster remained fully operational, the lack of observability introduced a risk for both incident response and performance troubleshooting.
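A quick way to confirm this symptom is to check whether FTDC files are still being written and whether getDiagnosticData returns a fresh sample. A minimal sketch, assuming a dbPath of /var/lib/mongo (adjust to your configuration) and mongosh available on the host:

# FTDC files live under <dbPath>/diagnostic.data; stale modification times mean FTDC stopped writing
$ ls -lt /var/lib/mongo/diagnostic.data/ | head -5

# The start timestamp of the most recent in-memory sample should keep advancing
# (the exact field layout of the result can vary by MongoDB version)
$ mongosh --quiet --eval 'db.adminCommand({ getDiagnosticData: 1 }).data.start'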
What followed was a detailed, multi-layer investigation led by Percona Support. This post walks through the full diagnostic process, the reasoning behind each step, and how a workaround was developed alongside the customer’s OS team.
Step 0: Context – A hidden, intermittent problem
This issue surfaced intermittently across multiple environments running Oracle Enterprise Linux (OEL) 7, 8, and 9, with MongoDB versions 6 and 7. There was no consistent trigger, clear pattern, or immediate way to reproduce the problem.
We enabled detailed logging using:
db.setLogLevel(5)
But even with increased verbosity, the MongoDB logs showed nothing unusual. Similarly, OS-level logs—including messages, syslog, and dmesg—were completely clean. No I/O errors, no warnings, no tracebacks.
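Note that verbosity level 5 is extremely noisy, so once the capture window is over it is worth dropping the level back to the default:

db.setLogLevel(0)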
This made the issue especially challenging to diagnose. With no visible errors, no reproducible test case, and no red flags in standard logs, we had to look deeper—into thread states, kernel behavior, and low-level system interactions that wouldn’t normally show up in application or sysadmin logs.
Step 1: Identifying the frozen thread
The first sign something was wrong came from the metrics—FTDC wasn’t updating. To begin investigating, we needed to locate:
- The MongoDB process ID (PID)
- The thread ID (TID) associated with the FTDC thread
We started with:
$ ps aux | grep mongod
This gave us the main PID of the running MongoDB process. Example output:
mongodb 923911 0.7 1.2 4724504 198112 ? Ssl Apr13 14:03 /usr/bin/mongod --config /etc/mongod.conf
With PID 923911 identified, we then listed all threads under that process to find the one labeled ftdc:
$ ps -L -p 923911 -o pid,tid,stat,pcpu,comm,wchan
   PID    TID STAT %CPU COMMAND WCHAN
923911 923911 Ssl   0.7 mongod  -
923911 923934 DLl   0.2 ftdc    autofs_mount_wait
...
Here, we saw:
- 923911 is the MongoDB process.
- 923934 is the thread named ftdc, stuck in state DLl (uninterruptible sleep, multi-threaded, with pages locked into memory).
- WCHAN showed it was waiting on autofs_mount_wait, indicating an issue at the filesystem layer.
Next, we verified that this thread was indeed stuck by inspecting its full stat line:
$ cat /proc/923911/task/923934/stat
923934 (ftdc) D 1 923910 923910 0 -1 4210752 31209 0 93 0 247870 351806 0 0 20 0 213 0 2318303019 19163037696 3851779 ...
The D state confirms that it’s in uninterruptible sleep, typically due to a blocked I/O operation, often at the kernel level.
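As a side check, it can be useful to see whether any other tasks on the host are stuck the same way. A small, illustrative one-liner along these lines lists every thread currently in uninterruptible sleep, together with its wait channel:

# List all threads currently in D (uninterruptible) state and their wait channel
$ ps -eL -o pid,tid,stat,wchan:32,comm | awk 'NR==1 || $3 ~ /^D/'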
To dig deeper, we extracted the kernel call stack for the FTDC thread:
$ cat /proc/923911/task/923934/stack
[<0>] autofs_wait+0x319/0x79d
[<0>] autofs_mount_wait+0x49/0xf0
[<0>] autofs_d_automount+0xe6/0x1f0
[<0>] follow_managed+0x17f/0x2e0
[<0>] lookup_fast+0x135/0x2a0
[<0>] walk_component+0x48/0x300
[<0>] path_lookupat.isra.43+0x79/0x220
[<0>] filename_lookup.part.58+0xa0/0x170
[<0>] user_statfs+0x43/0xa0
[<0>] __do_sys_statfs+0x20/0x60
[<0>] do_syscall_64+0x5b/0x1b0
[<0>] entry_SYSCALL_64_after_hwframe+0x61/0xc6
This trace confirmed that the thread was blocked inside statfs() while attempting to access an automounted filesystem, specifically /misc. We verified that /misc was indeed an autofs mount point by inspecting the active mounts with the mount -l command.
The statfs() system call retrieves statistics about a mounted filesystem, such as total and available space, block size, and the number of free inodes. For example, when MongoDB’s FTDC collects metrics, it might internally trigger statfs() on paths like /misc to gather storage data, even if those paths aren’t directly used by the database.
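Because the block happens inside statfs() itself, any tool that issues the same system call against the stuck path will hang in exactly the same way, which makes it easy to confirm the problem outside MongoDB. A rough check, assuming /misc is the affected automount (run it in the background so your shell is not lost if the call blocks):

# stat -f issues the same statfs() call that FTDC relies on
$ stat -f /misc &
$ STATPID=$!
$ sleep 5
# If statfs() is blocked, the process is still there, in D state, on an autofs wait channel
$ ps -o pid,stat,wchan:32,comm -p "$STATPID"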
The combination of thread state, wait channel, and stack trace pointed us to a deeper issue in the Linux autofs subsystem. From here, we moved into root cause analysis.
Step 2: Why this affects FTDC
MongoDB’s FTDC thread collects filesystem metrics using statfs() across all mounted paths, not just the ones used by the database.
Here are examples from the diagnostic output:
systemMetrics.mounts./misc/XXX.available;447080562688;...
systemMetrics.mounts./misc/XXX.free;447080562688;...
systemMetrics.mounts./misc/XXX.capacity;4294967296000;...
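The same per-mount values are visible in a live sample as well. A quick way to eyeball them from the shell, with the caveat that the exact field layout can vary by MongoDB version:

mongo> var sample = db.adminCommand({ getDiagnosticData: 1 }).data
mongo> sample.systemMetrics.mounts   // one entry per mounted filesystem, including /misc paths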
Key insight:
- /misc is a NAS mount containing binaries, not a MongoDB resource.
- However, FTDC scans it like any other mount.
- If /misc becomes unresponsive, the ftdc thread may block as a result.
In this case, the issue was triggered by upgrading or downgrading the autofs package during OS patching.
Autofs cannot safely restart when active mounts are present, as it depends on open file descriptors to perform control operations. These descriptors may no longer be available after a restart. Further details are documented in the Linux kernel source.
This explains why a database thread got stuck waiting on a file system unrelated to its config or operation.
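If you suspect this chain of events, the package history on the host can confirm whether autofs was touched during the patch window. On OEL/RHEL-family systems, something along these lines works:

# Installed autofs package, ordered by install time
$ rpm -q --last autofs

# Package manager transactions involving autofs (use yum on OEL 7)
$ dnf history list autofs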
Step 3: Trying to fix – What did not work
An early recommendation was to try this:
$ sudo mount -a
This attempts to mount any pending filesystems, with the hope that it would “unstick” the blocked automount.
Result: It didn’t help. Once a thread is in D state due to a stuck syscall, it cannot recover unless the underlying system call completes—which didn’t happen here.
Step 4: Trying to fix – What actually worked (OS-level intervention)
After deeper testing, the customer’s OS SME developed a working process to prevent the issue during system patching.
Here’s the exact command sequence that worked:
# Unmount paths that trigger FTDC stalls
$ umount -l /misc/XXX
$ umount -l /misc/XXX
$ umount -l /misc

# Perform OS maintenance

# Restart autofs
$ systemctl restart autofs
Why this worked:
- Lazy unmounts (umount -l) remove the mount point from the namespace without waiting on in-use file handles
- Restarting autofs ensured clean remounting afterward
- This prevented the FTDC thread from touching stale or unresponsive mount points during OS patching
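After the restart, it is worth confirming that autofs came back cleanly and that the expected maps and mount points are in place before declaring the maintenance window done, for example:

# autofs service should be active with no errors
$ systemctl status autofs --no-pager

# Active autofs-managed mount points
$ mount -l -t autofs

# Automount maps as autofs currently sees them
$ automount -m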
Step 5: Risk mitigation strategy
Since the core issue is systemic, the safest operational pattern is to disable FTDC before performing any autofs-related work:
// Step 1 – disable FTDC collection
mongo> db.adminCommand({ setParameter: 1, diagnosticDataCollectionEnabled: false })

// Step 2 – perform mount/unmount, upgrades, or downgrades

// Step 3 – re-enable FTDC
mongo> db.adminCommand({ setParameter: 1, diagnosticDataCollectionEnabled: true })
Important: If the FTDC thread is already stuck, this won’t help. A MongoDB restart is required to clear the uninterruptible sleep.
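Before any autofs-related work, a quick pre-check shows whether you are already in that situation and therefore need a mongod restart rather than just the parameter change:

# A healthy ftdc thread shows state S (sleeping); a stuck one shows D with an autofs wait channel
$ ps -L -C mongod -o pid,tid,stat,wchan:32,comm | grep -E 'COMMAND|ftdc'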
Step 6: Feature requests and next steps
We escalated the issue upstream and submitted the following:
- SERVER-103431 – Allow disabling individual FTDC metric groups like filesystem stats
- SERVER-103432 – Add watchdog logic to restart FTDC when it becomes unresponsive
- PSMDB-1649 – Percona tracking to improve diagnostics and prevention
Final thoughts
This MongoDB FTDC freeze wasn’t a bug—it was a kernel-level autofs issue manifesting through system-level telemetry. The freeze didn’t affect user queries, but it degraded monitoring and made troubleshooting harder.
What made a difference:
- Clear, deep diagnostics across layers
- Collaboration between Percona and the customer’s OS team
- Willingness to test and validate real solutions
And if a thread goes into D state and doesn’t recover—you’re not alone. These issues often stem from outside the database: a stalled NFS mount, an autofs bug, or a system call like statfs() getting stuck on an unresponsive path. They’re subtle, hard to trace, and easy to misdiagnose—but the impact is real: blocked threads, frozen metrics, or degraded performance.
In this case, the Percona Support team worked across layers—MongoDB, OS, and kernel—to pinpoint the root cause and guide a reliable fix. This kind of cross-stack problem-solving is something we deal with regularly, and each case like this helps us support future customers faster and more effectively. If you’re seeing similar symptoms, we can help you track them down.