One of our customers recently reported that MongoDB’s diagnostic metrics—collected via FTDC (Full-Time Diagnostic Data Capture)—had stopped updating. As a result, no metrics were being collected, either through diagnostic data files or the getDiagnosticData command. Key metrics such as uptime were no longer progressing. While the cluster remained fully operational, the lack of observability introduced a risk for both incident response and performance troubleshooting.

What followed was a detailed, multi-layer investigation led by Percona Support. This post walks through the full diagnostic process, the reasoning behind each step, and how a workaround was developed alongside the customer’s OS team.

Step 0: Context – A hidden, intermittent problem

This issue surfaced intermittently across multiple environments running Oracle Enterprise Linux (OEL) 7, 8, and 9, with MongoDB versions 6 and 7. There was no consistent trigger, clear pattern, or immediate way to reproduce the problem.

We enabled detailed logging by raising the server's log verbosity, along these lines:
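    # raise overall verbosity and, specifically, the FTDC log component
    # (levels and exact invocation are illustrative; db.setLogLevel() works too)
    mongosh --eval 'db.adminCommand({ setParameter: 1, logComponentVerbosity: { verbosity: 2, ftdc: { verbosity: 5 } } })'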

But even with increased verbosity, the MongoDB logs showed nothing unusual. Similarly, OS-level logs—including messages, syslog, and dmesg—were completely clean. No I/O errors, no warnings, no tracebacks.

This made the issue especially challenging to diagnose. With no visible errors, no reproducible test case, and no red flags in standard logs, we had to look deeper—into thread states, kernel behavior, and low-level system interactions that wouldn’t normally show up in application or sysadmin logs.

Step 1: Identifying the frozen thread

The first sign something was wrong came from the metrics—FTDC wasn’t updating. To begin investigating, we needed to locate:

  • The MongoDB process ID (PID)
  • The thread ID (TID) associated with the FTDC thread

We started with a simple process lookup, something like:
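    # locate the main mongod process ID (any equivalent lookup works)
    pgrep -x mongod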

This gave us the main PID of the running MongoDB process. Example output:
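    923911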

With PID 923911 identified, we then listed all threads under that process to find the one labeled ftdc:
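    # one way to list every thread (LWP) of the process, with its state and kernel wait channel
    ps -T -p 923911 -o pid,spid,comm,stat,wchan:32

    # output trimmed to the thread of interest
        PID    SPID COMMAND         STAT WCHAN
     923911  923934 ftdc            DLl  autofs_mount_wait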

Here, we saw:

  • 923911 is the MongoDB process.
  • 923934 is the thread named ftdc, stuck in state DLl (D = uninterruptible sleep, L = pages locked in memory, l = multi-threaded).
  • WCHAN showed it was waiting on autofs_mount_wait, indicating an issue at the filesystem layer.

Next, we verified that this thread was indeed stuck by inspecting its full stat line:
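    # the thread's stat line in procfs; the third field is the state
    cat /proc/923911/task/923934/stat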

The D state confirms that the thread is in uninterruptible sleep, typically caused by an I/O or filesystem operation blocked inside the kernel.

To dig deeper, we extracted the kernel call stack for the FTDC thread:
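    # kernel call stack of the blocked thread (requires root)
    cat /proc/923911/task/923934/stack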

This trace confirmed that the thread was blocked within statfs() while attempting to access an automounted filesystem—specifically /misc. We verified that /misc was indeed an autofs mount point by inspecting the active mounts using the mount -l command.
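For example:

    # any filter over the mount list will do; the /misc trigger mount shows up with type autofs
    mount -l | grep autofs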

The statfs() system call retrieves statistics about a mounted filesystem, such as total and available space, block size, and the number of free inodes. For example, when MongoDB’s FTDC collects metrics, it might internally trigger statfs() on paths like /misc to gather storage data—even if those paths aren’t directly used by the database.
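This also makes the hang easy to reproduce from a shell: any tool that ends up calling statfs() against the affected mount should block in exactly the same way. For example:

    # both of these issue statfs()/statvfs() under the hood and will hang if the automount is wedged
    stat -f /misc
    df /misc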

The combination of thread state, wait channel, and stack trace pointed us to a deeper issue in the Linux autofs subsystem. From here, we moved into root cause analysis.

Step 2: Why this affects FTDC

MongoDB’s FTDC thread collects filesystem metrics using statfs() across all mounted paths, not just ones used by the database.

The diagnostic output showed entries for /misc right alongside the filesystems MongoDB actually uses; the live sample can be pulled at any time for inspection:
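    # the most recent FTDC sample, straight from the server (the output is large)
    mongosh --quiet --eval 'db.adminCommand({ getDiagnosticData: 1 })'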

Key insight:

  • /misc is a NAS mount containing binaries—not a MongoDB resource.
  • However, FTDC scans it like any other mount.
  • If /misc becomes unresponsive, the ftdc thread may block as a result. In this case, the issue was triggered by upgrading or downgrading the autofs package during OS patching.

Autofs cannot safely restart when active mounts are present, as it depends on open file descriptors to perform control operations. These descriptors may no longer be available after a restart. Further details are documented in the Linux kernel source:

https://github.com/torvalds/linux/blob/c64d3dc69f38a08d082813f1c0425d7a108ef950/Documentation/filesystems/autofs-mount-control.rst

This explains why a database thread got stuck waiting on a filesystem entirely unrelated to its configuration or operation.

Step 3: Trying to fix – What did not work

An early recommendation was to try something along these lines:
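    # mount anything listed in /etc/fstab that is not yet mounted
    mount -a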

This attempts to mount any pending filesystems, with the hope that it would “unstick” the blocked automount.

Result: It didn’t help. Once a thread is in D state due to a stuck syscall, it cannot recover unless the underlying system call completes—which didn’t happen here.


Step 4: Trying to fix – What actually worked, OS-level intervention

After deeper testing, the customer’s OS SME developed a working process to prevent the issue during system patching.

In outline, the sequence that worked looks like this, run before the autofs package is touched during patching:
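    # shown for /misc; repeat for any other active automounts
    umount -l /misc             # lazy unmount: detach the path without waiting on open file handles
    systemctl restart autofs    # restart the automounter so the path remounts cleanly on next access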

Why this worked:

  • Lazy unmounts (umount -l) remove the mount point from the namespace without waiting on in-use file handles
  • Restarting autofs ensured clean remounting afterward
  • This kept the FTDC thread from touching stale or unresponsive mount points during OS patching

Step 5: Risk mitigation strategy

Since the core issue is systemic, the safest operational pattern is to disable FTDC before performing any autofs-related work:
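    # one way to toggle collection at runtime: turn FTDC off for the duration of the autofs work...
    mongosh --eval 'db.adminCommand({ setParameter: 1, diagnosticDataCollectionEnabled: false })'

    # ...and back on once the mounts are healthy again
    mongosh --eval 'db.adminCommand({ setParameter: 1, diagnosticDataCollectionEnabled: true })'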

Important: If the FTDC thread is already stuck, this won’t help. A MongoDB restart is required to clear the uninterruptible sleep.

Step 6: Feature requests and next steps

We escalated the issue upstream and submitted the following:

  • SERVER-103431 – Allow disabling individual FTDC metric groups like filesystem stats
  • SERVER-103432 – Add watchdog logic to restart FTDC when it becomes unresponsive
  • PSMDB-1649 – Percona tracking to improve diagnostics and prevention

Final thoughts

This MongoDB FTDC freeze wasn’t a bug—it was a kernel-level autofs issue manifesting through system-level telemetry. The freeze didn’t affect user queries, but it degraded monitoring and made troubleshooting harder.

What made a difference:

  • Clear, deep diagnostics across layers
  • Collaboration between Percona and the customer’s OS team
  • Willingness to test and validate real solutions

And if a thread goes into D state and doesn’t recover—you’re not alone. These issues often stem from outside the database: a stalled NFS mount, an autofs bug, or a system call like statfs() getting stuck on an unresponsive path. They’re subtle, hard to trace, and easy to misdiagnose—but the impact is real: blocked threads, frozen metrics, or degraded performance.

In this case, the Percona Support team worked across layers—MongoDB, OS, and kernel—to pinpoint the root cause and guide a reliable fix. This kind of cross-stack problem-solving is something we deal with regularly, and each case like this helps us support future customers faster and more effectively. If you’re seeing similar symptoms, we can help you track them down.


As your data grows, MongoDB’s hidden costs and limitations—whether Community Edition, Enterprise Advanced, or Atlas—can hold you back. Percona for MongoDB provides an enterprise-grade software and services alternative that is free from high costs, lock-in, or deployment limits.

 

7 Reasons to Switch from MongoDB to Percona for MongoDB
