With MongoDB 8.0, the database engine takes another step forward in performance optimization, particularly in how it manages memory. One of the most impactful changes under the hood is the updated version of TCMalloc (Thread-Caching Malloc), which affects how the server allocates, caches, and reuses memory blocks.
For workloads with high concurrency, long-running queries, or mixed read/write patterns, the new TCMalloc can deliver noticeable performance gains.
This article explains what TCMalloc is, how it influences performance and memory fragmentation, and what differences you can expect before and after upgrading to MongoDB 8.0.
What is TCMalloc?
TCMalloc (Thread-Caching Malloc) is a memory allocator originally developed by Google. It replaces the standard malloc() and free() calls used by applications written in C/C++ with a faster, multithread-optimized alternative.
In simple terms, TCMalloc handles memory requests more efficiently by caching allocations per thread or per-CPU (default), avoiding the contention that can happen when multiple threads try to allocate or free memory at the same time.
TCMalloc can operate in one of two modes:
- (default) per-CPU caching, where TCMalloc maintains memory caches local to individual logical cores.
- per-thread caching, where TCMalloc maintains memory caches local to each application thread.
In both cases, these caches let TCMalloc serve most memory allocations and deallocations without taking locks. The result is low memory fragmentation and fewer system calls, which in most cases translates into better performance.
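To make the caching idea concrete, here is a toy model in Python. It is only a sketch of the general technique, not TCMalloc's actual implementation: the class names, the batch size, and the capacity are all hypothetical. A local cache (one per CPU or per thread) takes blocks from a shared, lock-protected pool in batches, so the lock is touched only when the local cache runs empty or overflows.

```python
import threading


class CentralFreeList:
    """Shared pool of fixed-size blocks; every access takes the lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._free = []
        self._next_block = 0
        self.lock_acquisitions = 0  # counts how often the slow path is hit

    def fetch_batch(self, count):
        with self._lock:
            self.lock_acquisitions += 1
            batch = []
            for _ in range(count):
                if self._free:
                    batch.append(self._free.pop())
                else:
                    batch.append(self._next_block)  # "mint" a new block id
                    self._next_block += 1
            return batch

    def return_batch(self, blocks):
        with self._lock:
            self.lock_acquisitions += 1
            self._free.extend(blocks)


class LocalCache:
    """Cache owned by a single CPU (or thread): no lock on the fast path."""

    def __init__(self, central, batch_size=32, capacity=64):
        self._central = central
        self._batch_size = batch_size  # hypothetical tuning values
        self._capacity = capacity
        self._blocks = []

    def allocate(self):
        if not self._blocks:  # slow path: refill from the shared pool
            self._blocks = self._central.fetch_batch(self._batch_size)
        return self._blocks.pop()  # fast path: purely local

    def free(self, block):
        self._blocks.append(block)
        if len(self._blocks) > self._capacity:  # slow path: spill the excess
            spill = self._blocks[self._batch_size:]
            del self._blocks[self._batch_size:]
            self._central.return_batch(spill)


central = CentralFreeList()
cache = LocalCache(central)
for _ in range(10_000):
    block = cache.allocate()
    cache.free(block)
print("lock acquisitions with a local cache:", central.lock_acquisitions)
```

In this model, 10,000 allocate/free pairs touch the shared lock only once, for the initial refill. Without the local cache, every allocation and every deallocation would go through the lock.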
TCMalloc in MongoDB 8.0
MongoDB already used TCMalloc as its default allocator in earlier versions, but version 8.0 includes a major upgrade to a newer implementation, aligned with upstream Google TCMalloc, that uses per-CPU caches instead of per-thread caches.
This brings improved multithreaded scalability, better release of memory back to the OS, and more predictable RSS (Resident Set Size) under heavy workloads.
The upgrade particularly benefits deployments where:
- Multiple shards or replica set members share the same host (not recommended unless you use containers).
- Large in-memory datasets (working sets) change frequently, and you see an increased number of evictions from the WiredTiger cache.
- Workloads generate many short-lived allocations (e.g., aggregation pipelines, complex queries, or analytical jobs).
Thanks in part to this under-the-hood change, MongoDB 8.0 is reported to be faster than version 7.0 for many use cases.
The official documentation says that MongoDB 8.0 introduces significant performance improvements over MongoDB 7.0, including, but not limited to:
- Up to 36% better read throughput.
- Up to 32% better performance for typical web applications.
- Up to 20% faster concurrent writes during replication.
These improvements probably do not come from TCMalloc alone, but it could be the main contributor.
Important change for Transparent Huge Pages (THP)
If you are a long-time user of MongoDB, you probably know that one of the most common best practices for OS tuning was to disable THP. Starting with MongoDB 8.0, the best practice is exactly the opposite: to benefit from the new TCMalloc, THP must now be enabled.
The following conditions must be met for TCMalloc to actually use the new per-CPU caches:
- Kernel version 4.18 or later
- THP enabled
- glibc rseq disabled: if another component, such as the glibc library, registers an rseq structure before TCMalloc, TCMalloc can’t use rseq. Without rseq, TCMalloc falls back to per-thread caches, as in the legacy TCMalloc version.
A few details about rseq (Restartable Sequences): rseq lets user-space code execute small critical sections that are guaranteed to run atomically on the same CPU, without using locks or syscalls in the fast path. This matters for operations that are extremely common and performance-critical, such as updating per-CPU counters, accessing per-CPU data structures, and the fast paths of memory allocators and schedulers. To benefit from it, TCMalloc must be the one to register the rseq structure.
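The three conditions above can be checked with a short script. The following Python sketch is one possible way to do it, under some assumptions: it reads the THP state from the standard sysfs file /sys/kernel/mm/transparent_hugepage/enabled, and it treats the glibc tunable glibc.pthread.rseq=0 in the GLIBC_TUNABLES environment variable as the way glibc's rseq registration is disabled. The function names are hypothetical. Note that the script inspects its own environment, so for a real deployment the variable has to be verified in the environment of the mongod process itself.

```python
import os
import platform
import re


def kernel_at_least(release, minimum=(4, 18)):
    """Return True if a release string like '6.8.0-45-generic' is >= minimum."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        return False
    return (int(match.group(1)), int(match.group(2))) >= minimum


def active_thp_mode(sysfs_text):
    """Extract the bracketed (active) value from e.g. 'always [madvise] never'."""
    match = re.search(r"\[(\w+)\]", sysfs_text)
    return match.group(1) if match else None


def glibc_rseq_disabled(tunables):
    """Return True if a GLIBC_TUNABLES string contains glibc.pthread.rseq=0."""
    return "glibc.pthread.rseq=0" in tunables.split(":")


if __name__ == "__main__":
    print("kernel >= 4.18:", kernel_at_least(platform.release()))
    try:
        with open("/sys/kernel/mm/transparent_hugepage/enabled") as handle:
            print("active THP mode:", active_thp_mode(handle.read()))
    except OSError:
        print("active THP mode: unknown (sysfs file not readable)")
    # Assumption: this reflects the script's environment, not mongod's.
    tunables = os.environ.get("GLIBC_TUNABLES", "")
    print("glibc rseq disabled:", glibc_rseq_disabled(tunables))
```

An active THP mode of "never" means THP is disabled, which is the old recommendation and no longer the right one for MongoDB 8.0.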
To verify that TCMalloc is running with per-CPU caches, check the following fields in the serverStatus output:
- tcmalloc.usingPerCpuCaches is true
- tcmalloc.tcmalloc.cpu_free is greater than 0
See the following page for more details:
https://www.mongodb.com/docs/v8.0/administration/tcmalloc-performance/
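The serverStatus check described above can also be automated. The sketch below is a hedged example: it works on a plain dictionary shaped like the tcmalloc section of the serverStatus output, and the sample values are made up for illustration. Because the exact capitalization of the flag may differ between builds, the helper looks the key up case-insensitively. That is a defensive assumption, not something the documentation prescribes, and the function names are hypothetical.

```python
def find_key(document, name):
    """Return the value of `name` in `document`, ignoring letter case."""
    for key, value in document.items():
        if key.lower() == name.lower():
            return value
    return None


def uses_per_cpu_caches(server_status):
    """Apply the two checks: the per-CPU flag is true and cpu_free > 0."""
    tcmalloc = server_status.get("tcmalloc", {})
    flag = find_key(tcmalloc, "usingPerCpuCaches")
    cpu_free = tcmalloc.get("tcmalloc", {}).get("cpu_free", 0)
    return flag is True and cpu_free > 0


# Hypothetical excerpt of a serverStatus document (values are made up).
sample_status = {
    "tcmalloc": {
        "usingPerCpuCaches": True,
        "tcmalloc": {"cpu_free": 1048576},
    }
}

print("per-CPU caches in use:", uses_per_cpu_caches(sample_status))
```

With PyMongo installed, a real document could be obtained with something like MongoClient(uri).admin.command("serverStatus") and passed to the same function.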
Testing time
Let’s now run the same kinds of workloads against MongoDB 7.0 and MongoDB 8.0 and compare the results.
The servers used for the tests had the following specifications:
- 4 CPUs
- 4 GB RAM
- OS: Ubuntu 24.04 LTS
POCDriver was used to generate the workloads. Every test ran for 10 minutes on both servers using 4 parallel threads.
The two versions compared were Percona Server for MongoDB 7.0.26-14 and Percona Server for MongoDB 8.0.16-5.
Here are the results of the tests. Higher is better.
INTENSIVE INSERTS AND UPDATES WITH OTHER READS
avg ops per sec
| | PSMDB 7.0 | PSMDB 8.0 | % improvement |
| INSERTS | 55,784 | 71,752 | +28.62% |
| _id LOOKUPS | 1,883 | 2,529 | +34.31% |
| UPDATES | 17,178 | 17,963 | +4.57% |
| RANGE QUERIES | 753 | 874 | +16.07% |
INTENSIVE UPDATES AND RANGE QUERIES
avg ops per sec
| | PSMDB 7.0 | PSMDB 8.0 | % improvement |
| INSERTS | 0 | 0 | – |
| _id LOOKUPS | 0 | 0 | – |
| UPDATES | 64,091 | 78,568 | +22.59% |
| RANGE QUERIES | 411 | 565 | +37.47% |
INTENSIVE _id LOOKUPS WITH FEW UPDATES AND RANGE QUERIES
avg ops per sec
| | PSMDB 7.0 | PSMDB 8.0 | % improvement |
| INSERTS | 0 | 0 | – |
| _id LOOKUPS | 10,647 | 13,279 | +24.72% |
| UPDATES | 1,408 | 1,647 | +16.97% |
| RANGE QUERIES | 307 | 339 | +10.42% |
INTENSIVE RANGE QUERIES AND UPDATES
avg ops per sec
| | PSMDB 7.0 | PSMDB 8.0 | % improvement |
| INSERTS | 0 | 0 | – |
| _id LOOKUPS | 0 | 0 | – |
| UPDATES | 1,372 | 1,615 | +17.71% |
| RANGE QUERIES | 7,779 | 8,307 | +6.79% |
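For reference, the "% improvement" column is the relative change in average operations per second between the two versions. The following short Python sketch recomputes it for the first workload, using the numbers from the table. The function name is just an illustrative choice.

```python
def percent_improvement(before, after):
    """Relative change from `before` to `after`, expressed in percent."""
    return (after - before) / before * 100


# Average ops per second for the first workload: (PSMDB 7.0, PSMDB 8.0).
first_workload = {
    "INSERTS": (55784, 71752),
    "_id LOOKUPS": (1883, 2529),
    "UPDATES": (17178, 17963),
    "RANGE QUERIES": (753, 874),
}

for operation, (psmdb_70, psmdb_80) in first_workload.items():
    change = percent_improvement(psmdb_70, psmdb_80)
    print(f"{operation}: {change:+.2f}%")
```

The same function can be applied to the other three tables. Rows where both versions show 0 operations are skipped, since a relative change is undefined there.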

Conclusions
As promised by the official documentation, MongoDB 8.0 is indeed faster than MongoDB 7.0, and our tests confirm the declared benefits. Of course, the real gains depend on multiple factors, such as custom tuning or different hardware, and your specific scenario may not show the same kind of improvement we measured. For this reason, testing a new version is always recommended before moving it to production. Still, we are confident that the benefits provided by the new TCMalloc with per-CPU caches are impressive.


