Impact of memory allocators on MySQL performance

The MySQL server makes heavy use of dynamic memory allocation, so a good choice of memory allocator is quite important for proper utilization of CPU and RAM resources. An efficient memory allocator should improve scalability, increase throughput, and keep the memory footprint under control. In this post I am going to check the impact of several memory allocators on the performance and scalability of MySQL server in read-only workloads.

For my testing I chose the following allocators: lockless, jemalloc-2.2.5, jemalloc-3.0, tcmalloc (gperftools-2.0), glibc-2.12.1 (new malloc, CentOS 6.2 system), glibc-2.13 (old malloc), glibc-2.13 (new malloc), and glibc-2.15 (new malloc).

Let me clarify a bit about malloc in glibc. Starting from glibc-2.10 it has shipped two malloc implementations, selectable with the configure option --enable-experimental-malloc. (You can find details about the new malloc here.) Many distros switched to this new malloc in 2009. In my experience this new malloc did not always behave efficiently with MySQL, so I decided to include the old one in the comparison as well. I used glibc-2.13 for that purpose because the --enable-experimental-malloc option was removed from glibc sources in later releases.

I built all allocators from source (except the system glibc 2.12.1) with the stock CentOS gcc (version 4.4.6 20110731). All of them were built with -O3. I used LD_PRELOAD for lockless, jemalloc-2.2.5, jemalloc-3.0 and tcmalloc, and for the glibc builds I prefixed the mysqld command line accordingly.
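As a minimal sketch, the non-glibc runs are launched by injecting the allocator library via the dynamic linker; the library path below is illustrative, not the exact one used in the tests:

```shell
#!/bin/sh
# Build the mysqld start command for a given allocator library.
# An empty argument means "no preload", i.e. the stock libc allocator.
start_cmd() {
    if [ -n "$1" ]; then
        # Inject the allocator into mysqld via the dynamic linker.
        echo "LD_PRELOAD=$1 mysqld"
    else
        # No preload: mysqld uses the system glibc malloc.
        echo "mysqld"
    fi
}

start_cmd /usr/local/lib/libjemalloc.so.1
start_cmd ""
```

In a real run the echoed command is executed instead of printed; the same pattern works for lockless, tcmalloc and both jemalloc versions.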

  • Testing details:
    • Cisco UCS C250 box
    • Percona Server 5.5.24
    • 2 read-only scenarios: OLTP_RO and POINT_SELECT from the latest sysbench-0.5
    • dataset consists of 4 sysbench tables (50M rows each), ~50 GB of data / CPU-bound case
    • innodb_buffer_pool_size=52G
  • For every malloc allocator perform the following steps:
    • start Percona Server either with LD_PRELOAD set to the allocator library or with the glibc prefix (see above); get the RSS/VSZ size of mysqld
    • warm up with ‘select avg(id) from sbtest$i FORCE KEY (PRIMARY)’ and then OLTP_RO for 600 sec
    • run the OLTP_RO/POINT_SELECT test cases, 300 sec each, varying the number of threads: 8/64/128/256/512/1024/1536
    • stop the server; get the RSS/VSZ size of mysqld
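The per-allocator cycle above can be sketched as a dry run that only prints the commands; the sysbench flags follow the 0.5-style command line, and the script name and exact option spellings are assumptions:

```shell
#!/bin/sh
# Dry-run sketch of one allocator's benchmark cycle: echoes each command
# instead of executing it. Paths and option spellings are illustrative.
run_cycle() {
    # Warmup: scan every sbtest table through the primary key.
    for i in 1 2 3 4; do
        echo "mysql -e 'SELECT AVG(id) FROM sbtest$i FORCE KEY (PRIMARY)'"
    done
    # Main runs: 300 s per point, stepping through the thread counts.
    for t in 8 64 128 256 512 1024 1536; do
        echo "sysbench --test=oltp.lua --max-time=300 --num-threads=$t run"
    done
}

run_cycle
```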

The best throughput/scalability is with lockless/jemalloc-3.0/tcmalloc. jemalloc-2.2.5 drops slightly at higher numbers of threads; on the response-time graph (see below) there are spikes for it that may be caused by some contention in the library. All glibc variants based on the new malloc demonstrate notable drops as concurrency increases, almost two times at high thread counts. At the same time, glibc-2.13 built with the old malloc looks good; its results are very similar to lockless/jemalloc-3.0/tcmalloc.

For the POINT_SELECT test with increasing concurrency we have two allocators that handle the load very well: tcmalloc and, only slightly behind, glibc-2.13 with the old malloc. Then come jemalloc-3.0/lockless/jemalloc-2.2.5, and last are the glibc allocators based on the new malloc. Along with the best throughput/scalability, the tcmalloc runs also demonstrate the best response time (30-50 ms at high thread counts).

Besides throughput and latency there is one more factor that should be taken into account – memory footprint.

memory allocator        mysqld RSS growth (kB)   mysqld VSZ growth (kB)
lockless                             6,966,736              105,780,880
jemalloc-2.2.5                         214,408                3,706,880
jemalloc-3.0                           216,084                5,804,032
tcmalloc                               456,028                  514,544
glibc-2.13-new-malloc                  210,120                  232,624
glibc-2.13-old-malloc                  253,568                1,006,204
glibc-2.12.1-system                    162,952                  215,064
glibc-2.15-new-malloc                5,106,124                  261,636


Only two allocators, lockless and glibc-2.15 with the new malloc, notably increased the RSS memory footprint of the mysqld server, by more than 5 GB. Memory usage for the other allocators looks more or less acceptable.

Taking into account all three factors (throughput, latency and memory usage) for the above POINT_SELECT/OLTP_RO types of workload, the most suitable allocators are tcmalloc, jemalloc-3.0 and glibc-2.13 with the old malloc.

An important point to take away is that a newer glibc with the new malloc implementation may NOT be suitable and may show worse results than older releases.


To cover some questions raised in the comments, I reran the OLTP_RO/POINT_SELECT tests with jemalloc-2.2.5/jemalloc-3.0/tcmalloc, varied /sys/kernel/mm/transparent_hugepage/enabled (always|never) and gathered the mysqld size with ‘ps --sort=-rss -eo pid,rss,vsz,pcpu’ during the test run. As a reminder, the whole test cycle looks like this: start server, warmup, OLTP_RO test, POINT_SELECT test. So on the charts below you will see how the mysqld footprint changes during the test cycle and what the impact of disabling huge pages is.
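The footprint sampling boils down to reading the resident set size of mysqld at intervals; a minimal helper (reading VmRSS from /proc rather than parsing ps output, which is an equivalent substitute) might look like:

```shell
#!/bin/sh
# Minimal sketch of sampling a process's resident set size in kB,
# equivalent to the rss column of 'ps --sort=-rss -eo pid,rss,vsz,pcpu'.
rss_kb() {
    awk '/^VmRSS:/ {print $2}' "/proc/$1/status"
}

# THP would be toggled (as root) before each run, e.g.:
#   echo never > /sys/kernel/mm/transparent_hugepage/enabled

rss_kb $$    # sample the current shell once; for mysqld, pass its pid
```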

You can read Part 2 of this topic here.


Comments (15)

  • Bradley C Kuszmaul

    Is this on CentOS 6 with transparent huge pages enabled? I’ve found that transparent huge pages make jemalloc’s memory footprint much larger; without transparent huge pages, jemalloc tends to have a smaller footprint than glibc.

    July 5, 2012 at 12:23 pm
    • Alexey Stroganov


      Yes, that is CentOS 6.2 with huge pages enabled. I’ve run several additional tests (see my last charts) and can confirm that with huge pages enabled the memory footprint is notably larger.

      July 9, 2012 at 7:17 pm
  • Hander

    And on windows? Any benchmark?

    July 5, 2012 at 12:31 pm
    • Alexey Stroganov


      Sorry, I don’t have a Windows box for testing. It seems that there are several malloc replacements around for the Win platform (like nedmalloc) and indeed it would be nice to try something similar there.

      July 9, 2012 at 7:24 pm
  • gebi

    Your last memory accounting statistics don’t take into account the local memory caching the allocators use internally.
    Eg. with tcmalloc it greatly depends on when you do your statistics as it has thread local memory pools and may not instantly free your memory.

    July 5, 2012 at 1:33 pm
  • Mark Callaghan

    At the risk of asking for too much work from you, I agree with gebi that it would be good to see RSS during the test. jemalloc also has per-thread malloc caches that can be disabled via configuration options.

    I cannot figure out the RSS metric. For jemalloc 3.0, did RSS grow by ~200 kB or 200,000 kB? My confusion might be from US versus non-US style.

    July 5, 2012 at 8:06 pm
    • Alexey Stroganov


      >At the risk of asking for too much work from you, I agree with gebi that it would be good to see RSS during the test. jemalloc also has per-thread malloc caches that can be disabled via configuration options.

      Done. check my latest charts for jemalloc-2.2.5/3.0 and tcmalloc.

      re: RSS metric – yes, I agree it is not really clear: that is 200,000 kB, i.e. ~200 MB.

      July 9, 2012 at 7:27 pm
  • Dimitri

    Many thanks for sharing it! Very useful stuff!

    Regarding Mark’s request to graph RSS during the test — you already know how to do it easily with dim_STAT 😉


    July 6, 2012 at 12:26 pm
    • Alexey Stroganov


      You are welcome!
      Sure, I know. Btw: when the next improved version will be released? 🙂

      July 9, 2012 at 7:29 pm
  • Raghavendra

    Quite interesting.

    1. Since we are comparing malloc performance, it is quite possible that something other than just the malloc worsened glibc’s performance across versions (since glibc is not just malloc, whereas the others are just mallocs), though in the case of glibc 2.13 it looks clear because you have the same version with it turned off and on.

    1.1 On a related note, maybe we can have a comparison between corresponding libcs (musl, uClibc, dietlibc and glibc; they each seem to have their own mallocs); this way we can also compare the whole stack: pthread, malloc, etc.

    2. Regarding the RSS size, I would be careful before judging because -O3 generally causes compiler to be more lenient wrt. space.

    2.1 Also, I noticed in “I built all allocators from sources(except system glibc 2.12.1) with stock CentOS gcc(version 4.4.6 20110731). All of them were built with -O3” — so glibc 2.12.1 seems to have been built with -O2 (default for distro builds) and others with O3, is that right?

    3. Another important measure is when these allocators switch from sbrk to mmap. For glibc, according to documentation, it should be MMAP_THRESHOLD which is 128 kB though in later glibcs the boundary may have become fuzzy, for other allocators it will need to be checked.

    4. Regarding tcmalloc, it is known for being more memory hungry, so numbers are not surprising. (I guess the ‘hunger’ is not a bug since for google environment it works better).

    5. Regarding the ‘new’ glibc malloc, the details are quite scary if they are anything to go by. The migration from a per-core pool to a per-thread pool may have been bad, since assuming a cap of 8 pools per core per thread, for 100 threads and 24 cores, it can turn into a maximum of 192,000 pools on a 64-bit system.

    July 8, 2012 at 3:19 pm
  • Raghavendra

    Regarding transparent huge pages, the defrag/compaction of THP seems to be the root cause (basically it makes the ‘reclaim’ synchronous). It is fixed in Linux 3.3 and above, though CentOS/RHEL kernels may have backported the fix.
    Interestingly, it looks like only CentOS has it enabled whereas upstream RHEL does not.

    July 10, 2012 at 10:15 am
  • Mark Horsburgh

    Really interesting.

    You’ve already done an enormous amount of work here so I’m certainly not suggesting you do any more. However, we’ve had a lot of success with the scalable allocator from Intel’s Thread Building Blocks library. It’s particularly good when you have a reasonably heavily threaded workload. I don’t know if mysql would fit into that category.

    December 12, 2014 at 3:29 am
  • Pengcheng Li

    Hi, Alexey,

    May I ask for your MySQL configuration, i.e. its my.cnf file?

    And I am evaluating my allocator on MySQL. I only got the OLTP benchmark suite. Are there essential differences between it and your test programs? Can you kindly tell me where to get your benchmarks? I want to replay your experiments, because they are important to what I am doing.

    I saw you varied the number of threads for your benchmarks. Do you use the same number of threads for MySQL? And how did you configure the thread counts for MySQL?



    May 12, 2015 at 11:13 am
  • Omer Katz (@the_drow)

    Can you share the benchmark code?
    I’d like to test with jemalloc 4.x as well.

    February 11, 2016 at 12:13 am
    • Pengcheng Li

      Do you have other benchmarks to share? Thanks,

      February 11, 2016 at 9:14 am

Comments are closed.
