InnoDB memory allocation, ulimit, and OpenSUSESasha Pachev
I recently encountered an interesting case. A customer reported that mysqld crashed on start on OpenSUSE 11.2 kernel 184.108.40.206-0.2-desktop x86_64 Â with 96 GB RAM when the innodb_buffer_pool_size was set to anything more than 62 GB. I decided to try it with 76 GB. The error message was an assert due to a failed malloc() in ut_malloc_low() in ut/ut0mem.c inside InnoDB source code. InnoDB wraps the majority of its memory allocations in ut_malloc_low(), so to get an idea of the pattern of requested allocations I added a debugging fprintf() to tell me how much was being allocated and whether it was successful.
I discovered something interesting. I expected the allocation to fail on the 76 GB of the buffer pool, due to some weird memory mapping issue and a continuous block of 76 GB not being available. However, that is not what happened. 76 GB buffer was allocated successfully. What was failing is the allocation of 3.37GB after that. What in the world could InnoDB need that was 3.37 GB? There was nothing in the settings that asked for anything close to 3 GB explicitly.
Source code is the ultimate documentation, and I took advantage of that. My good friend GDB guided me to buf_pool_init() in buf/buf0buf.c. There I found the following:
buf_pool->frame_mem = os_mem_alloc_large(
UNIV_PAGE_SIZE * (n_frames + 1),
That was the buffer pool itself, the 76 GB of it. And now the buffer pool’s friend:
buf_pool->blocks = ut_malloc(sizeof(buf_block_t) * max_size);
3.6 GB of it!
From the comments in the code (InnoDB code actually has very good comments), max_size is the maximum number of buffer pool pages (16K each), n_frames which is the same thing unless AWE is used, but it was not used, so I did not worry about it.
What shall we call that friend? It is used for storing some meta information about buffer pool pages. The most natural name I could come up with from reading the source code is the blocks array.
Thus we can see that we are allocating another chunk that is in proportion to the setting of innodb_buffer_pool_size for the blocks array. The exact proportions will probably vary from version to version, but roughly about 1 G for every 25 G of the buffer pool. This can become significant in the proper innodb_buffer_pool_size estimations when the system has a lot of RAM and you want to have the largest possible innodb_buffer_pool_size. Do not forget to give the blocks array some room!
While this was an interesting investigation, it nevertheless did not explain why there was not enough room for a 76 GB buffer pool. Even with the extra 3.37 GB allocation, there was still some free memory. Or was there? Maybe some hidden monster was eating it up? I quickly wrote this hack to prove or disprove the monster’s presence.
I verified that I could allocate and initialize two chunks of 40 GB from two separate processes, but not 80 GB from one. In fact, 80GB allocation failed right in malloc(), did not even get to initialization. I tested it with allocating 70 GB concurrently in each process so as to overrun physical memory + swap. Both allocations were successful, one initialized successfully, the other was killed by the OOM kill during initialization.
This smelled like a low ulimit, and sure enough it was. ulimit -m ulimited; ulimit -v unlimited did the magic, and mysqld successfully started with an 80 GB buffer pool. Apparenly OpenSUSE defaults are set in proportion to physical memory to keep the memory-hungry applications from taking the system down. On this particular system (96 GB physical memory, 2 GB swap it decided to set the virtual memory ulimit (-v) to 77.27 GB, and the physical memory (-m) to 80.40 GB).