
Finding MySQL Scaling Problems Using perf

February 5, 2020 | Posted In: MariaDB, MySQL

The thing I wish I’d learned while still a DBA is how to use perf. Conversely, after moving to a developer role, it is rare to get access to real external client workloads for a perf recording. To bridge this gap, I hope to encourage a bit of perf usage so that DBAs can report bugs/feature requests in more detail to MySQL developers, who can then serve your needs better.

A recent client request showed how useful perf is in exposing the areas of MySQL that are otherwise well tuned, but can still be in need of coding improvements that increase throughput. The client had a TPCCRunner (variant) workload that they wanted to run on a POWER 9 CPU, in READ-COMMITTED mode, and they received less performance than they hoped. With 2 sockets, 20 cores per socket, 4 threads per core, and 256G of RAM in total, it had enough resources.

With such abundance of resources, the perf profile exposed code bottlenecks not normally seen.

The principles driving MySQL development for a considerable time have been to a) maintain correctness, and b) deliver performance, usually meaning the CPU should be the bottleneck. The whole reason for large InnoDB buffer pools, InnoDB MVCC / LSN, group commit, table caches, thread caches, indexes, the query planner, etc., is to ensure that all hot data is in memory, ready to be processed by the CPU in the most efficient way.

Based on this principle, without a requirement to sync to persistent storage for durability, a read-mostly SQL load should scale linearly up to the CPU capacity. Ideally, after the CPU capacity has been reached, throughput should stay at the capacity limit and not degrade. Practical overheads of thread management mean this is never perfectly achieved; however, it is the goal.

Steps to using perf

To install and use perf, use the following steps:

1. Install perf

This is a standard package and is closely tied to the Linux kernel version. The package name varies per distro:
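The original install commands did not survive extraction; as a sketch, typical package names look like this (the exact names are assumptions and vary by distribution and kernel version):

```shell
# Debian/Ubuntu: perf ships with the kernel tools packages
sudo apt-get install linux-tools-common "linux-tools-$(uname -r)"

# RHEL/CentOS/Fedora
sudo yum install perf
```

Verify the install with `perf --version`, which should report a version close to your running kernel's.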

Distributions normally set the sysctl kernel.perf_event_paranoid to a level which makes perf hard to use (or exploit), and this may need to be adjusted to obtain our recording. Large perf recordings across many hardware threads can require file descriptors and memory, and their limits may need to be increased with care (see the kernel documentation).
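A minimal sketch of the adjustments described above (the chosen values are assumptions; pick a paranoia level appropriate to your site's security policy):

```shell
# Lower the paranoia level so profiling is permitted; 1 allows profiling
# with kernel-level samples, -1 disables all restrictions (use with care).
sudo sysctl -w kernel.perf_event_paranoid=1

# Raise the open-file limit for the recording shell: perf opens roughly
# one event file descriptor per hardware thread being recorded.
ulimit -n 65536
```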

2. Install debug symbols (a.k.a. debug info) for MySQL

Debug symbols mapping memory addresses to real server code can assist greatly in understanding the recorded results. The debug info needs to map to the exact build of MySQL (both version number and its origin).

Distros provide debug information in separate package repositories (distribution instructions: Ubuntu, Debian, RHEL, Fedora) and MySQL, MariaDB and Percona provide debug info packages in their repositories without additional configuration.
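As a sketch, the debug info install commands look like this; the package names here are assumptions and differ per distro and per MySQL variant:

```shell
# RHEL/CentOS: the yum debuginfo plugin fetches matching symbols
# (package name is illustrative)
sudo debuginfo-install Percona-Server-server-57

# Debian/Ubuntu: -dbgsym packages from the dbgsym repository
# (package name is illustrative)
sudo apt-get install percona-server-server-5.7-dbgsym
```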

If compiling from source, the default cmake option -DCMAKE_BUILD_TYPE=RelWithDebInfo includes debug info, as the name suggests.
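A minimal source-build sketch using that build type:

```shell
# Optimized build that retains debug symbols (the default build type)
cmake . -DCMAKE_BUILD_TYPE=RelWithDebInfo
make -j"$(nproc)"
```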

3. Ensure that your table structures and queries are sane.

MySQL works well when the database table structures, indexes, and queries are in a natural, simple form. Asking MySQL developers to make poorly designed table structures/queries perform better will attract a low priority, as such changes can add overhead to simple queries.

4. Ensure that you have tuned the database for the workload.

MySQL has a lot of system variables, and using the performance schema and status variables assists in creating an optimally tuned MySQL instance before beginning perf measurements.

5. Ensure that the active data is off disk

To ensure your measurement is at its maximum, have the hot part of the data loaded into memory; this lets perf focus on recording the CPU-related areas under stress, rather than time spent waiting on disk loads.

For example, the TPCCRunner workload described earlier took about an hour before it reached its maximum transaction throughput. TPCCRunner displays this, but generally watch for a leveling out of the queries per second over several minutes.

When starting/stopping mysqld for testing, innodb_buffer_pool_dump_at_shutdown=1 / innodb_buffer_pool_load_at_startup=1 / innodb_buffer_pool_dump_pct=100 will help restore the InnoDB buffer pool significantly quicker.

6. Know what workload is being measured

A batch job may not have the same throughput requirements. It may also impact the concurrent workload you are recording by creating a longer history list length, InnoDB buffer pool pressure, etc.

The application that generates the workload should be on a different server, different VM or in some way constrained in CPU to avoid resource contention with mysqld. Check the client side to ensure that it isn’t overloaded (CPU, network) as this could be indirectly constraining the server side workload.

Measuring

With a hot workload running, let’s start some measurement.

Perf uses hardware (the PMU) to assist its recording work, but there are limits to hardware support, so there’s a point where it will affect your workload; start slow. Perf works by building a frequency distribution of where the mysqld process is spending its time. To examine a function that is taking 0.1% of the time, 1000 samples will likely show it once; as such, a few thousand samples is sufficient. The number of samples is the product of perf record’s -F/--freq (which may default to several thousand per second), the recording duration, and the number of CPUs.
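As a quick sanity check on sample counts, multiply the three factors; the values here match the 10 Hz x 20 s x 160 hardware-thread recording used later in this article:

```shell
freq=10 duration=20 cpus=160
# expected total samples across all hardware threads
echo $(( freq * duration * cpus ))   # 32000
```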

If your SQL queries are all running in much less than a second and occurring frequently, then a high frequency recording for a short duration is sufficient. If some query occurs less often, with a high CPU usage spike, a helper program FlameScope will be able to narrow down a perf recording to a usable sample interval.

Analysis involves looking through a number of sets of data. Below I show a pattern of using name as a shell variable, and a large one line command to conduct a number of recordings in sequence. In my case, I cycled through RC  (read-committed) vs RR (repeatable read), different compile options -O0 , kernel versions, final stages of warmup (compared to test run) and even local changes to mysqld ( thread_local_ut_rnd_ulint_counter ). Keeping track of these alongside the same measurement of the test run output helps to correlate results more easily.
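The one-line command itself did not survive extraction; a minimal sketch of what it likely looked like, assuming $name tags each run and mysqld is running locally (file names and durations are illustrative):

```shell
name=read_mostly_EVENT_RC-run2
pid=$(pgrep -x mysqld)

# longer reference recording without stack traces
perf record -o "mysql-5.7.28-${name}.perf" -F 10 -p "$pid" -- sleep 60

# shorter recording with -g call graphs
perf record -o "mysql-5.7.28-${name}.g.perf" -F 10 -g -p "$pid" -- sleep 20
```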

With the above command, the recording is constrained to mysqld (-p $pid), at -F 10 samples per second for (sleep) 20 seconds. A longer recording without the stack trace (-g) is taken as a reference point to see if the shorter recording with -g is a fair sample. 10 Hz x 20 seconds may not seem like many samples; however, this occurred on each of the 160 threads. A recording with -g is needed because a perf profile that shows all its time in kernel or pthread mutex (lock) code doesn’t mean much without knowing which lock it is and where it was accessed from.

Perf record with the -g call graph (also known as stack chain or backtrace) adds to the size of the recording and the overhead of measurement. To ensure that there isn’t too much perf data (resulting in workload stalls), get the frequency and duration right before enabling -g.

Perf stats were measured to identify CPU cache efficiency, instructions-per-cycle efficiency, instruction throughput (watch out for frequency scaling), faults (connecting real memory to virtual addresses; should be low after warmup), and migrations between NUMA nodes.
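A sketch of such a counter measurement (the -d flag adds the cache-related detail counters; the pid lookup assumes a locally running mysqld):

```shell
# Summarize cycles, instructions, cache references/misses, and faults
# for the server process over a 20-second window.
perf stat -d -p "$(pgrep -x mysqld)" -- sleep 20
```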

During measurement look at htop/top to ensure that the CPUs are indeed loaded. Also check the client side isn’t flooded with connection errors that could impact the validity of the recorded results.

Analysis

Viewing a perf recording

perf report is used to view a perf recording textually. It is during the report stage that the debug info is read and the Linux kernel image is used to resolve kernel symbols. Run the report as nice -n 19 perf report to ensure it has the lowest CPU priority if you are at all concerned about production impact. It’s quite possible to do this on a different server, provided the same kernel and MySQL packages are installed.

perf report --input mysql-5.7.28-event_run2_warmup_run1.perf --stdio

This shows the percentage of CPU time measured when the CPU instruction pointer was in a particular function, grouped by function name. To find out why malloc or the kernel’s do_syscall_64 appears so often, the stack recording is needed.

Viewing a perf recording with a stack

When the perf record used -g, then -g can be used with perf report to show the breakdown. By default it groups functions together with the functions they call, as below.

perf report -i mysql-5.7.28-event_run2_warmup_run1.g.perf

In MySQL, as expected, the most significant CPU load is in the threads. Most of the time this is a user connection, under the handle_connection function, which parses and executes the SQL. In different situations you might see InnoDB background threads or replication threads: understanding which thread is causing the load is important at the top level. Then, to continue the analysis, use the perf report --no-children option. This shows approximately the same as the non-g recording; however, it provides the ability to hit Enter on a function to show all the call stacks that reach that particular function.

perf report -g --no-children --input mysql-5.7.28-event_run2_warmup_run1.g.perf

This shows a common call stack into handle_query, where JOIN::exec and st_select_lex::optimize are the diverging point. If evaluate_join_record and the other sub-functions were expanded, the bottom level of the call graph would show rec_get_offsets_func.

Disassembly (annotation)

In the ncurses interface, selecting ‘a’ (annotate) on a particular function calls out to the objdump (binutils) disassembler to show where in the function the highest frequency occurred, mapped to the commented C++ code above it.

As compilers have significant understanding of the architecture, and given that the C/C++ language provides significant freedom in generating code, it’s sometimes quite difficult to parse from assembly back to the C/C++ source. In complex operations, C++ variables don’t have an easy translation to CPU registers. Inlined functions are also particularly hard as each inlining can further be optimized to a unique assembly depending on its location. To understand the assembly, I recommend focusing on the loads, stores, maths/conditions with constants and branches to see which register maps to which part of the MySQL server code in the context of the surrounding code.

E.g. annotation on rec_get_offsets_func:

Here we see that dict_table_is_comp is an expanded inline function at the top of rec_get_offsets_func: the movzbl .. %eax. The dominant CPU use in the function, however, isn’t part of this. The testb 0x1 (DICT_TF_COMPACT) … %rsi is the testing of the flag, with je afterwards to return from the function.

Example – mutex contention

Compared to the performance profile on x86 above under ‘Viewing a perf recording’, this is what the performance profile looked like on POWER.

perf report --input mysql-5.7.28-read_mostly_EVENT_RC-run2.perf --stdio

What stands out clearly is the top two entries that didn’t appear on x86. Looking closer at MVCC::view_open:

perf report -g --no-children --input mysql-5.7.28-read_mostly_EVENT_RC-run2.g.perf

Annotation of MVCC::view_open

Due to inlining, within MVCC::view_open one of the mutexes got expanded out, and the random number is used in the spin-wait for the lock. PolicyMutex<TTASEventMutex<GenericPolicy> >::enter expanded to exactly the same code.

We see here that the load (ld) into r8 is the slowest part of this. In mysql-5.7.28, ut_rnd_ulint_counter is an ordinary global variable, meaning it’s shared between threads. The simple line of code ut_rnd_ulint_counter = UT_RND1 * ut_rnd_ulint_counter + UT_RND2 shows the result stored back into the same variable. To understand why this didn’t scale, we need to understand cache lines.

note: MVCC::view_open did show up in the x86 profile, at 0.23% and had the lock release as the highest cpu point. For x86 PolicyMutex<TTASEventMutex<GenericPolicy> >::enter was at 0.32%.

Cache Lines

All modern CPUs that are likely to run MySQL have some form of cache hierarchy. The principle is that a frequently accessed memory location, like ut_rnd_ulint_counter, can be copied into cache, with the CPU pushing it back to memory at some point. To ensure behavior is consistent, cache lines are the unit by which the coherency protocol allocates a piece of memory to a particular CPU. Cache lines can be read-only or exclusive, and a protocol between CPU cores ensures that exclusive access belongs to one CPU only. When one CPU modifies a memory location it gains an exclusive cache line, and the cached value in other CPUs’ caches is flushed. At which cache level this flushing occurs, and to what extent caches are shared between CPUs, is quite architecture-specific. However, as a rough metric, cache access is orders of magnitude faster than RAM.

In the perf recording above, storing ut_rnd_ulint_counter back invalidates the cache line for the other CPUs, and this is why the load instruction is slow. MySQL did fix this in 5.7.14 but reverted the fix in 5.7.20 (presumably assuming some performance degradation in thread-local storage). In MySQL 8.0+, ut_rnd_ulint_counter is a C++11 thread_local variable, which has a faster implementation. MariaDB-10.3.5 avoided the issue by removing the random delay in InnoDB mutexes. Thread-local variables reduce contention because each thread has its own independent memory location; because this is only a random number seed, there’s no need to synchronize the results.

Cache collisions

The impacts of ut_rnd_ulint_counter, however, aren’t limited to the variable itself. Cache lines reserve blocks of memory according to the cache line size of the architecture (x86: 64 bytes; ARM64 and POWER: 128 bytes; s390: 256 bytes). High in the CPU profile is the btr_cur_search_to_nth_level function. This is part of InnoDB’s scanning of an index, and it would be easy to discount its high CPU usage. Looking at the disassembly, however, shows:

The lbz is a load-byte instruction referring to btr_search_enabled, the MySQL server variable associated with the SQL global innodb_adaptive_hash_index. As a global system variable, this isn’t changed frequently, probably only once at startup. As such, it should be able to rest comfortably in the cache of all CPUs in a read-only cache line.

To find out why, the relative address is examined in the mysqld executable:
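The symbol lookup itself did not survive extraction; a sketch of how the addresses can be obtained, assuming an unstripped mysqld binary in the current directory:

```shell
# List the (demangled) data symbols of interest with their addresses
nm -C mysqld | grep -E 'btr_search_enabled|ut_rnd_ulint_counter'
```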

Taking the last two characters off the hexadecimal address 0000000011aa1b40, the other variables in the same 256 (0x100) byte address range can be examined.
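This masking can be sketched in shell arithmetic; the second address here is hypothetical, placed 16 bytes away to mirror the layout described in this section:

```shell
# Base of the cache line containing an address: clear the low bits
# (0x3f for x86_64's 64-byte lines, 0x7f for POWER/ARM64's 128 bytes,
#  0xff for s390's 256 bytes).
line_base() { printf '0x%x\n' $(( $1 & ~$2 )); }

line_base 0x11aa1b40 0x7f   # -> 0x11aa1b00
line_base 0x11aa1b50 0x7f   # -> 0x11aa1b00 (same 128-byte line)
```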

ut_rnd_ulint_counter is stored 16 bytes away from btr_search_enabled. Because of this, every invalidation of the ut_rnd_ulint_counter cache line also invalidates btr_search_enabled on POWER, along with every other variable in the 0000000011aa1b00 to 0000000011aa1b40 address range for x86_64 (or to 0000000011aa1b80 for POWER and ARM64, or to 0000000011aa1c00 for s390). There are no rules governing the layout of these variables, so it was only luck that x86_64 was not affected here.

While the contended management of ut_rnd_ulint_counter remains an unsolved problem in MySQL-5.7, putting all the global variables into the same memory block keeps them out of the cache lines of other, potentially frequently changed variables, and so prevents unintended contention. Global variables are ideal candidates for this as they are changed infrequently yet are hot in code paths. Pulling all the global variables into the same location maximizes cache use, with fewer cache lines that can remain in read-only mode.

To achieve this co-location, MySQL uses the Linux kernel mechanism of section attributes on variables and a linker script to bind their location. This is described in MySQL bug 97777 and the MariaDB task MDEV-21145. Segmenting the system global variables using this mechanism resulted in a 5.29% increase in transactions per minute in the TPCCRunner benchmark (using MUTEXTYPE=sys).

Mutex Implementations

Having discovered what I thought was a smoking gun, with ut_rnd_ulint_counter contention as the source of the benchmark’s throughput problems, the thread_local implementation from MySQL-8.0 was back-ported to MySQL-5.7.28. Disappointingly, the throughput was approximately the same. From a perf profile perspective, the CPU usage was no longer in the inlined ut_rnd_gen_ulint function; it was in the sync_array_wait_event and sync_array_reserve_cell functions.

These functions are largely wrappers around a pthread locking implementation. From the version history, they were imported from MySQL-5.0 with only minor modification (in 2013), in contrast to the pthread implementation, which receives significant maintenance from the glibc community, including the major CPU architecture manufacturers.

Thankfully, MySQL has a compile option -DMUTEXTYPE=sys that results in pthreads being used directly. This increased x86 performance marginally, but much more significantly on POWER (understandable, since multiple sync_array elements share the same cache line at POWER’s 128-byte line size, compared to x86_64’s 64 bytes). I’ll soon benchmark these changes in more detail and generate some bug reports to get this default changed, in distro packages at least.
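A minimal rebuild sketch using that option (MUTEXTYPE accepts event, sys, or futex):

```shell
# Rebuild with pthread-based mutexes instead of InnoDB's event mutexes
cmake . -DMUTEXTYPE=sys
make -j"$(nproc)"
```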

Encode – Another example

Even while carrying out this investigation, a MariaDB Zulip chat exposed a benchmark of ENCODE (notably deprecated in MySQL-5.7) having scaling problems. Using the exact techniques described here, it was quick to generate and extract a perf profile (MDEV-21285) and stack that showed every initial guess at the source of the problem, including mine, was incorrect. With the perf profile, however, the nature of the problem is quite clear, unlike the solution; that requires more thought.

Report/Show your perf recordings

Alongside its low overhead during recording, the useful aspect of perf from a DBA perspective is that perf stack traces show only the MySQL code being executed and the frequency of its execution. There is no exposed database data, SQL queries, or table names in the output. However, Perf Events and tool security (item 4) indicates that registers can be captured in a perf recording, so be careful about sharing raw perf data.

Once the raw perf data is processed by perf report, with the correct debug info and kernel, there are no addresses, only mysqld and kernel function names, in its output. The most that is exposed by sharing a perf report is the frequency of use of MySQL code that was obtained externally. This should be enough to convince strict and competent managers and security people to allow sharing the perf recordings.

With some realistic expectations (code can’t execute in 0 time, all of the database can’t be in CPU cache) you should now be able to show the parts of MySQL that are limiting your queries.

Resulting bug reports

| Description | MySQL | MariaDB |
| --- | --- | --- |
| separate global variables (from hot variables) using linker script (ELF) | bug 97777 | MDEV-21145 |
| comparative perf measurements of MUTEXTYPE=sys vs event (or futex) | TODO | MDEV-21452 |
| buffer_pool page get counts contention (innodb_buffer_pool_read_requests status variable) | bug 97822 | MDEV-21212 |

 

Disclaimer: The postings on this site are the authors own and don’t necessarily represent IBM’s positions, strategies or opinions.

The content in this blog is provided in good faith by members of the open source community. Percona has not edited or tested the technical content. Views expressed are the authors’ own. When using the advice from this or any other online resource test ideas before applying them to your production systems, and always secure a working back up.

Photo by Ricardo Gomez Angel on Unsplash
