pt-pmp is a profiler tool that creates and summarizes full stack traces of processes on Linux. It was inspired by http://poormansprofiler.org and helped Percona Support resolve many performance issues. In this blog post, I will present an improved pt-pmp that can collect stack traces with minimal impact on the production environment.
TLDR; Starting from Percona Toolkit 3.6.0, use option –dumper set to eu (eu-stack) or pteu (pt-eustack-resolver) instead of default dumper gdb. This increases pt-pmp performance around seven times for dumper eu and 65 times for dumper pteu with no information loss. You need to install elfutils to explore these options.
If you need summary only for specific threads, pass their numbers to option –tids (regular expressions supported).
In high concurrent environments, when you notice performance issues, stack traces that pt-pmp summarize can help you quickly find the solution.
For example, in the listing below, we clearly see that 101 threads are waiting for a metadata lock (MDL_wait::timed_wait(mdl.cc:1837)) that these threads try to acquire when open tables for the query (open_tables_for_query(sql_base.cc:6824)):
1 2 3 4 5 6 7 8 |
101 __futex_abstimed_wait_common64,__futex_abstimed_wait_common,__GI___futex_abstimed_wait_cancelable64,__pthread_cond_wait_common,___pthread_cond_timedwait64, native_cond_timedwait(thr_cond.h:99),my_cond_timedwait(thr_cond.h:148),inline_mysql_cond_timedwait(mysql_cond.h:242),MDL_wait::timed_wait(mdl.cc:1837), MDL_context::acquire_lock(mdl.cc:3540),MDL_context::acquire_lock(mdl.cc:3377),open_table_get_mdl_lock(sql_base.h:530),open_table(sql_base.cc:3108), open_and_process_table(sql_base.cc:5122),open_tables(sql_base.cc:5942),open_tables_for_query(sql_base.cc:6824),Sql_cmd_dml::execute(sql_select.cc:532), mysql_execute_command(sql_parse.cc:4944),Prepared_statement::execute(sql_prepare.cc:3646),Prepared_statement::execute_loop(sql_prepare.cc:3058), mysqld_stmt_execute(sql_prepare.cc:1914),dispatch_command(sql_parse.cc:2024),do_command(sql_parse.cc:1501),handle_connection(connection_handler_per_thread.cc:308), pfs_spawn_thread(pfs.cc:2987),start_thread,clone3 … |
Once the reason for the stall is found, DBA can consider ways to recover from the situation.
However, this ease of use is not free. pt-pmp works by attaching gdb to the database server process and running the backtrace (bt) command for all running threads. This leads to consequences that Baron Schwartz described in his blog post in the year 2011:
- The server freezes for the duration of the process. This is the most obvious impact on the running server: GDB forklifts the process and gets a stack trace from every thread, then lets it go on working. But this can take a while. It’s usually a couple of seconds, but on big servers with a lot of memory and many threads, it can take much longer (I’ve seen tens of seconds but heard reports of minutes).
- The server can crash. I haven’t seen this myself, but I’ve heard reports from others.
- The server can be left in an unstable state. Sometimes, when GDB detaches, the process’s state isn’t quite as it was before. I have rarely seen this, but the other day I had a customer experience a very slow-running server that was using tons of CPU time and lots of system CPU, and exhibiting the classic signs of InnoDB kernel_mutex contention. I was not able to find anything wrong, and was still trying to determine whether the sudden slowness was due to some cause such as an increase in traffic or just InnoDB having trouble, when the customer Skyped me to say that they’d restarted the server and it resolved the problems. This server had previously been analyzed with GDB around the time the problems began.
This is still true in 2024. That is why we added performance improvements for pt-pmp in Percona Toolkit 3.6.0.
eu-stack support
eu-stack is a utility from the elfutils package. eu-stack prints a stack for each thread in a process but, unlike gdb, does it with minimal impact.
pt-pmp now allows users to choose if they want to obtain stack traces with gdb (default) or with eu-stack. Additionally, Percona Toolkit now includes the utility pt-eustack-resolver that collects stack traces even faster. To collect summaries with eu-stack you need to set option –dumper (short form: -d) to eu (eu-stack) or pteu (pt-eustack-resolver).
eu-stack summaries have the same information as gdb summaries, so you won’t lose any data.
Let’s compare by example:
gdb:
1 2 3 4 5 6 7 8 9 10 |
$ time ./bin/pt-pmp --pid `cat /tmp/12345/data/mysql_sandbox12345.pid` -d gdb Cum 19 Oca 2024 16:58:34 +03 29 ../sysdeps/unix/sysv/linux/poll.c: No such file or directory. 100 __futex_abstimed_wait_common64,__futex_abstimed_wait_common,__GI___futex_abstimed_wait_cancelable64,__pthread_cond_wait_common,___pthread_cond_timedwait64,native_cond_timedwait(thr_cond.h:99),my_cond_timedwait(thr_cond.h:148),inline_mysql_cond_timedwait(mysql_cond.h:242),MDL_wait::timed_wait(mdl.cc:1837),MDL_context::acquire_lock(mdl.cc:3540),MDL_context::acquire_lock(mdl.cc:3377),open_table(sql_base.h:530),open_and_process_table(sql_base.cc:5122),open_tables(sql_base.cc:5942),open_tables_for_query(sql_base.cc:6824),Sql_cmd_dml::execute(sql_select.cc:532),mysql_execute_command(sql_parse.cc:4944),Prepared_statement::execute(sql_prepare.cc:3646),Prepared_statement::execute_loop(sql_prepare.cc:3058),mysqld_stmt_execute(sql_prepare.cc:1914),dispatch_command(sql_parse.cc:2024),do_command(sql_parse.cc:1501),handle_connection(connection_handler_per_thread.cc:308),pfs_spawn_thread(pfs.cc:2987),start_thread,clone3 … 1 epoll_wait,epoll_dispatch(epoll.c:465),event_base_loop(event.c:1998),ngs::Server::run_task(server.cc:83),operator(server.cc:148),std::__invoke_impl<void,(invoke.h:61),std::__invoke_r<void,(invoke.h:111),std::_Function_handler<void(),(std_function.h:290),std::function::operator(std_function.h:290),ngs::Scheduler_dynamic::worker(scheduler.cc:234),pfs_spawn_thread(pfs.cc:2987),start_thread,clone3 real 0m20,752s user 0m20,767s sys 0m0,504s |
eu-stack:
1 2 3 4 5 6 7 8 9 10 11 |
$ time ./bin/pt-pmp --pid `cat /tmp/12345/data/mysql_sandbox12345.pid` -d eu Cum 19 Oca 2024 17:00:14 +03 100 __GI___futex_abstimed_wait_cancelable64,pthread_cond_timedwait,MDL_wait::timed_wait(thr_cond.h:99:32),MDL_context::acquire_lock(mdl.cc:3540:38),open_table(sql_base.cc:3091:50),open_tables(sql_base.cc:5122:25),open_tables_for_query(sql_base.cc:6824:18),Sql_cmd_dml::execute(sql_select.cc:532:30),mysql_execute_command(sql_parse.cc:4944:36),Prepared_statement::execute(sql_prepare.cc:3646:33),Prepared_statement::execute_loop(sql_prepare.cc:3058:18),mysqld_stmt_execute(sql_prepare.cc:1914:23),dispatch_command(sql_parse.cc:2024:28),do_command(sql_parse.cc:1501:34),handle_connection(connection_handler_per_thread.cc:308:23),pfs_spawn_thread(pfs.cc:2987:24),start_thread,__clone3 … 1 epoll_wait,epoll_dispatch(epoll.c:465:8),event_base_loop(event.c:1998:9),ngs::Server::run_task(server.cc:83:18),std::_Function_handler<(server.cc:148:57),ngs::Scheduler_dynamic::worker(std_function.h:590:9),pfs_spawn_thread(pfs.cc:2987:24),start_thread,__clone3 1 clock_nanosleep@GLIBC_2.2.5,__nanosleep,srv_master_thread(this_thread_sleep.h:82:20),void(invoke.h:61:36),??(libstdc++.so.6.0.30),start_thread,__clone3 1 clock_nanosleep@GLIBC_2.2.5,__nanosleep,buf_lru_manager_thread(this_thread_sleep.h:82:20),std::thread::_State_impl<std::thread::_Invoker<std::tuple<Detached_thread,(invoke.h:61:36),??(libstdc++.so.6.0.30),start_thread,__clone3 real 0m3,054s user 0m3,029s sys 0m0,073s |
pt-eustack-resolver:
1 2 3 4 5 6 7 8 9 |
$ time ./bin/pt-pmp --pid `cat /tmp/12345/data/mysql_sandbox12345.pid` -d pteu Cum 19 Oca 2024 17:01:04 +03 100 __futex_abstimed_wait_common64,__pthread_cond_wait_common,native_cond_timedwait,MDL_context::acquire_lock,MDL_context::set_force_dml_deadlock_weight,open_and_process_table,open_tables_for_query,Sql_cmd_dml::execute,mysql_execute_command,Prepared_statement::execute,Prepared_statement::execute_loop,mysqld_stmt_execute,dispatch_command,do_command,handle_connection,pfs_spawn_thread,start_thread,__clone3 … 1 epoll_wait,epoll_dispatch,event_base_loop,std::atomic::load,std::__shared_count<,std::_Function_base::~_Function_base,pfs_spawn_thread,start_thread,__clone3 real 0m0,365s user 0m0,320s sys 0m0,073s |
You see that pt-eustack-resolver runs significantly faster but removes source code coordinates. Otherwise, stack traces are not cropped.
–readnever support for gdb
Another performance improvement for pt-pmp is support of the option –readnever for gdb. This option tells gdb to don’t read debug symbol files when producing summaries. It runs way faster than gdb with default options but may produce unusable summaries if run for the release server:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
$ time ./bin/pt-pmp --pid `cat /tmp/12345/data/mysql_sandbox12345.pid` -d gdb --readnever Cum 19 Oca 2024 17:04:30 +03 113 __futex_abstimed_wait_cancelable64(libc.so.6),::??,::??,::??,::??,?? 10 syscall(libc.so.6),io_getevents(libaio.so.1),?? 6 __futex_abstimed_wait_cancelable64(libc.so.6),::??,::??,::??,?? 4 __futex_abstimed_wait_cancelable64(libc.so.6),?? 2 ppoll(libc.so.6),?? 2 clock_nanosleep(libc.so.6),?? 1 sigtimedwait(libc.so.6),::??,?? 1 poll(libc.so.6),::??,::??,::??,::??,::??,Mysqld_socket_listener::listen_for_connection_event,::??,::??,::??,::??,::??,::??,::??,::??,Per_thread_connection_handler::add_connection 1 __futex_abstimed_wait_cancelable64(libc.so.6),vtable,::??,COND_compress_gtid_table,?? 1 __futex_abstimed_wait_cancelable64(libc.so.6),::??,::??,mysql_bin_log,::??,?? 1 epoll_wait(libc.so.6),::??,::??,::??,::??,?? real 0m0,560s user 0m0,900s sys 0m0,230s |
While you can get traces similar to the below example if use a debug version of the database server, we recommend using this option only in cases when all other options cause noticeable performance impact.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
$ time ./bin/pt-pmp --pid `cat /home/sveta/build/ps-8.0/mysql-test/var/run/mysqld.1.pid` -d gdb --readnever Cum 19 Oca 2024 17:07:07 +03 48 __futex_abstimed_wait_cancelable64(libc.so.6),::??,::??,::??,?? 45 __futex_abstimed_wait_cancelable64(libc.so.6),::??,::??,Commit_stage_manager::get_instance::shared_instance,?? 11 __futex_abstimed_wait_cancelable64(libc.so.6),::??,::??,::??,::??,?? 4 __lll_lock_wait(libc.so.6),__pthread_mutex_cond_lock(libc.so.6),?? 3 __futex_abstimed_wait_cancelable64(libc.so.6),?? 2 ut_delay,TTASEventMutex::is_free,TTASEventMutex::spin_and_try_lock,TTASEventMutex::enter,PolicyMutex::enter,mutex_enter_inline<PolicyMutex<TTASEventMutex<GenericPolicy>>>,buf_page_init_for_read,buf_read_page_low,buf_read_page,Buf_fetch::read_page,Buf_fetch_normal::get,Buf_fetch::single_page,buf_page_get_gen,buf_page_get,trx_undo_page_get,trx_rseg_t::validate_curr_size,trx_rseg_t::latch,trx_undo_assign_undo,trx_undo_report_row_operation,btr_cur_upd_lock_and_undo,btr_cur_update_in_place,btr_cur_optimistic_update,row_upd_clust_rec,row_upd_clust_step,row_upd,row_upd_step,row_update_for_mysql_using_upd_graph,row_update_for_mysql,ha_innobase::update_row,handler::ha_update_row,Sql_cmd_update::update_single_table,Sql_cmd_update::execute_inner,Sql_cmd_dml::execute,mysql_execute_command,Prepared_statement::execute,Prepared_statement::execute_loop,mysqld_stmt_execute,dispatch_command,do_command,handle_connection,pfs_spawn_thread,start_thread(libc.so.6),::??,::??,::??,::??,::??,libc::??(libc.so.6),::??,::??,::??,?? 1 ut_delay,TTASEventMutex::is_free,TTASEventMutex::spin_and_try_lock,TTASEventMutex::enter,PolicyMutex::enter,mutex_enter_inline<PolicyMutex<TTASEventMutex<GenericPolicy>>>,trx_sys_mutex_enter,trx_set_rw_mode,lock_table,row_search_mvcc,ha_innobase::index_read,handler::index_read_map,handler::ha_index_read_map,handler::read_range_first,ha_innobase::read_range_first,handler::multi_range_read_next,DsMrr_impl::dsmrr_next,ha_innobase::multi_range_read_next,handler::ha_multi_range_read_next,IndexRangeScanIterator::Read,Sql_cmd_update::update_single_table,Sql_cmd_update::execute_inner,Sql_cmd_dml::execute,mysql_execute_command,Prepared_statement::execute,Prepared_statement::execute_loop,mysqld_stmt_execute,dispatch_command,do_command,handle_connection,pfs_spawn_thread,start_thread(libc.so.6),::??,::??,::??,::??,::??,libc::??(libc.so.6),::??,::??,::??,?? … 1 epoll_wait(libc.so.6),::??,::??,::??,::??,::??,epoll_dispatch,event_base_loop,ngs::Socket_events::loop,ngs::Socket_acceptors_task::loop,ngs::Server::run_task,ngs::Server::start_tasks::{lambda#1}::operator,std::__invoke_impl<,ngs::Server::start_tasks::{lambda#1}&>,std::enable_if::typestd::__invoke_r,std::_Function_handler::_M_invoke,std::function::operator,ngs::Scheduler_dynamic::worker,ngs::Scheduler_dynamic::worker_proxy,pfs_spawn_thread,start_thread(libc.so.6),::??,::??,::??,::??,?? 1 clock_nanosleep(libc.so.6),::??,::??,::??,::??,::??,std::chrono::durationstd::chrono::__duration_cast_impl::__cast,srv_master_sleep,srv_master_main_loop,srv_master_thread,std::__invoke_impl<,>,std::__invoke_result::typestd::__invoke,std::_Bind::__call,std::_Bind::operator,Detached_thread::operator<>,std::__invoke_impl<,Detached_thread,>,std::__invoke_result::typestd::__invoke,std::thread::_Invoker::_M_invoke,std::thread::_Invoker::operator,std::thread::_State_impl::_M_run,libstdc++::??(libstdc++.so.6),::??,::??,?? 1 clock_nanosleep(libc.so.6),::??,::??,::??,::??,::??,std::chrono::durationstd::chrono::__duration_cast_impl::__cast,buf_lru_manager_sleep_if_needed,buf_lru_manager_thread,std::__invoke_impl<,,unsignedlong&>,std::__invoke_result::typestd::__invoke,std::_Bind::__call,std::_Bind::operator,Detached_thread::operator<,unsignedlong>,std::__invoke_impl<,Detached_thread,,unsignedlong>,std::__invoke_result::typestd::__invoke,std::thread::_Invoker::_M_invoke,std::thread::_Invoker::operator,std::thread::_State_impl::_M_run,libstdc++::??(libstdc++.so.6),?? real 0m1,790s user 0m3,619s sys 0m0,500s |
Thread-specific traces
pt-pmp now supports option –tids (short form: -t). If you pass regular expressions, separated by commas, as argument for this option, pt-pmp will only print traces for threads, matching those expressions.
For example,
pt-pmp -t 21201,23846 will print summaries only for threads 21201,23846
pt-pmp -t ^25 will print summaries only for threads; those numbers start from 25
pt-pmp -t 21201,237.8 will print summaries only for thread 21201, and threads those numbers match regular expression 237.8, e.g. 23708 or 23758
You will find examples of the original stack trace, full summary, and summary for specified threads only in the Percona Toolkit test suite:
- https://github.com/percona/percona-toolkit/blob/3.x/t/pt-pmp/samples/stacktrace001.in
- https://github.com/percona/percona-toolkit/blob/3.x/t/pt-pmp/samples/stacktrace001.out
- https://github.com/percona/percona-toolkit/blob/3.x/t/pt-pmp/samples/stacktrace001_t25.out
- https://github.com/percona/percona-toolkit/blob/3.x/t/pt-pmp/samples/stacktrace001_t21201_237_8.out
- https://github.com/percona/percona-toolkit/blob/3.x/t/pt-pmp/samples/stacktrace001_t21201_23846.out
Option –tids takes effect when pt-pmp produces a summary, therefore you can use it on saved samples, created by any version of pt-pmp. This could be handy for Support engineers who work on data, sent by their customers.
Acknowledgments
These improvements are based on the work by Alexey Stroganov, originally published in Percona Labs. We intentionally removed quickstack support from the original patch because it does not produce useful stack traces with MySQL 8.0 and made gdb parameter –readnever optional.
Summary
Collecting stack traces and summarizing them with eu-stack is much faster than with default dumper gdb. Install elfutils and use pt-pmp with option –dumper set either to eu or pteu unless the use of gdb is absolutely necessary.
Filter examined threads with option –tids if needed.