One thing I like to tell people is that you shouldn’t view Percona as a “software” company, but as a “solution” company. Our goal has always been to provide the best solution that meets each customer’s situation, rather than push our own software, regardless of whether it is the best fit or not. As a result, we have customers running all kinds of MySQL “flavors”: MySQL, MariaDB, Percona Server, Amazon RDS and Google Cloud SQL. We’re happy to help customers be successful with the technology of their choice, and advise them on alternatives when we see a better fit.
One area where I have been increasingly uneasy is our advanced high availability support with Percona XtraDB Cluster and other Galera-based technologies. In 2011, when we started working on Percona XtraDB Cluster together with Codership, we needed to find a way to fund the development of Galera technology and bring it to market. So we made a deal which, while providing the needed development resources, also required us to price Percona XtraDB Cluster support as a very expensive add-on option. While this made sense at the time, it also meant few companies could afford XtraDB Cluster support from Percona, especially at large scale.
As the years passed, Galera technology became the mainstream high-end high availability option. In addition to being available in Percona XtraDB Cluster, it has been included in MariaDB, as well as Galera Cluster for MySQL. Additionally, the MySQL team at Oracle started developing an alternative technology to solve the same problem: MySQL Group Replication. With all these changes, it was impossible for us to provide affordable support for Percona XtraDB Cluster under our previous commercial agreement with Codership, which reflected a very different market situation than the one we now face.
As a result, over a year ago we exited our support partnership agreement with Codership and moved the support and development function in-house. These changes have proven to be positive for our customers, allowing us to better focus on their priorities and provide better response time for issues, as these no longer require partner escalation.
Today we’re taking the next natural step – we will no longer require customers to purchase Percona XtraDB Cluster as a separate add-on. Percona will include support for XtraDB Cluster and other Galera-based replication technologies in our Enterprise and Premier support levels, as well as our Percona Care and Managed Services subscriptions. Furthermore, once it becomes generally available, we will also support Oracle’s MySQL Group Replication technology at no additional cost, so our customers have access to the best high availability technology for their deployment.
As part of this change, you will also see us focusing on hardening XtraDB Cluster and Galera technology, making it better suited for demanding business workloads, as well as more secure and easier to use. All of our changes will be available as 100% open source solutions and will also be contributed back to the Galera development team to incorporate into their code base if they wish.
I believe making the Galera code better is the most appropriate action for us at this point!
In this post, we’ll examine why an initial flushing analysis shows that Performance Schema data is incomplete.
Having shown the performance impact of Percona Server 5.7 patches, we can now discuss their technical reasoning and details. Let’s revisit the MySQL 5.7.11 performance schema synch wait graph from the previous post, for the case of unlimited InnoDB concurrency:
First of all, this graph is a little “nicer” than reality, which limits its diagnostic value. There are two reasons for this. The first is that page cleaner worker threads are invisible to Performance Schema (see bug 79894). This alone limits PFS value in 5.7 if, for example, one tries to select only the events in the page cleaner threads, or monitors low concurrency where the cleaner thread count is a non-negligible part of the total thread count.
To understand the second reason, let’s look into PMP for the same setting. Note that selected intermediate stack frames were removed for clarity, especially in the InnoDB mutex implementation.

660 pthread_cond_wait,enter(ib0mutex.h:850),buf_dblwr_write_single_page(ib0mutex.h:850),buf_flush_write_block_low(buf0flu.cc:1096),buf_flush_page(buf0flu.cc:1096),buf_flush_single_page_from_LRU(buf0flu.cc:2217),buf_LRU_get_free_block(buf0lru.cc:1401),...
631 pthread_cond_wait,buf_dblwr_write_single_page(buf0dblwr.cc:1213),buf_flush_write_block_low(buf0flu.cc:1096),buf_flush_page(buf0flu.cc:1096),buf_flush_single_page_from_LRU(buf0flu.cc:2217),buf_LRU_get_free_block(buf0lru.cc:1401),...
337 pthread_cond_wait,PolicyMutex<TTASEventMutex<GenericPolicy>(ut0mutex.ic:89),get_next_redo_rseg(trx0trx.cc:1185),trx_assign_rseg_low(trx0trx.cc:1278),trx_set_rw_mode(trx0trx.cc:1278),lock_table(lock0lock.cc:4076),...
324 libaio::??(libaio.so.1),LinuxAIOHandler::collect(os0file.cc:2448),LinuxAIOHandler::poll(os0file.cc:2594),...
241 pthread_cond_wait,PolicyMutex<TTASEventMutex<GenericPolicy>(ut0mutex.ic:89),trx_write_serialisation_history(trx0trx.cc:1578),trx_commit_low(trx0trx.cc:2135),...
147 pthread_cond_wait,enter(ib0mutex.h:850),trx_undo_assign_undo(ib0mutex.h:850),trx_undo_report_row_operation(trx0rec.cc:1918),...
112 pthread_cond_wait,mtr_t::s_lock(sync0rw.ic:433),btr_cur_search_to_nth_level(btr0cur.cc:1008),...
 83 poll(libc.so.6),Protocol_classic::get_command(protocol_classic.cc:965),do_command(sql_parse.cc:935),handle_connection(connection_handler_per_thread.cc:301),...
 64 pthread_cond_wait,Per_thread_connection_handler::block_until_new_connection(thr_cond.h:136),...
The top wait in both PMP and the graph is the 660 samples of mutex enter in buf_dblwr_write_single_page, which is the doublewrite mutex. Now try to find the nearly as hot 631 samples of event wait in buf_dblwr_write_single_page in the PFS output. You won’t find it, because InnoDB OS event waits are not annotated in Performance Schema. In most cases this is correct, as OS event waits tend to be used when there is no work to do: the thread waits for work to appear, or for time to pass. But in the report above, the waiting thread is blocked from proceeding with useful work (see bug 80979).
Now that we’ve shown the two reasons why PFS data is not telling the whole server story, let’s take the PMP data instead and consider how to proceed. Those top two PMP waits suggest that 1) the server is performing a lot of single-page flushes, and 2) the concurrency of those single-page flushes is limited by the eight doublewrite single-page flush slots available, making the wait for a free slot significant.
Two options become apparent at this point: either make the single-page flush doublewrite more parallel, or reduce the single-page flushing in the first place. We have been big fans of the latter option since our 5.6 performance work, where we made Percona Server avoid single-page flushes entirely by introducing the innodb_empty_free_list_algorithm option, with “backoff” as the default.
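For reference, a minimal my.cnf sketch of that Percona Server option (5.6/5.7; this is illustrative, not a full configuration):

[mysqld]
# "backoff" makes user threads wait for background LRU flushing to refill
# the free lists instead of performing single-page flushes themselves
innodb_empty_free_list_algorithm = backoff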
The next post in the series will describe how we removed single-page flushing in 5.7.
Correctly configuring the use of available memory resources is one of the most important things you have to get right with MySQL for optimal performance and stability. As of MySQL 5.7, the default configuration uses a very limited amount of memory – leaving the defaults in place is one of the worst things you can do. But configuring memory incorrectly can result in even worse performance (or even crashes).
The first rule of configuring MySQL memory usage is that you never want MySQL to cause the operating system to swap. Even minor swapping activity can dramatically reduce MySQL performance. Note the keyword “activity” here. It is fine to have some used space in your swap file, as there are probably parts of the operating system that are unused when MySQL is running, and it’s a good idea to swap them out. What you don’t want is constant swapping during operation, which is easily seen in the “si” and “so” columns of the vmstat output.
Example: No Significant Swapping
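Illustrative vmstat 1 output for this case (the numbers are made up for illustration, not taken from a real system): some swap space is in use (swpd), but si and so stay at zero.

$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 3  0 131072 350284  27632 421964    0    0    12    48 1503 2785 18  5 76  1  0
 2  0 131072 349940  27632 422104    0    0     0    24 1489 2612 17  4 78  1  0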
Example: Heavy Swapping Going On
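Illustrative output for the problem case (again, made-up numbers): nonzero si and so on every line mean the system is continuously swapping memory in and out, and the high wa column shows the resulting I/O wait.

$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  7 4194304  14204   1232  88456 2312 3256  2380  3312 3871 5280  4  9 12 75  0
 0  9 4194304  13876   1240  87320 2801 2964  2842  3008 4015 5545  3 10 10 77  0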
If you’re running Percona Monitoring and Management, you can also look at the Swap Activity graph in the System Overview dashboard.
If you have spikes of more than 1MB/sec, or constant swap activity, you might need to revisit your memory configuration.
MySQL memory allocation is complicated. There are global buffers, per-connection buffers (which depend on the workload), and some uncontrolled memory allocations (for example, inside stored procedures), all contributing to the difficulty of computing how much memory MySQL will really use for your workload. It is easier to check by looking at the virtual memory size (VSZ) that MySQL uses. You can get it from “top”, or by running ps aux | grep mysqld:

mysql 3939 30.3 53.4 11635184 8748364 ? Sl Apr08 9106:41 /usr/sbin/mysqld
The 5th column here shows VSZ usage (about 11GB).
Note that the VSZ is likely to change over time. It is often a good idea to plot it in your monitoring system and set an alert to ping you when it hits a specified threshold. Don’t allow the mysqld process VSZ to exceed 90% of the system memory (less if you’re running more than just MySQL on the system).
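As a rough sketch of such an alert (the threshold, mail command, and address are placeholders – adapt them to your monitoring setup), a cron job could run something like:

# Assumes a single mysqld instance on the host
TOTAL_KB=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
VSZ_KB=$(ps -o vsz= -p "$(pidof mysqld)" | tr -d ' ')
if [ "$VSZ_KB" -gt $((TOTAL_KB * 90 / 100)) ]; then
    echo "mysqld VSZ ${VSZ_KB}kB is over 90% of ${TOTAL_KB}kB RAM" | mail -s "mysqld memory alert" dba@example.com
fi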
It’s a good idea to start on the safe side by conservatively setting your global and per-connection buffers, and then increase them as you go. Many settings can be changed online, including innodb_buffer_pool_size in MySQL 5.7.
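For example, in MySQL 5.7 the InnoDB buffer pool can be resized without a restart (the resize proceeds in the background, in innodb_buffer_pool_chunk_size increments):

mysql -e "SET GLOBAL innodb_buffer_pool_size = 8 * 1024 * 1024 * 1024;"  # grow to 8GB online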
So how do you decide how much memory to allocate to MySQL versus everything else? In most cases you shouldn’t commit more than 90% of your physical memory to MySQL, as you need to have some reserved for the operating system and things like caching binary log files, temporary sort files, etc.
There are also cases when MySQL should use significantly less than 90% of memory: for example, when other memory-hungry applications or additional database instances run on the same system.
Once you know how much memory you want the MySQL process to have as a whole, you’ll need to think about how that memory should be used inside MySQL. The first part of memory usage in MySQL is workload-related – if you have many connections active at the same time running heavy selects that use a lot of memory for sorting or temporary tables, you might need a lot of memory (especially if Performance Schema is enabled). In other cases this amount of memory is minimal. You’ll generally need somewhere between 1GB and 10GB for this purpose.
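In MySQL 5.7 you can get a feel for this workload-related usage from the Performance Schema memory instrumentation, assuming the memory instruments are enabled, for example through the sys schema:

mysql -e "SELECT * FROM sys.memory_global_by_current_bytes LIMIT 10;"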
Another thing you need to account for is memory fragmentation. Depending on the memory allocation library you’re using (glibc, TCMalloc, jemalloc, etc.), operating system settings such as Transparent Huge Pages (THP), and your workload, memory usage may grow over time (until it reaches some steady state). Memory fragmentation can also account for 10% or more of additional memory usage.
Finally, let’s think about the various global buffers and caches. In typical cases, innodb_buffer_pool_size is the main one to worry about. But you might also need to consider key_buffer_size, tokudb_cache_size and query_cache_size, as well as table_cache and table_open_cache. These are also responsible for global memory allocation, even though they are not counted in bytes. Performance Schema may also take a lot of memory, especially if you have a large number of connections or tables in the system.
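A quick way to audit what these are currently set to (a sketch – extend the list to the variables relevant to your build):

mysql -e "SHOW GLOBAL VARIABLES WHERE Variable_name IN
  ('innodb_buffer_pool_size','key_buffer_size','query_cache_size',
   'table_open_cache','table_definition_cache');"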
When you specify the size of the buffers and caches, make sure you understand what the setting actually covers. For innodb_buffer_pool_size, remember there is another 5-10% of memory allocated for additional data structures – and this number is larger if you’re using compression or have set innodb_page_size smaller than 16K. For tokudb_cache_size, it’s important to remember that the setting is a guide, not a “hard” limit: the cache can actually grow slightly larger than the specified limit.
For systems with large amounts of memory, the database cache is going to be by far the largest memory consumer, and you’re going to allocate most of your memory to it. When you add extra memory to the system, it is typically to increase the database cache size.
Let’s do some math for a specific example. Assume you have a system (physical or virtual) with 16GB of memory. We are only running MySQL on this system, with the InnoDB storage engine and innodb_flush_method=O_DIRECT, so we can allocate 90% (or 14.4GB) of memory to MySQL. For our workload, we assume connection handling and other MySQL connection-based overhead will take up 1GB (leaving 13.4GB). 0.4GB is likely to be consumed by various other global buffers (innodb_log_buffer_size, table caches, other miscellaneous needs, etc.), which now leaves 13GB. Considering the 5-7% overhead that the InnoDB buffer pool has, a sensible setting is innodb_buffer_pool_size=12G – which is what we very commonly see working well for systems with 16GB of memory.
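Put into my.cnf form, the example works out to something like this (the log buffer value is just an illustrative part of the ~0.4GB of other global allocations):

[mysqld]
innodb_buffer_pool_size = 12G
innodb_flush_method     = O_DIRECT
innodb_log_buffer_size  = 64M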
Now that we have configured MySQL memory usage, we should also look at the OS configuration. The first question to ask: if we don’t want MySQL to swap, should we even have a swap file enabled? In most cases, the answer is yes – you want the swap file enabled (strive for 4GB minimum, and no less than 25% of installed memory), for two reasons: it gives the system room to maneuver in a memory emergency, and it lets the OS swap out genuinely idle processes.
As we only want the swap file used in emergencies, such as when there is no memory available or to swap out idle processes, we want to reduce the operating system’s tendency to swap (echo 1 > /proc/sys/vm/swappiness). Without this configuration setting, you might find the OS swapping out portions of MySQL just because it feels it needs to increase the amount of available file cache (which is almost always the wrong choice for MySQL).
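To make the setting survive a reboot, persist it via sysctl (the exact file location can vary by distribution):

echo "vm.swappiness = 1" >> /etc/sysctl.conf
sysctl -p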
The next OS configuration item to consider is the Out Of Memory (OOM) killer. You may have seen a message like this in your kernel log file:
Apr 24 02:43:18 db01 kernel: Out of memory: Kill process 22211 (mysqld) score 986 or sacrifice child
When MySQL itself is at fault, killing it is a pretty rational thing to do. However, it’s also possible the real problem was one of the batch activities you’re running: scripts, backups, etc. In that case, you probably want those processes, rather than MySQL, to be terminated if the system runs out of memory.
To make MySQL a less likely candidate to be killed by the OOM killer, you can adjust its score with the following:
echo '-800' > /proc/$(pidof mysqld)/oom_score_adj
This will make the Linux kernel prefer killing other heavy memory consumers first.
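Note that oom_score_adj set this way does not survive a mysqld restart. On systemd-based systems, one possible way to make it permanent is a drop-in unit file (the service name may be mysql or mysqld depending on your distribution):

# /etc/systemd/system/mysqld.service.d/oom.conf
[Service]
OOMScoreAdjust=-800

followed by systemctl daemon-reload and a service restart.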
Finally, on a system with more than one CPU socket, you should pay attention to NUMA when it comes to MySQL memory allocation. In newer MySQL versions, you want to enable innodb_numa_interleave=1. In older versions, you can either manually run numactl --interleave=all before you start the MySQL server, or use the numa_interleave configuration option in Percona Server.
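For example (innodb_numa_interleave is read-only at runtime, so it goes into my.cnf; it was added in MySQL 5.6.27 and 5.7.9):

[mysqld]
innodb_numa_interleave = 1

For older versions, start the server process under numactl instead:

numactl --interleave=all /usr/sbin/mysqld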