In this post, I’ll discuss the fix for lp:1704195 (migrated to PS-2503), which prevents
ANALYZE TABLE from blocking all subsequent queries on the same table.
In November 2017, Percona released a fix for lp:1704195 (migrated to PS-2503), created by Laurynas Biveinis. The fix, included with Percona Server for MySQL since versions 5.6.38-83.0 and 5.7.20-18, stops ANALYZE TABLE from invalidating query and table definition cache content for supported storage engines (InnoDB, TokuDB and MyRocks).
In short, it is now safe to run ANALYZE TABLE in production environments because it won’t trigger a situation where all queries on the same table stack are in the state "Waiting for table flush". Check this blog post for details on how this situation can happen.
When Optimizer decides which index to choose to resolve the query, it uses statistics stored for this table by storage engine. If the statistics are not up to date, Optimizer might choose the wrong index when it creates the query execution plan. This can cause performance to suffer.
To prevent this, storage engines support automatic and manual statistics updates. While automatic statistics updates usually work fine, there are cases when they do not do their job properly.
For example, InnoDB uses 20 sample 16K pages when it updates persistent statistics, and eight 16K pages when it updates transient statistics. If your data distribution is even, it does not matter how big your table is: even for 1T tables, using a sample of 320K is enough. But if your data isn’t even, statistics might get wrongly created. The solution for this issue is to increase either the innodb_stats_transient_sample_pages or innodb_stats_persistent_sample_pages variable. But increasing the number of pages to examine while collecting statistics leads to longer update runs, and thus higher IO activity, which is probably not what you want to happen often.
To control this, you can disable automatic statistics updates for such tables, and schedule a job that periodically runs ANALYZE TABLE.
Will it be safe before the fix for lp:1704195 (migrated to PS-2503)?
Theoretically yes, but we could easily hit a situation as described in this blog post by Miguel Angel Nieto. The article describes what if some long-running query starts and doesn’t finish before ANALYZE TABLE. All the queries on the analyzing table get stuck in the state "Waiting for table flush" at some time.
This happens because before the fix, ANALYZE TABLE worked as follows:
Last two operations are usually fast, but they cannot finish if another query touched either the table share instance or acquired query cache mutex. And, in its turn, it cannot allow for incoming queries to start.
However ANALYZE TABLE modifies table statistics, not table definition!
Practically, it cannot affect already running queries in any way. If a query started before ANALYZE TABLE finished updating statistics, it uses old statistics. ANALYZE TABLE does not affect data in the table. Thus old entries in the query cache will still be correct. It hasn’t changed the definition of the table. Therefore there is no need to remove it from the table definition cache. As a result, we avoid operations 4 and 5 above.
The fix for lp:1704195 (migrated to PS-2503) removes these additional updates and locks required for them, and makes ANALYZE TABLE always safe to run in busy production environments.