In November 2017, Percona released a fix for lp:1704195 (migrated to PS-2503), created by Laurynas Biveinis. The fix, included with Percona Server for MySQL since versions 5.6.38-83.0 and 5.7.20-18, stops ANALYZE TABLE from invalidating query and table definition cache content for supported storage engines (InnoDB, TokuDB and MyRocks).
Why is this important?
In short, it is now safe to run ANALYZE TABLE in production environments, because it no longer triggers a situation where all queries on the same table pile up in the state "Waiting for table flush". Check this blog post for details on how this situation can happen.
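If you suspect you have hit this situation on an older server, you can look for stalled sessions directly. A minimal sketch, assuming a 5.6/5.7 server where `information_schema.processlist` is available:

```sql
-- List sessions stuck behind a table flush; on an affected server these
-- pile up until the long-running query that holds the old table share finishes.
SELECT id, time, state, info
FROM information_schema.processlist
WHERE state = 'Waiting for table flush';
```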
Why do we need to run ANALYZE TABLE?
When the optimizer decides which index to use to resolve a query, it relies on statistics stored for the table by the storage engine. If the statistics are not up to date, the optimizer might choose the wrong index when it creates the query execution plan, and performance suffers.
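For InnoDB tables with persistent statistics (the default since 5.6), you can inspect what the optimizer sees, including when the statistics were last updated. A sketch, with `test.t1` as a placeholder table:

```sql
-- Persistent InnoDB statistics live in the mysql schema; stale last_update
-- values or implausible stat_value numbers hint that ANALYZE TABLE is due.
SELECT index_name, stat_name, stat_value, sample_size, last_update
FROM mysql.innodb_index_stats
WHERE database_name = 'test' AND table_name = 't1';
```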
To prevent this, storage engines support automatic and manual statistics updates. While automatic statistics updates usually work fine, there are cases when they do not do their job properly.
For example, InnoDB samples 20 16KB pages when it updates persistent statistics, and eight 16KB pages when it updates transient statistics. If your data distribution is even, it does not matter how big your table is: even for a 1TB table, a 320KB sample is enough. But if your data is unevenly distributed, the statistics can come out wrong. The solution for this issue is to increase either the innodb_stats_transient_sample_pages or innodb_stats_persistent_sample_pages variable. But examining more pages while collecting statistics makes the update run longer, and thus causes higher IO activity, which is probably not something you want to happen often.
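The sample size can be raised globally or, less invasively, per table, so only the table with the skewed distribution pays the extra IO. A sketch, with `t1` as a placeholder table and the page counts chosen purely for illustration:

```sql
-- Raise the global default for persistent statistics sampling
-- (takes effect the next time statistics are recalculated):
SET GLOBAL innodb_stats_persistent_sample_pages = 128;

-- Or override it for one table only via its STATS_SAMPLE_PAGES option:
ALTER TABLE t1 STATS_SAMPLE_PAGES = 256;
```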
To control this, you can disable automatic statistics updates for such tables, and schedule a job that periodically runs ANALYZE TABLE.
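Both parts of that setup can be done in SQL. A sketch, assuming the event scheduler is enabled (`event_scheduler = ON`) and using `t1` as a placeholder table; a cron job calling the mysql client works just as well:

```sql
-- Stop InnoDB from recalculating statistics for this table automatically:
ALTER TABLE t1 STATS_AUTO_RECALC = 0;

-- Refresh the statistics on our own schedule instead:
CREATE EVENT analyze_t1
  ON SCHEDULE EVERY 1 DAY
  DO ANALYZE TABLE t1;
```

The per-table option keeps automatic updates working for every other table, which is usually what you want.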
Is it safe to run ANALYZE TABLE?
Theoretically yes, but in practice we could easily hit the situation described in this blog post by Miguel Angel Nieto: a long-running query starts and does not finish before ANALYZE TABLE completes. At some point, all queries on the analyzed table get stuck in the state "Waiting for table flush".
This happens because before the fix, ANALYZE TABLE worked as follows:
1. Opens table statistics: concurrent DML operations (INSERT/UPDATE/DELETE/SELECT) are allowed
2. Updates table statistics: concurrent DML operations are allowed
3. Update finished
4. Invalidates the table entry in the table definition cache: concurrent DML operations are forbidden. At this step, ANALYZE TABLE marks the currently open table share instances as invalid. This does not affect running queries: they complete as usual. But incoming queries cannot start until they can re-open the table share instance, and that will not happen until all currently running queries complete.
5. Invalidates the query cache: concurrent DML operations are forbidden
The last two operations are usually fast, but they cannot finish while another query holds the table share instance or the query cache mutex. And until they finish, incoming queries cannot start.
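The steps above are easy to reproduce by hand on a pre-fix server. A hypothetical three-session sketch, with `t1` as a placeholder table and SLEEP() used only to keep the first query running:

```sql
-- Session 1: a deliberately slow query keeps the current table share open.
SELECT COUNT(*), MAX(SLEEP(0.1)) FROM t1;

-- Session 2: completes quickly, but marks the open table share as invalid.
ANALYZE TABLE t1;

-- Session 3: any new query now blocks in "Waiting for table flush"
-- (visible in SHOW PROCESSLIST) until session 1 finishes.
SELECT * FROM t1 LIMIT 1;
```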
However, ANALYZE TABLE modifies table statistics, not the table definition!
In practice, it cannot affect already running queries in any way: a query that started before ANALYZE TABLE finished updating statistics simply uses the old statistics. ANALYZE TABLE does not change the data in the table, so old entries in the query cache are still correct. It does not change the definition of the table either, so there is no need to remove it from the table definition cache. As a result, the fix lets us skip operations 4 and 5 above.