Avoiding MySQL ALTER table downtimeAndrew Moore
MySQL table alterations can interrupt production traffic causing bad customer experience or in worst cases, loss of revenue. Not all DBAs, developers, syadmins know MySQL well enough to avoid this pitfall. DBAs usually encounter these kinds of production interruptions when working with upgrade scripts that touch both application and database or if an inexperienced admin/dev engineer perform the schema change without knowing how MySQL operates internally.
* Direct MySQL ALTER table locks for duration of change (pre-5.6)
* Online DDL in MySQL 5.6 is not always online and may incurr locks
* Even with Percona Toolkit‘s pt-online-schema-change there are several workloads that can experience blocking
Here on the Percona MySQL Managed Services team we encourage our clients to work with us when planning and performing schema migrations. We aim to ensure that we are using the best method available in their given circumstance. Our intentions to avoid blocking when performing DDL on large tables ensures that business can continue as usual whilst we strive to improve response time or add application functionality. The bottom line is that a business relying on access to its data cannot afford to be down during core trading hours.
Many of the installations we manage are still below MySQL 5.6, which requires us to seek workarounds to minimize the amount of disruption a migration can cause. This may entail slave promotion or changing the schema with an ‘online schema change’ tool. MySQL version 5.6 looks to address this issue by reducing the number of scenarios where a table is rebuilt and locked but it doesn’t yet cover all eventualities, for example when changing the data type of a column a full table rebuild is necessary. The topic of 5.6 Online Schema Change was discussed in great detail last year in the post, “Schema changes – what’s new in MySQL 5.6?” by Przemysław Malkowski
With new functionality arriving in MySQL 5.7, we look forward to non-blocking DDL operations such as; OPTIMIZE TABLE and RENAME INDEX. (More info)
The best advice for MySQL 5.6 users is to review the matrix to familiarize with situations where it might be best to look outside of MySQL to perform schema changes, the good news is that we’re on the right path to solving this natively.
Truth be told, a blocking alter is usually going to go unnoticed on a 30MB table and we tend to use a direct alter in this situation, but on a 30GB or 300GB table we have some planning to do. If there is a period of time where activity is low and the this is permissive of locking the table then sometimes it is better execute within this window. Frequently though we are reactive to new SQL statements or a new performance issue and an emergency index is required to reduce load on the master in order to improve the response time.
To pt-osc or not to pt-osc?
As mentioned, pt-online-schema-change is a fixture in our workflow. It’s usually the right way to go but we still have occasions where pt-online-schema-change cannot be used, for example; when a table already uses triggers. It’s an important to remind ourselves of the the steps that pt-online-schema-change traverses to complete it’s job. Lets look at the source code to identify these;
[moore@localhost]$ egrep 'Step' pt-online-schema-change
# Step 1: Create the new table.
# Step 2: Alter the new, empty table. This should be very quick,
# Step 3: Create the triggers to capture changes on the original table and <--(metadata lock)
# Step 4: Copy rows.
# Step 5: Rename tables: orig -> old, new -> orig <--(metadata lock)
# Step 6: Update foreign key constraints if there are child tables.
# Step 7: Drop the old table.
I pick out steps 3 and 5 from above to highlight a source of a source of potential downtime due to locks, but step 6 is also an area for concern since foreign keys can have nested actions and should be considered when planning these actions to avoid related tables from being rebuilt with a direct alter implicitly. There are several ways to approach a table with referential integrity constraints and they are detailed within the pt-osc documentation a good preparation step is to review the structure of your table including the constraints and how the ripples of the change can affect the tables around it.
Recently we were alerted to an incident after a client with a highly concurrent and highly transactional workload ran a standard pt-online-schema-change script over a large table. This appeared normal to them and a few hours later our pager man was notified that this client was experiencing max_connections limit reached. So what was going on? When pt-online-schema-change reached step 5 it tried to acquire a metadata lock to rename the the original and the shadow table, however this wasn’t immediately granted due to open transactions and thus threads began to queue behind the RENAME command. The actual effect this had on the client’s application was downtime. No new connections could be made and all existing threads were waiting behind the RENAME command.
Introduced in 5.5.3 at server level. When a transaction starts it will acquire a metadata lock (independent of storage engine) on all tables it uses and then releases them when it’s finished it’s work. This ensures that nothing can alter the table definition whilst a transaction is open.
With some foresight and planning we can avoid these situations with non-default pt-osc options, namely –nodrop-new-table and –no-swap-tables. This combination leaves both the shadow table and the triggers inplace so that we can instigate an atomic RENAME when load permits.
EDIT: as of percona-toolkit version 2.2 we have a new variable –tries which in conjunction with –set-vars has been deployed to cover this scenario where various pt-osc operations could block waiting for a metadata lock. The default behaviour of pt-osc (–set-vars) is to set the following session variables when it connects to the server;
when using –tries we can granularly identify the operation, try count and the wait interval between tries. This combination will ensure that pt-osc will kill it’s own waiting session in good time to avoid the thread pileup and provide us with a loop to attempt to acquire our metadata lock for triggers|rename|fk management;
This illustrates that even with a tool like pt-online-schema-change it is important to understand the caveats presented with the solution you think is most adequate. To help decide the direction to take use the flow chart to ensure you’re taking into account some of the caveats of the MySQL schema change. Be sure to read up on the recommended outcome though as there are uncharted areas such as disk space, IO load that are not featured on the diagram.
Ensure you know what effect ALTER TABLE will have on your platform and pick the right method to suit your uptime. Sometimes that means delaying the change until a period of lighter use or utilising a tool that will avoid holding a table locked for the duration of the operation. A direct ALTER is sometimes the answer like when you have triggers installed on a table.
– In most cases pt-osc is exactly what we need
– In many cases pt-osc is needed but the way in which it’s used needs tweaking
– In few cases pt-osc isn’t the right tool/method and we need to consider native blocking ALTER or using failovers to juggle the change into place on all hosts in the replica cluster.
If you want to learn more about avoiding avoidable downtime please tune into my webinar Wednesday, November 19 at 10 a.m. PST. It’s titled “Tips from the Trenches: A Guide to Preventing Downtime for the Over-Extended DBA.” Register now! (If you miss it, don’t worry: Catch the recording and download the slides from that same registration page.)