Downloads

Blog

Avoiding MySQL ALTER table downtime

November 18, 2014

Author

Share this Post:

MySQL table alterations can interrupt production traffic causing bad customer experience or in worst cases, loss of revenue. Not all DBAs, developers, and syadmins know MySQL well enough to avoid this pitfall. DBAs usually encounter these kinds of production interruptions when working with upgrade scripts that touch both application and database or if an inexperienced admin/dev engineer perform the schema change without knowing how MySQL operates internally.

Truths
* Direct MySQL ALTER table locks for the duration of change (pre-5.6)
* Online DDL in MySQL 5.6 is not always online and may incur locks
* Even with Percona Toolkit‘s pt-online-schema-change there are several workloads that can experience blocking

Here on the Percona MySQL Managed Services team, we encourage our clients to work with us when planning and performing schema migrations. We aim to ensure that we are using the best method available in their given circumstance. Our intentions to avoid blocking when performing DDL on large tables ensures that business can continue as usual whilst we strive to improve response time or add application functionality. The bottom line is that a business relying on access to its data cannot afford to be down during core trading hours.

Many of the installations we manage are still below MySQL 5.6, which requires us to seek workarounds to minimize the amount of disruption a migration can cause. This may entail slave promotion or changing the schema with an ‘online schema change’ tool. MySQL version 5.6 looks to address this issue by reducing the number of scenarios where a table is rebuilt and locked but it doesn’t yet cover all eventualities, for example when changing the data type of a column a full table rebuild is necessary. The topic of 5.6 Online Schema Change was discussed in great detail last year in the post, “Schema changes – what’s new in MySQL 5.6?” by Przemysław Malkowski

With new functionality arriving in MySQL 5.7, we look forward to non-blocking DDL operations such as; OPTIMIZE TABLE and RENAME INDEX. (More info)

The best advice for MySQL 5.6 users is to review the matrix to familiarize with situations where it might be best to look outside of MySQL to perform schema changes, the good news is that we’re on the right path to solving this natively.

Truth be told, a blocking alter is usually going to go unnoticed on a 30MB table and we tend to use a direct alter in this situation, but on a 30GB or 300GB table, we have some planning to do. If there is a period of time where the activity is low and the this is permissive of locking the table then sometimes it is better to execute within this window. Frequently though we are reactive to new SQL statements or a new performance issue and an emergency index is required to reduce the load on the master in order to improve the response time.

To pt-osc or not to pt-osc?

As mentioned, pt-online-schema-change is a fixture in our workflow. It’s usually the right way to go but we still have occasions where pt-online-schema-change cannot be used, for example; when a table already uses triggers. It’s important to remind ourselves of the steps that pt-online-schema-change traverses to complete its job. Let’s look at the source code to identify these;

[moore@localhost]$ egrep 'Step' pt-online-schema-change
# Step 1: Create the new table.
# Step 2: Alter the new, empty table. This should be very quick,
# Step 3: Create the triggers to capture changes on the original table and <--(metadata lock)
# Step 4: Copy rows.
# Step 5: Rename tables: orig -> old, new -> orig <--(metadata lock)
# Step 6: Update foreign key constraints if there are child tables.
# Step 7: Drop the old table.

[moore@localhost]$ egrep 'Step' pt-online-schema-change

# Step 1: Create the new table.

# Step 2: Alter the new, empty table. This should be very quick,

# Step 3: Create the triggers to capture changes on the original table and <--(metadata lock)

# Step 4: Copy rows.

# Step 5: Rename tables: orig -> old, new -> orig <--(metadata lock)

# Step 6: Update foreign key constraints if there are child tables.

# Step 7: Drop the old table.

I pick out steps 3 and 5 from above to highlight a source of potential downtime due to locks, but step 6 is also an area for concern since foreign keys can have nested actions and should be considered when planning these actions to avoid related tables from being rebuilt with a direct alter implicitly. There are several ways to approach a table with referential integrity constraints and they are detailed within the pt-osc documentation a good preparation step is to review the structure of your table including the constraints and how the ripples of the change can affect the tables around it.

Recently we were alerted to an incident after a client with a highly concurrent and highly transactional workload ran a standard pt-online-schema-change script over a large table. This appeared normal to them and a few hours later our pager man was notified that this client was experiencing max_connections limit reached. So what was going on? When pt-online-schema-change reached step 5 it tried to acquire a metadata lock to rename the original and the shadow table, however, this wasn’t immediately granted due to open transactions and thus threads began to queue behind the RENAME command. The actual effect this had on the client’s application was downtime. No new connections could be made and all existing threads were waiting behind the RENAME command.

Metadata locks
Introduced in 5.5.3 at server level. When a transaction starts it will acquire a metadata lock (independent of storage engine) on all tables it uses and then releases them when it’s finished it’s work. This ensures that nothing can alter the table definition whilst a transaction is open.

With some foresight and planning, we can avoid these situations with non-default pt-osc options, namely –nodrop-new-table and –no-swap-tables. This combination leaves both the shadow table and the triggers in place so that we can instigate an atomic RENAME when load permits.

EDIT: as of percona-toolkit version 2.2 we have a new variable –tries which in conjunction with –set-vars has been deployed to cover this scenario where various pt-osc operations could block waiting for a metadata lock. The default behavior of pt-osc (–set-vars) is to set the following session variables when it connects to the server;

wait_timeout=10000
innodb_lock_wait_timeout=1
lock_wait_timeout=60

when using –tries we can granularly identify the operation, try count and the wait interval between tries. This combination will ensure that pt-osc will kill it’s own waiting session in good time to avoid the thread pileup and provide us with a loop to attempt to acquire our metadata lock for triggers|rename|fk management;

–tries swap_tables:5:0.5,drop_triggers:5:0.5

The documentation is here https://www.percona.com/doc/percona-toolkit/2.2/pt-online-schema-change.html#cmdoption-pt-online-schema-change–tries

This illustrates that even with a tool like pt-online-schema-change it is important to understand the caveats presented with the solution you think is most adequate. To help decide the direction to take use of the flow chart to ensure you’re taking into account some of the caveats of the MySQL schema change. Be sure to read up on the recommended outcome though as there are uncharted areas such as disk space, IO load that is not featured on the diagram.

DDL Decision chart — Choosing the right DDL option

Ensure you know what effect the ALTER TABLE will have on your platform and pick the right method to suit your uptime. Sometimes that means delaying the change until a period of lighter use or utilizing a tool that will avoid holding a table locked for the duration of the operation. A direct ALTER is sometimes the answer like when you have triggers installed on a table.

– In most cases pt-osc is exactly what we need
– In many cases pt-osc is needed but the way in which it’s used needs tweaking
– In few cases pt-osc isn’t the right tool/method and we need to consider native blocking ALTER or using failovers to juggle the change into place on all hosts in the replica cluster.

If you want to learn more about avoiding avoidable downtime please tune into my webinar Wednesday, November 19 at 10 a.m. PST. It’s titled “Tips from the Trenches: A Guide to Preventing Downtime for the Over-Extended DBA.” Register now! (If you miss it, don’t worry: Catch the recording and download the slides from that same registration page.)

0 0 votes

Article Rating

9 Comments

Oldest

Newest Most Voted

Admin

Peter Zaitsev

11 years ago

Andrew,

I wonder if pt-osc can be modified to have short timeout for meta-data lock so it can ie retry with 1 sec meta data lock timeout many times on the table rename so essentially only rename the table when system is substantially idle enough ?

Author

Andrew Moore

11 years ago

Peter,

you raise a good point and this reminded me of a bug ticket I read some time ago, so I went to review the bugs db and of course, the percona-toolkit team were all over this issue as you can expect.

https://bugs.launchpad.net/percona-toolkit/+bug/1113301

I provided the edits in the post.

Ike Walker

11 years ago

Andrew,

Speaking of the bugs db, I submitted this one a while back with a suggested patch to allow pt-osc to work on tables with BEFORE triggers:

https://bugs.launchpad.net/percona-toolkit/+bug/1270328

Author

Andrew Moore

11 years ago

Hey Ike,

that makes sense I’m not sure why that might pose problems, I don’t know that code well enough to comment on it’s safety. On a related note and one for the future; 5.7 is set to allow more than one of each type of trigger per table and thus potentially the possibility that pt-osc will work on trigger populated tables. Seems like 5.7 is going to be one hell of a release!

Author

Andrew Moore

11 years ago

Not forgetting though that the use case for pt-osc should also shrink at 5.7 with further tweaks to online DDL.

Admin

Peter Zaitsev

11 years ago

Andrew,

There is another thing which MySQL 5.7 would need to fix to put pt-osc to rest it is some form of throttling. When you have operation going online sometimes you need it to go slower to ensure it does not monopolize system resources.

Fadi El-Eter (itoctopus)

11 years ago

Hi Andrew,

Usually what we do is to lock the updates at the application level and then issue the “Alter Table” command.

However, we did have an issue with a table that had millions of rows (it was an OpenX table) and that needed to be altered. I remember we eventually created another table, copied the data over, and then flipped the tables (using a transaction). The client accepted the fact that they had to lose some impressions.

Frederic Descamps

11 years ago

Just for info, with 5.6, if you are not sure if your ALTER will be blocking or not, add “LOCK = NONE” to it. If it requires to lock the table, the statement will return an error message and won’t be performed. So you really know if this is a good candidate for pt-osc or not 😉

Mahesh Patil

9 years ago

How will be the impact on 1Tb size of table ? Is it recommend to use pt-osc . We have slave as well and we do not want to block reads on slave. Please suggest ?