EmergencyEMERGENCY? Get 24/7 Help Now!

Avoiding MySQL ALTER table downtime

 | November 18, 2014 |  Posted In: Insight for DBAs, MySQL, Percona Managed Services, Percona Remote DBA

PREVIOUS POST
NEXT POST

MySQL table alterations can interrupt production traffic causing bad customer experience or in worst cases, loss of revenue. Not all DBAs, developers, syadmins know MySQL well enough to avoid this pitfall. DBAs usually encounter these kinds of production interruptions when working with upgrade scripts that touch both application and database or if an inexperienced admin/dev engineer perform the schema change without knowing how MySQL operates internally.

Truths
* Direct MySQL ALTER table locks for duration of change (pre-5.6)
* Online DDL in MySQL 5.6 is not always online and may incurr locks
* Even with Percona Toolkit‘s pt-online-schema-change there are several workloads that can experience blocking

Here on the Percona MySQL Managed Services team we encourage our clients to work with us when planning and performing schema migrations. We aim to ensure that we are using the best method available in their given circumstance. Our intentions to avoid blocking when performing DDL on large tables ensures that business can continue as usual whilst we strive to improve response time or add application functionality. The bottom line is that a business relying on access to its data cannot afford to be down during core trading hours.

Many of the installations we manage are still below MySQL 5.6, which requires us to seek workarounds to minimize the amount of disruption a migration can cause. This may entail slave promotion or changing the schema with an ‘online schema change’ tool. MySQL version 5.6 looks to address this issue by reducing the number of scenarios where a table is rebuilt and locked but it doesn’t yet cover all eventualities, for example when changing the data type of a column a full table rebuild is necessary. The topic of 5.6 Online Schema Change was discussed in great detail last year in the post, “Schema changes – what’s new in MySQL 5.6?” by Przemysław Malkowski

With new functionality arriving in MySQL 5.7, we look forward to non-blocking DDL operations such as; OPTIMIZE TABLE and RENAME INDEX. (More info)

The best advice for MySQL 5.6 users is to review the matrix to familiarize with situations where it might be best to look outside of MySQL to perform schema changes, the good news is that we’re on the right path to solving this natively.

Truth be told, a blocking alter is usually going to go unnoticed on a 30MB table and we tend to use a direct alter in this situation, but on a 30GB or 300GB table we have some planning to do. If there is a period of time where activity is low and the this is permissive of locking the table then sometimes it is better execute within this window. Frequently though we are reactive to new SQL statements or a new performance issue and an emergency index is required to reduce load on the master in order to improve the response time.

To pt-osc or not to pt-osc?

As mentioned, pt-online-schema-change is a fixture in our workflow. It’s usually the right way to go but we still have occasions where pt-online-schema-change cannot be used, for example; when a table already uses triggers. It’s an important to remind ourselves of the the steps that pt-online-schema-change traverses to complete it’s job. Lets look at the source code to identify these;

I pick out steps 3 and 5 from above to highlight a source of a source of potential downtime due to locks, but step 6 is also an area for concern since foreign keys can have nested actions and should be considered when planning these actions to avoid related tables from being rebuilt with a direct alter implicitly. There are several ways to approach a table with referential integrity constraints and they are detailed within the pt-osc documentation a good preparation step is to review the structure of your table including the constraints and how the ripples of the change can affect the tables around it.

Recently we were alerted to an incident after a client with a highly concurrent and highly transactional workload ran a standard pt-online-schema-change script over a large table. This appeared normal to them and a few hours later our pager man was notified that this client was experiencing max_connections limit reached. So what was going on? When pt-online-schema-change reached step 5 it tried to acquire a metadata lock to rename the the original and the shadow table, however this wasn’t immediately granted due to open transactions and thus threads began to queue behind the RENAME command. The actual effect this had on the client’s application was downtime. No new connections could be made and all existing threads were waiting behind the RENAME command.

Metadata locks
Introduced in 5.5.3 at server level. When a transaction starts it will acquire a metadata lock (independent of storage engine) on all tables it uses and then releases them when it’s finished it’s work. This ensures that nothing can alter the table definition whilst a transaction is open.

With some foresight and planning we can avoid these situations with non-default pt-osc options, namely –nodrop-new-table and –no-swap-tables. This combination leaves both the shadow table and the triggers inplace so that we can instigate an atomic RENAME when load permits.

EDIT: as of percona-toolkit version 2.2 we have a new variable –tries which in conjunction with –set-vars has been deployed to cover this scenario where various pt-osc operations could block waiting for a metadata lock. The default behaviour of pt-osc (–set-vars) is to set the following session variables when it connects to the server;

wait_timeout=10000
innodb_lock_wait_timeout=1
lock_wait_timeout=60

when using –tries we can granularly identify the operation, try count and the wait interval between tries. This combination will ensure that pt-osc will kill it’s own waiting session in good time to avoid the thread pileup and provide us with a loop to attempt to acquire our metadata lock for triggers|rename|fk management;

–tries swap_tables:5:0.5,drop_triggers:5:0.5

The documentation is here https://www.percona.com/doc/percona-toolkit/2.2/pt-online-schema-change.html#cmdoption-pt-online-schema-change–tries

This illustrates that even with a tool like pt-online-schema-change it is important to understand the caveats presented with the solution you think is most adequate. To help decide the direction to take use the flow chart to ensure you’re taking into account some of the caveats of the MySQL schema change. Be sure to read up on the recommended outcome though as there are uncharted areas such as disk space, IO load that are not featured on the diagram.

DDL Decision chart
Choosing the right DDL option

Ensure you know what effect ALTER TABLE will have on your platform and pick the right method to suit your uptime. Sometimes that means delaying the change until a period of lighter use or utilising a tool that will avoid holding a table locked for the duration of the operation. A direct ALTER is sometimes the answer like when you have triggers installed on a table.

– In most cases pt-osc is exactly what we need
– In many cases pt-osc is needed but the way in which it’s used needs tweaking
– In few cases pt-osc isn’t the right tool/method and we need to consider native blocking ALTER or using failovers to juggle the change into place on all hosts in the replica cluster.

If you want to learn more about avoiding avoidable downtime please tune into my webinar Wednesday, November 19 at 10 a.m. PST. It’s titled “Tips from the Trenches: A Guide to Preventing Downtime for the Over-Extended DBA.” Register now! (If you miss it, don’t worry: Catch the recording and download the slides from that same registration page.)

PREVIOUS POST
NEXT POST
Andrew Moore

Since fall 2013, Andrew has been working within Percona's Remote DBA team plying his experience to the client's environments and internal tools developed to keep operations slick. He lives in the UK with his young family and loves to complain about the less than perfect climate. Andrew makes time to pursue an amateur soccer career but won't be trading in MySQL any time soon.

9 Comments

  • Andrew,

    I wonder if pt-osc can be modified to have short timeout for meta-data lock so it can ie retry with 1 sec meta data lock timeout many times on the table rename so essentially only rename the table when system is substantially idle enough ?

  • Peter,

    you raise a good point and this reminded me of a bug ticket I read some time ago, so I went to review the bugs db and of course, the percona-toolkit team were all over this issue as you can expect.

    https://bugs.launchpad.net/percona-toolkit/+bug/1113301

    I provided the edits in the post.

  • Andrew,

    Speaking of the bugs db, I submitted this one a while back with a suggested patch to allow pt-osc to work on tables with BEFORE triggers:

    https://bugs.launchpad.net/percona-toolkit/+bug/1270328

  • Hey Ike,

    that makes sense I’m not sure why that might pose problems, I don’t know that code well enough to comment on it’s safety. On a related note and one for the future; 5.7 is set to allow more than one of each type of trigger per table and thus potentially the possibility that pt-osc will work on trigger populated tables. Seems like 5.7 is going to be one hell of a release!

  • Andrew,

    There is another thing which MySQL 5.7 would need to fix to put pt-osc to rest it is some form of throttling. When you have operation going online sometimes you need it to go slower to ensure it does not monopolize system resources.

  • Hi Andrew,

    Usually what we do is to lock the updates at the application level and then issue the “Alter Table” command.

    However, we did have an issue with a table that had millions of rows (it was an OpenX table) and that needed to be altered. I remember we eventually created another table, copied the data over, and then flipped the tables (using a transaction). The client accepted the fact that they had to lose some impressions.

  • Just for info, with 5.6, if you are not sure if your ALTER will be blocking or not, add “LOCK = NONE” to it. If it requires to lock the table, the statement will return an error message and won’t be performed. So you really know if this is a good candidate for pt-osc or not 😉

Leave a Reply