Starting from Percona Toolkit 3.6.0, you can resume pt-online-schema-change if it was interrupted. This blog describes the prerequisites and usage of the new --resume option.

To restart the job, you need to know where it failed. This is why the first option you must use is --history. It instructs pt-online-schema-change to store its progress in the history table. The history table name is percona.pt_osc_history by default; you can override this name with the option --history-table. The history table structure is as follows:
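The expected structure, as pt-online-schema-change itself prints it when validating the table, is equivalent to the following DDL (schema-qualified here for clarity):

```sql
CREATE TABLE percona.pt_osc_history (
  job_id         INT NOT NULL AUTO_INCREMENT,
  db             CHAR(64) NOT NULL,
  tbl            CHAR(64) NOT NULL,
  new_table_name CHAR(64) DEFAULT NULL,
  altr           TEXT NOT NULL,
  args           TEXT NOT NULL,
  lower_boundary TEXT,
  upper_boundary TEXT,
  done           ENUM('yes','no') NOT NULL DEFAULT 'no',
  ts             TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (job_id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
```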

Where

  • job_id is the unique identifier of the pt-online-schema-change run
  • db is the name of the database
  • tbl is the name of the altered table
  • new_table_name is the name of the table copy that pt-online-schema-change creates to perform its job
  • altr is the ALTER TABLE command that you specify as the --alter argument
  • args are the other arguments passed to the pt-online-schema-change command
  • lower_boundary is the lower boundary of the most recently processed chunk
  • upper_boundary is the upper boundary of the most recently processed chunk
  • done indicates whether the job is finished
  • ts is the timestamp recorded when the job entry is updated. Practically, this is the timestamp when the latest chunk was processed.

If your table index includes binary columns, such as BLOB or BINARY, create the history table with the option --binary-index. In this case, lower_boundary and upper_boundary will be of the BLOB datatype instead of TEXT. The same idea is used for the --replicate table in pt-table-checksum.
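With --binary-index, only the boundary columns differ from the default structure; as a sketch, the history table then contains:

```sql
  -- boundary columns hold binary chunk boundaries instead of text
  lower_boundary BLOB,
  upper_boundary BLOB,
```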

Once you run pt-online-schema-change with the --history option, it will create a record with the job details. If the job fails, you can easily find its unique identifier either in the tool output or in the history table and restart it.

Here is an example of the percona.pt_osc_history table content after a few pt-online-schema-change runs:
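To inspect the job states yourself, a query along these lines shows the relevant fields (column list abbreviated for readability):

```sql
SELECT job_id, db, tbl, lower_boundary, upper_boundary, done, ts
FROM percona.pt_osc_history
ORDER BY job_id;
```

A finished job shows done = 'yes'; an interrupted one keeps done = 'no' along with the boundaries of the last processed chunk.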

In the output above,

  •  Job 1 finished successfully: its done column is yes.
  •  Jobs 2 and 3 did not even start: their boundary columns are NULL.
  •  Job 4 was interrupted in the middle: its chunk boundaries are recorded, but done is still no.

To restart the job, run pt-online-schema-change with the option --resume=ID_OF_YOUR_PREVIOUSLY_FAILED_JOB. The tool will check whether the previously created table copy and its triggers are still in place and will start the job from the chunk where it failed before.
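As a sketch, a resume run could look like the following; the connection options, database, table, and ALTER statement are placeholders and should match your original run:

```shell
pt-online-schema-change \
  --alter "ADD INDEX idx_pad (pad)" \
  --history \
  --no-drop-new-table --no-drop-triggers \
  --resume=4 \
  --execute \
  D=mydb,t=mytable
```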

You can use the option --history together with --resume, or you can skip it. Note that if you do not use --history, you will not be able to restart the job if it fails one more time.

Other prerequisites for the --resume option are --no-drop-new-table and --no-drop-triggers. If the new table and triggers are dropped, pt-online-schema-change cannot restart the job because no previously copied data is present.

Summary

If you want to resume pt-online-schema-change after a failure, use it with the options --history --no-drop-new-table --no-drop-triggers. If the job fails, you can resume it by providing the job ID as an argument to the option --resume. You will find the ID of the failed job in the pt-online-schema-change output and in the history table.

6 Comments
Dinh Man

I think it would be great if there was an option to allow copying in the reverse order, meaning copying data from the newest rows to the oldest.

lee

[ STEP 1 ]
pt-online-schema-change --print --progress time,5 \
--user=########## \
--port=##### \
--socket=/tmp/mysql_59306.sock \
--ask-pass \
D=sysbench,t=sbtest1 \
--alter "ADD INDEX test_idx2(pad)" \
--tries swap_tables:1:1,drop_triggers:1:1 \
--no-drop-old-table \
--no-drop-new-table \
--no-drop-triggers \
--chunk-size=500 \
--chunk-size-limit=600 \
--charset=UTF8 \
--alter-foreign-keys-method=auto \
--set-vars lock_wait_timeout=5 \
--critical-load threads_running=20 --max-load Threads_running=10 \
--history \
--execute

* Automatically generated percona.pt_osc_history table
[ STEP 1 LOG ]
~~ 
INSERT LOW_PRIORITY IGNORE INTO sysbench._sbtest1_new (id, k, c, pad) SELECT id, k, c, pad FROM sysbench.sbtest1 FORCE INDEX(PRIMARY) WHERE ((id >= ?)) AND ((id <= ?)) LOCK IN SHARE MODE /*pt-online-schema-change 3627724 copy nibble*/
SELECT /*!40001 SQL_NO_CACHE */ id FROM sysbench.sbtest1 FORCE INDEX(PRIMARY) WHERE ((id >= ?)) ORDER BY id LIMIT ?, 2 /*next chunk boundary*/
Copying sysbench.sbtest1:  0% 24:15 remain
(omitted)
Copying sysbench.sbtest1: 14% 08:39 remain
^C# Exiting on SIGINT.  
Not dropping triggers because the tool was interrupted. To drop the triggers, execute:
DROP TRIGGER IF EXISTS sysbench.pt_osc_sysbench_sbtest1_del;
DROP TRIGGER IF EXISTS sysbench.pt_osc_sysbench_sbtest1_upd;
DROP TRIGGER IF EXISTS sysbench.pt_osc_sysbench_sbtest1_ins;
Not dropping the new table sysbench._sbtest1_new because the tool was interrupted. To drop the new table, execute:
DROP TABLE IF EXISTS sysbench._sbtest1_new;
sysbench.sbtest1 was not altered.
History saved. Job id: 1

[ STEP 2 RESUME ]
pt-online-schema-change --print --progress time,5 \
--user=######### \
--port=##### \
--socket=/tmp/mysql_59306.sock \
--ask-pass \
D=sysbench,t=sbtest1 \
--alter "ADD INDEX test_idx2(pad)" \
--tries swap_tables:1:1,drop_triggers:1:1 \
--no-drop-old-table \
--no-drop-new-table \
--no-drop-triggers \
--chunk-size=500 \
--chunk-size-limit=600 \
--charset=UTF8 \
--alter-foreign-keys-method=auto \
--set-vars lock_wait_timeout=5 \
--critical-load threads_running=20 --max-load Threads_running=10 \
--history \
--resume=1 \
--execute

When I test like this and restart the job through resume after stopping it, I get the following failure.

[ STEP 2 ERROR ]
--history table pt_osc_history exists but does not have expected structure.
Explected structure:
 CREATE TABLE pt_osc_history (
   job_id     INT        NOT NULL AUTO_INCREMENT,
   db       CHAR(64)     NOT NULL,
   tbl      CHAR(64)     NOT NULL,
   new_table_name CHAR(64)     DEFAULT NULL,
   altr      TEXT       NOT NULL,
   args      TEXT       NOT NULL,
   lower_boundary TEXT,
   upper_boundary TEXT,
   done      ENUM('yes','no') NOT NULL DEFAULT 'no',
   ts       TIMESTAMP     NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
   PRIMARY KEY (job_id)
 ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4; 
User-defined table:
CREATE TABLE pt_osc_history (
 job_id int NOT NULL AUTO_INCREMENT,
 db char(64) COLLATE utf8mb4_general_ci NOT NULL,
 tbl char(64) COLLATE utf8mb4_general_ci NOT NULL,
 new_table_name char(64) COLLATE utf8mb4_general_ci DEFAULT NULL,
 altr text COLLATE utf8mb4_general_ci NOT NULL,
 args text COLLATE utf8mb4_general_ci NOT NULL,
 lower_boundary text COLLATE utf8mb4_general_ci,
 upper_boundary text COLLATE utf8mb4_general_ci,
 done enum('yes','no') COLLATE utf8mb4_general_ci NOT NULL DEFAULT 'no',
 ts timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
 PRIMARY KEY (job_id)
) ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci
Exiting to avoid damage of the user-defined table. at ./pt-online-schema-change line 10816.

lee

I used MySQL 8.0.28

Mohamed

Why does it need to re-alter the table if the statement didn't change?
I'm trying to make a dummy alter to reclaim the free space of a very large table, where the process takes more than a few days.
The tool fails once some DB cleanup tasks kick in. But trying to resume re-runs the "alter" statement after the table already has data, and it is unable to complete because the table is quite large and runs out of space to run the DDL.

I'm running this on Aurora MySQL, BTW.