Starting from Percona Toolkit 3.6.0, you can resume pt-online-schema-change if it was interrupted. This blog describes the prerequisites and usage of the new --resume option.

To restart the job, you need to know where it failed. This is why the first option you must use is --history. It instructs pt-online-schema-change to store its progress in the history table. The history table name is percona.pt_osc_history by default; you can override this name with the option --history-table. The history table structure is as follows:
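The expected structure, as pt-online-schema-change itself prints it when validating the table, is equivalent to the following DDL (schema-qualified here for clarity):

```sql
CREATE TABLE percona.pt_osc_history (
  job_id         INT NOT NULL AUTO_INCREMENT,
  db             CHAR(64) NOT NULL,
  tbl            CHAR(64) NOT NULL,
  new_table_name CHAR(64) DEFAULT NULL,
  altr           TEXT NOT NULL,
  args           TEXT NOT NULL,
  lower_boundary TEXT,
  upper_boundary TEXT,
  done           ENUM('yes','no') NOT NULL DEFAULT 'no',
  ts             TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (job_id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
```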

Where

  • job_id is the unique identifier of the pt-online-schema-change run
  • db is the name of the database
  • tbl is the name of the altered table
  • new_table_name is the name of the table copy that pt-online-schema-change creates to perform its job
  • altr is the ALTER TABLE command that you specify as the --alter argument
  • args are the other arguments passed to the pt-online-schema-change command
  • lower_boundary is the lower boundary of the most recently processed chunk
  • upper_boundary is the upper boundary of the most recently processed chunk
  • done indicates whether the job is finished
  • ts is the timestamp recorded when the job entry is updated. Practically, this is the timestamp when the latest chunk was processed.

If your table index includes binary columns, such as BLOB or BINARY, create the history table with the option --binary-index. In this case, lower_boundary and upper_boundary will be of the BLOB datatype instead of TEXT. The same idea is used for the --replicate table in pt-table-checksum.
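With --binary-index, only the boundary columns differ from the default structure; as a sketch, the history table then contains:

```sql
  -- boundary columns hold binary chunk boundaries instead of text
  lower_boundary BLOB,
  upper_boundary BLOB,
```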

Once you run pt-online-schema-change with the --history option, it will create a record with the job details. If the job fails, you can easily find its unique identifier either in the tool output or in the history table and restart it.

Here is an example of the percona.pt_osc_history table content after a few pt-online-schema-change runs:
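To inspect the job states yourself, a query along these lines shows the relevant fields (column list abbreviated for readability):

```sql
SELECT job_id, db, tbl, lower_boundary, upper_boundary, done, ts
FROM percona.pt_osc_history
ORDER BY job_id;
```

A finished job shows done = 'yes'; an interrupted one keeps done = 'no' along with the boundaries of the last processed chunk.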

In the output above,

  •  Job 1 finished successfully: its done column is yes.
  •  Jobs 2 and 3 did not even start: their boundary columns are NULL.
  •  Job 4 was interrupted in the middle: its chunk boundaries are recorded, but done is still no.

To restart the job, run pt-online-schema-change with the option --resume=ID_OF_YOUR_PREVIOUSLY_FAILED_JOB. The tool will check whether the previously created table copy and its triggers are still in place and will start the job from the chunk where it failed before.
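As a sketch, a resume run could look like the following; the connection options, database, table, and ALTER statement are placeholders and should match your original run:

```shell
pt-online-schema-change \
  --alter "ADD INDEX idx_pad (pad)" \
  --history \
  --no-drop-new-table --no-drop-triggers \
  --resume=4 \
  --execute \
  D=mydb,t=mytable
```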

You can use the option --history together with --resume, or you can skip it. Note that if you do not use --history, you will not be able to restart the job if it fails one more time.

Other prerequisites for the --resume option are --no-drop-new-table and --no-drop-triggers. If the new table and triggers are dropped, pt-online-schema-change cannot restart the job because no previously copied data is present.

Summary

If you want to resume pt-online-schema-change after a failure, use it with the options --history --no-drop-new-table --no-drop-triggers. If the job fails, you can resume it by providing the job ID as an argument to the option --resume. You will find the ID of the failed job in the pt-online-schema-change output and in the history table.

6 Comments
Dinh Man

I think it would be great if there was an option to allow copying in the reverse order, meaning copying data from the newest rows to the oldest.

lee

[ STEP 1 ]
pt-online-schema-change --print --progress time,5 \
--user=########## \
--port=##### \
--socket=/tmp/mysql_59306.sock \
--ask-pass \
D=sysbench,t=sbtest1 \
--alter "ADD INDEX test_idx2(pad)" \
--tries swap_tables:1:1,drop_triggers:1:1 \
--no-drop-old-table \
--no-drop-new-table \
--no-drop-triggers \
--chunk-size=500 \
--chunk-size-limit=600 \
--charset=UTF8 \
--alter-foreign-keys-method=auto \
--set-vars lock_wait_timeout=5 \
--critical-load threads_running=20 --max-load Threads_running=10 \
--history \
--execute

* Automatically generated percona.pt_osc_history table
[ STEP 1 LOG ]
~~ 
INSERT LOW_PRIORITY IGNORE INTO sysbench._sbtest1_new (id, k, c, pad) SELECT id, k, c, pad FROM sysbench.sbtest1 FORCE INDEX(PRIMARY) WHERE ((id >= ?)) AND ((id <= ?)) LOCK IN SHARE MODE /*pt-online-schema-change 3627724 copy nibble*/
SELECT /*!40001 SQL_NO_CACHE */ id FROM sysbench.sbtest1 FORCE INDEX(PRIMARY) WHERE ((id >= ?)) ORDER BY id LIMIT ?, 2 /*next chunk boundary*/
Copying sysbench.sbtest1:  0% 24:15 remain
(omitted)
Copying sysbench.sbtest1: 14% 08:39 remain
^C# Exiting on SIGINT.  
Not dropping triggers because the tool was interrupted. To drop the triggers, execute:
DROP TRIGGER IF EXISTS sysbench.pt_osc_sysbench_sbtest1_del;
DROP TRIGGER IF EXISTS sysbench.pt_osc_sysbench_sbtest1_upd;
DROP TRIGGER IF EXISTS sysbench.pt_osc_sysbench_sbtest1_ins;
Not dropping the new table sysbench._sbtest1_new because the tool was interrupted. To drop the new table, execute:
DROP TABLE IF EXISTS sysbench._sbtest1_new;
sysbench.sbtest1 was not altered.
History saved. Job id: 1

[ STEP 2 RESUME ]
pt-online-schema-change --print --progress time,5 \
--user=######### \
--port=##### \
--socket=/tmp/mysql_59306.sock \
--ask-pass \
D=sysbench,t=sbtest1 \
--alter "ADD INDEX test_idx2(pad)" \
--tries swap_tables:1:1,drop_triggers:1:1 \
--no-drop-old-table \
--no-drop-new-table \
--no-drop-triggers \
--chunk-size=500 \
--chunk-size-limit=600 \
--charset=UTF8 \
--alter-foreign-keys-method=auto \
--set-vars lock_wait_timeout=5 \
--critical-load threads_running=20 --max-load Threads_running=10 \
--history \
--resume=1 \
--execute

When I test like this and restart the job through resume after stopping it, I get the following failure.

[ STEP 2 ERROR ]
--history table pt_osc_history exists but does not have expected structure.
Explected structure:
 CREATE TABLE pt_osc_history (
   job_id     INT        NOT NULL AUTO_INCREMENT,
   db       CHAR(64)     NOT NULL,
   tbl      CHAR(64)     NOT NULL,
   new_table_name CHAR(64)     DEFAULT NULL,
   altr      TEXT       NOT NULL,
   args      TEXT       NOT NULL,
   lower_boundary TEXT,
   upper_boundary TEXT,
   done      ENUM('yes','no') NOT NULL DEFAULT 'no',
   ts       TIMESTAMP     NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
   PRIMARY KEY (job_id)
 ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4; 
User-defined table:
CREATE TABLE pt_osc_history (
 job_id int NOT NULL AUTO_INCREMENT,
 db char(64) COLLATE utf8mb4_general_ci NOT NULL,
 tbl char(64) COLLATE utf8mb4_general_ci NOT NULL,
 new_table_name char(64) COLLATE utf8mb4_general_ci DEFAULT NULL,
 altr text COLLATE utf8mb4_general_ci NOT NULL,
 args text COLLATE utf8mb4_general_ci NOT NULL,
 lower_boundary text COLLATE utf8mb4_general_ci,
 upper_boundary text COLLATE utf8mb4_general_ci,
 done enum('yes','no') COLLATE utf8mb4_general_ci NOT NULL DEFAULT 'no',
 ts timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
 PRIMARY KEY (job_id)
) ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci
Exiting to avoid damage of the user-defined table. at ./pt-online-schema-change line 10816.

lee

I used MySQL 8.0.28

Mohamed

Why does it need to re-alter the table if the statement didn't change?
I'm trying to make a dummy alter to reclaim the free space of a very large table, where the process takes more than a few days.
The tool fails once some DB cleanup tasks kick in. But trying to resume re-runs the "alter" statement after the table already has data, and it is unable to complete because the table is quite large and runs out of space to run the DDL.

I'm running this on Aurora MySQL, BTW.