Starting from Percona Toolkit 3.6.0, you can resume pt-online-schema-change if it was interrupted. This blog describes the prerequisites and usage of the new --resume option.
To restart a job, you need to know where it failed. This is why the first option you must use is --history. It instructs pt-online-schema-change to store its progress in a history table, named percona.pt_osc_history by default; you can override this name with the option --history-table (a sample invocation follows the column descriptions below). The history table structure is as follows:
CREATE TABLE pt_osc_history (
  job_id INT NOT NULL AUTO_INCREMENT,
  db CHAR(64) NOT NULL,
  tbl CHAR(64) NOT NULL,
  new_table_name CHAR(64) DEFAULT NULL,
  altr TEXT NOT NULL,
  args TEXT NOT NULL,
  lower_boundary TEXT,
  upper_boundary TEXT,
  done ENUM('yes','no') NOT NULL DEFAULT 'no',
  ts TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (job_id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
Where
- job_id is the unique identifier of the pt-online-schema-change run
- db is the name of the database
- tbl is the name of the altered table
- new_table_name is the name of the table copy that pt-online-schema-change creates to perform its job
- altr is the ALTER TABLE command that you specify as the --alter argument
- args are the other arguments passed to the pt-online-schema-change command
- lower_boundary is the lower boundary of the most recently processed chunk
- upper_boundary is the upper boundary of the most recently processed chunk
- done indicates whether the job has finished
- ts is the timestamp recorded when the job entry is updated. In practice, this is the time when the latest chunk was processed.
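For illustration, here is a minimal sketch of a first run that records its progress, using the same sandbox DSN as the examples later in this post. The custom history table name is just an example, assuming --history-table accepts a db.table value in the same format as the default percona.pt_osc_history:

pt-online-schema-change h=127.0.0.1,P=12345,u=msandbox,p=msandbox,D=test,t=pt1717 \
  --alter 'engine=INNODB' \
  --history \
  --history-table=percona.my_osc_history \
  --execute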
If your table's index includes binary columns, such as BLOB or BINARY, create the history table with the option --binary-index. In this case, lower_boundary and upper_boundary will use the BLOB data type instead of TEXT. The same idea is used for the --replicate table in pt-table-checksum.
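For reference, here is a sketch of what the history table definition would look like in that case, derived from the default structure above by switching the boundary columns to BLOB:

CREATE TABLE pt_osc_history (
  job_id INT NOT NULL AUTO_INCREMENT,
  db CHAR(64) NOT NULL,
  tbl CHAR(64) NOT NULL,
  new_table_name CHAR(64) DEFAULT NULL,
  altr TEXT NOT NULL,
  args TEXT NOT NULL,
  lower_boundary BLOB, -- BLOB instead of TEXT, to store binary boundary values
  upper_boundary BLOB, -- BLOB instead of TEXT, to store binary boundary values
  done ENUM('yes','no') NOT NULL DEFAULT 'no',
  ts TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (job_id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;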
Once you run pt-online-schema-change with the --history option, it will create a record with the job details. If the job fails, you can easily find its unique identifier either in the tool output or in the history table and restart it.
Here is an example of the percona.pt_osc_history table content after a few pt-online-schema-change runs:
mysql> select * from percona.pt_osc_history\G
*************************** 1. row ***************************
        job_id: 1
            db: test
           tbl: pt1717
new_table_name: __pt1717_new
          altr: engine=INNODB
          args: {"history":1,"execute":1,"progress":["time","5"],"drop-new-table":0,"max-lag":"10","chunk-size":"10","alter":"engine=INNODB","drop-triggers":0}
lower_boundary: 9991
upper_boundary: 10000
          done: yes
            ts: 2024-06-05 13:32:09
*************************** 2. row ***************************
        job_id: 2
            db: test
           tbl: pt1717
new_table_name: __pt1717_new
          altr: engine=INNODB
          args: {"alter":"engine=INNODB","max-lag":"10","drop-new-table":0,"drop-triggers":0,"progress":["time","5"],"history":1,"chunk-size":"10","execute":1}
lower_boundary: NULL
upper_boundary: NULL
          done: no
            ts: 2024-06-05 13:32:09
*************************** 3. row ***************************
        job_id: 3
            db: test
           tbl: pt1717
new_table_name: __pt1717_new
          altr: engine=INNODB
          args: {"progress":["time","5"],"execute":1,"chunk-size":"10","history":1,"drop-new-table":0,"alter":"engine=INNODB","max-lag":"10","drop-triggers":0}
lower_boundary: NULL
upper_boundary: NULL
          done: no
            ts: 2024-06-05 13:32:09
*************************** 4. row ***************************
        job_id: 4
            db: test
           tbl: pt1717
new_table_name: __pt1717_new
          altr: engine=INNODB
          args: {"drop-triggers":0,"alter":"engine=INNODB","progress":["time","5"],"history":1,"chunk-size":"10","drop-new-table":0,"max-lag":"10","execute":1}
lower_boundary: 4901
upper_boundary: 4910
          done: no
            ts: 2024-06-05 13:32:12
4 rows in set (0,00 sec)
In the output above,

- Job 1 finished successfully:
  done: yes
- Jobs 2 and 3 did not even start:
  lower_boundary: NULL
  upper_boundary: NULL
  done: no
- Job 4 was interrupted in the middle:
  lower_boundary: 4901
  upper_boundary: 4910
  done: no
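Based on these columns, a quick way to list the jobs that are candidates for resuming is to query the history table for unfinished entries (a sketch; adjust the table name if you used --history-table):

-- List unfinished pt-online-schema-change jobs, newest first
SELECT job_id, db, tbl, lower_boundary, upper_boundary, ts
FROM percona.pt_osc_history
WHERE done = 'no'
ORDER BY ts DESC;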
To restart the job, run pt-online-schema-change with the option --resume=ID_OF_YOUR_PREVIOUSLY_FAILED_JOB. The tool will check if the previously created table copy and its triggers are still in place and will start the job from the chunk where it failed before.
$ ./bin/pt-online-schema-change h=127.0.0.1,P=12345,u=msandbox,p=msandbox,D=test,t=pt1717 --execute --chunk-size 10 --max-lag 10 --alter 'engine=INNODB' --progress time,5 --no-drop-new-table --no-drop-triggers --history --resume=4
2024-06-05T13:54:11 Job 5 started.
Found 2 slaves:
s76 -> 127.0.0.1:12346
s76 -> 127.0.0.1:12347
Will check slave lag on:
s76 -> 127.0.0.1:12346
s76 -> 127.0.0.1:12347
Operation, tries, wait:
  analyze_table, 10, 1
  copy_rows, 10, 0.25
  create_triggers, 10, 1
  drop_triggers, 10, 1
  swap_tables, 10, 1
  update_foreign_keys, 10, 1
Altering `test`.`pt1717`...
Altering new table...
Altered `test`.`__pt1717_new` OK.
2024-06-05T13:54:11 Creating triggers...
2024-06-05T13:54:11 Created triggers OK.
2024-06-05T13:54:11 Copying approximately 9956 rows...
2024-06-05T13:54:14 Job 5 finished successfully.
2024-06-05T13:54:14 Copied rows OK.
2024-06-05T13:54:14 Analyzing new table...
2024-06-05T13:54:14 Swapping tables...
2024-06-05T13:54:14 Swapped original and new tables OK.
Not dropping old table because --no-drop-triggers was specified.
Not dropping triggers because --no-drop-triggers was specified.
To drop the triggers, execute:
DROP TRIGGER IF EXISTS `test`.`pt_osc_test_pt1717_del`
DROP TRIGGER IF EXISTS `test`.`pt_osc_test_pt1717_upd`
DROP TRIGGER IF EXISTS `test`.`pt_osc_test_pt1717_ins`
Successfully altered `test`.`pt1717`.
You can use the option --history together with --resume, or you can skip it. Note that if you do not use --history, you won't be able to restart the job if it fails again.
Other prerequisites for the option --resume are --no-drop-new-table and --no-drop-triggers. If the new table and its triggers are dropped, pt-online-schema-change cannot restart the job because the previously copied data is no longer present.
$ ./bin/pt-online-schema-change h=127.0.0.1,P=12345,u=msandbox,p=msandbox,D=test,t=pt1717 --execute --chunk-size 10 --max-lag 10 --alter 'engine=INNODB' --progress time,5 --history --resume=9
2024-06-05T13:58:10 Job 10 started.
Found 1 slaves:
s76 -> 127.0.0.1:12346
Will check slave lag on:
s76 -> 127.0.0.1:12346
Operation, tries, wait:
  analyze_table, 10, 1
  copy_rows, 10, 0.25
  create_triggers, 10, 1
  drop_triggers, 10, 1
  swap_tables, 10, 1
  update_foreign_keys, 10, 1
Altering `test`.`pt1717`...
New table `test`.`___pt1717_new` not found, restart operation from scratch
`test`.`pt1717` was not altered.
History saved. Job id: 10
Summary
If you want to resume pt-online-schema-change after a failure, run it with the options --history --no-drop-new-table --no-drop-triggers. If the job fails, you can resume it by providing the job ID as an argument to the --resume option. You will find the ID of the failed job in the pt-online-schema-change output and in the history table.
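Putting it together, here is a minimal sketch of the whole workflow, using the same DSN and ALTER as the examples above (N stands for the job ID reported after the failure):

# First run: record progress and keep the new table and triggers on failure
pt-online-schema-change h=127.0.0.1,P=12345,u=msandbox,p=msandbox,D=test,t=pt1717 \
  --alter 'engine=INNODB' \
  --history --no-drop-new-table --no-drop-triggers \
  --execute

# If it fails with "History saved. Job id: N", resume from the last processed chunk
pt-online-schema-change h=127.0.0.1,P=12345,u=msandbox,p=msandbox,D=test,t=pt1717 \
  --alter 'engine=INNODB' \
  --history --no-drop-new-table --no-drop-triggers \
  --resume=N \
  --execute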
Comments

I think it would be great if there was an option to allow copying in the reverse order, meaning copying data from the newest to the oldest.
[ STEP 1 ]
pt-online-schema-change --print --progress time,5 \
--user=########## \
--port=##### \
--socket=/tmp/mysql_59306.sock \
--ask-pass \
D=sysbench,t=sbtest1 \
--alter "ADD INDEX test_idx2(pad)" \
--tries swap_tables:1:1,drop_triggers:1:1 \
--no-drop-old-table \
--no-drop-new-table \
--no-drop-triggers \
--chunk-size=500 \
--chunk-size-limit=600 \
--charset=UTF8 \
--alter-foreign-keys-method=auto \
--set-vars lock_wait_timeout=5 \
--critical-load threads_running=20 --max-load Threads_running=10 \
--history \
--execute
* Automatically generated percona.pt_osc_history table
[ STEP 1 LOG ]
INSERT LOW_PRIORITY IGNORE INTO `sysbench`.`_sbtest1_new` (`id`, `k`, `c`, `pad`) SELECT `id`, `k`, `c`, `pad` FROM `sysbench`.`sbtest1` FORCE INDEX(`PRIMARY`) WHERE ((`id` >= ?)) AND ((`id` <= ?)) LOCK IN SHARE MODE /*pt-online-schema-change 3627724 copy nibble*/
SELECT /*!40001 SQL_NO_CACHE */ `id` FROM `sysbench`.`sbtest1` FORCE INDEX(`PRIMARY`) WHERE ((`id` >= ?)) ORDER BY `id` LIMIT ?, 2 /*next chunk boundary*/
Copying `sysbench`.`sbtest1`:   0% 24:15 remain
(omitted)
Copying `sysbench`.`sbtest1`:  14% 08:39 remain
^C# Exiting on SIGINT.
Not dropping triggers because the tool was interrupted. To drop the triggers, execute:
DROP TRIGGER IF EXISTS `sysbench`.`pt_osc_sysbench_sbtest1_del`;
DROP TRIGGER IF EXISTS `sysbench`.`pt_osc_sysbench_sbtest1_upd`;
DROP TRIGGER IF EXISTS `sysbench`.`pt_osc_sysbench_sbtest1_ins`;
Not dropping the new table `sysbench`.`_sbtest1_new` because the tool was interrupted. To drop the new table, execute:
DROP TABLE IF EXISTS `sysbench`.`_sbtest1_new`;
`sysbench`.`sbtest1` was not altered.
History saved. Job id: 1
[ STEP 2 RESUME ]
pt-online-schema-change --print --progress time,5 \
--user=######### \
--port=##### \
--socket=/tmp/mysql_59306.sock \
--ask-pass \
D=sysbench,t=sbtest1 \
--alter "ADD INDEX test_idx2(pad)" \
--tries swap_tables:1:1,drop_triggers:1:1 \
--no-drop-old-table \
--no-drop-new-table \
--no-drop-triggers \
--chunk-size=500 \
--chunk-size-limit=600 \
--charset=UTF8 \
--alter-foreign-keys-method=auto \
--set-vars lock_wait_timeout=5 \
--critical-load threads_running=20 --max-load Threads_running=10 \
--history \
--resume=1 \
--execute
When I test like this and restart the job through --resume after stopping it, I get the following failure.
[ STEP 2 ERROR ]
--history table pt_osc_history exists but does not have expected structure.
Explected structure:
CREATE TABLE `pt_osc_history` (
  job_id INT NOT NULL AUTO_INCREMENT,
  db CHAR(64) NOT NULL,
  tbl CHAR(64) NOT NULL,
  new_table_name CHAR(64) DEFAULT NULL,
  altr TEXT NOT NULL,
  args TEXT NOT NULL,
  lower_boundary TEXT,
  upper_boundary TEXT,
  done ENUM('yes','no') NOT NULL DEFAULT 'no',
  ts TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (job_id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
User-defined table:
CREATE TABLE `pt_osc_history` (
  `job_id` int NOT NULL AUTO_INCREMENT,
  `db` char(64) COLLATE utf8mb4_general_ci NOT NULL,
  `tbl` char(64) COLLATE utf8mb4_general_ci NOT NULL,
  `new_table_name` char(64) COLLATE utf8mb4_general_ci DEFAULT NULL,
  `altr` text COLLATE utf8mb4_general_ci NOT NULL,
  `args` text COLLATE utf8mb4_general_ci NOT NULL,
  `lower_boundary` text COLLATE utf8mb4_general_ci,
  `upper_boundary` text COLLATE utf8mb4_general_ci,
  `done` enum('yes','no') COLLATE utf8mb4_general_ci NOT NULL DEFAULT 'no',
  `ts` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`job_id`)
) ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci
Exiting to avoid damage of the user-defined table. at ./pt-online-schema-change line 10816.
Which MySQL server version do you use? Please file a bug at https://perconadev.atlassian.net
I used MySQL 8.0.28.
Why does it need to re-alter the table if the statement didn't change?
I'm trying to make a dummy alter to reclaim the free space of a very large table, where the process takes more than a few days.
The tool fails once some DB cleanup tasks kick in. But trying to resume re-runs the "alter" statement after the table already has data, and it's unable to complete because the table is quite large and runs out of space to run the DDL.
I'm running this on Aurora MySQL, BTW.
Please send the full command you use when you run pt-online-schema-change for the first time and when you resume the job.