
pt-table-checksum complains replica is stopped when replica is not.



    Hiya,




    I'm having a problem with pt-table-checksum: it reports that the replica server is stopped when, as far as I can tell, the replica is actually running.




    The following is the version of the toolkit being used:

    [root@master]# rpm -qa | grep percona-toolkit
    percona-toolkit-3.0.2-1.el7.x86_64

    [root@master]# pt-table-checksum --version
    pt-table-checksum 3.0.2




    MariaDB version for both master and slave:

    [root@master]# rpm -qa | grep -i MariaDB-server
    MariaDB-server-10.2.5-1.el7.centos.x86_64







    The following is the config file for the tool on the master:




    [root@master]# cat /etc/percona-toolkit/pt-table-checksum.conf
    binary-index
    no-check-binlog-format
    recursion-method = processlist
    replicate = percona.epa_sum
    #replicate-check = TRUE
    #replicate-check-only = TRUE
    user = slave1
    password = <slave1_password>
    host = 127.0.0.1
    ignore-databases = mysql,information_schema,performance_schema
    databases = epa
    engines = InnoDB
    tables = Persons




    The following is from the master server when executing the pt-table-checksum tool:




    [root@master ~]# pt-table-checksum
    Replica zasperdump.androgogic.local is stopped. Waiting.
    Replica zasperdump.androgogic.local is stopped. Waiting.
    Replica zasperdump.androgogic.local is stopped. Waiting.
    Replica zasperdump.androgogic.local is stopped. Waiting.
    Replica zasperdump.androgogic.local is stopped. Waiting.
    :
    ^C# Caught SIGINT.
                TS ERRORS  DIFFS     ROWS  CHUNKS SKIPPED    TIME TABLE
    05-23T13:40:33      0      1        2       1       0 278.173 epa.Persons




    The "Replica zasperdump.androgogic.local is stopped. Waiting." message repeats indefinitely until I interrupt the process.




    The following is the slave status for the master instance where the pt-table-checksum program was executed.




    MariaDB [(none)]> show slave 'master2' status\G;
    *************************** 1. row ***************************
    Slave_IO_State: Waiting for master to send event
    Master_Host: 192.168.82.23
    Master_User: slave1
    Master_Port: 3306
    Connect_Retry: 10
    Master_Log_File: master2-bin.000010
    Read_Master_Log_Pos: 3235061
    Relay_Log_File: mariadb-relay-bin-master2.000002
    Relay_Log_Pos: 2359544
    Relay_Master_Log_File: master2-bin.000010
    Slave_IO_Running: Yes
    Slave_SQL_Running: Yes
    Replicate_Do_DB: epa,percona
    Replicate_Ignore_DB:
    Replicate_Do_Table:
    Replicate_Ignore_Table:
    Replicate_Wild_Do_Table: epa.%,percona.%
    Replicate_Wild_Ignore_Table:
    Last_Errno: 0
    Last_Error:
    Skip_Counter: 0
    Exec_Master_Log_Pos: 3235061
    Relay_Log_Space: 2359863
    Until_Condition: None
    Until_Log_File:
    Until_Log_Pos: 0
    Master_SSL_Allowed: No
    Master_SSL_CA_File:
    Master_SSL_CA_Path:
    Master_SSL_Cert:
    Master_SSL_Cipher:
    Master_SSL_Key:
    Seconds_Behind_Master: 0
    Master_SSL_Verify_Server_Cert: No
    Last_IO_Errno: 0
    Last_IO_Error:
    Last_SQL_Errno: 0
    Last_SQL_Error:
    Replicate_Ignore_Server_Ids:
    Master_Server_Id: 2
    Master_SSL_Crl:
    Master_SSL_Crlpath:
    Using_Gtid: Current_Pos
    Gtid_IO_Pos: 101-1-7772,0-1-7,102-2-20215,30-3-7953
    Replicate_Do_Domain_Ids: 102
    Replicate_Ignore_Domain_Ids:
    Parallel_Mode: conservative
    SQL_Delay: 0
    SQL_Remaining_Delay: NULL
    Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
    1 row in set (0.00 sec)




    As may be obvious, the master is a test instance with no traffic on it, and as far as I can tell there is no lag on the slave either.




    From the master server:




    MariaDB [(none)]> show master status;

    +--------------------+----------+--------------+------------------+
    | File               | Position | Binlog_Do_DB | Binlog_Ignore_DB |
    +--------------------+----------+--------------+------------------+
    | master2-bin.000010 |  3235061 |              |                  |
    +--------------------+----------+--------------+------------------+
    1 row in set (0.00 sec)




    MariaDB [(none)]> select binlog_gtid_pos('master2-bin.000010',3235061);

    +-----------------------------------------------+
    | binlog_gtid_pos('master2-bin.000010',3235061) |
    +-----------------------------------------------+
    | 102-2-20215                                   |
    +-----------------------------------------------+
    1 row in set (0.00 sec)




    The Gtid_IO_Pos seems to be perfectly matched on both the master and the slave.
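
    The comparison above can be made explicit with a small sketch. It assumes MariaDB's `domain-server-sequence` GTID format and treats the replica as caught up when, for every domain in the master's position, the replica's sequence number is at least as high. The function names `parse_gtid_list` and `replica_has_reached` are hypothetical helpers for illustration, not part of any Percona or MariaDB tool.

```python
from typing import Dict


def parse_gtid_list(text: str) -> Dict[int, int]:
    """Map replication domain ID -> sequence number for a MariaDB GTID list."""
    positions: Dict[int, int] = {}
    for gtid in text.split(","):
        gtid = gtid.strip()
        if not gtid:
            continue
        # A MariaDB GTID is written as domain-server-sequence.
        domain, _server, seq = (int(part) for part in gtid.split("-"))
        positions[domain] = seq
    return positions


def replica_has_reached(master_gtid: str, replica_gtid_io_pos: str) -> bool:
    """True if the replica's position covers every domain in the master's position."""
    replica = parse_gtid_list(replica_gtid_io_pos)
    master = parse_gtid_list(master_gtid)
    return all(replica.get(domain, -1) >= seq for domain, seq in master.items())


if __name__ == "__main__":
    # Values copied from the output quoted in this post.
    print(replica_has_reached("102-2-20215",
                              "101-1-7772,0-1-7,102-2-20215,30-3-7953"))
```

    Only domain 102 is compared here because that is the only domain the master's `binlog_gtid_pos` result contains; the other entries in Gtid_IO_Pos belong to other domains.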




    After a bit of digging into the /bin/pt-table-checksum code, I found the section below, which is where I think it is failing:




    my @lagged_slaves = map { {cxn=>$_, lag=>undef} } @$slaves;
    while ( $oktorun->() && @lagged_slaves ) {
       PTDEBUG && _d('Checking slave lag');
       for my $i ( 0..$#lagged_slaves ) {
          my $lag = $get_lag->($lagged_slaves[$i]->{cxn});
          PTDEBUG && _d($lagged_slaves[$i]->{cxn}->name(),
             'slave lag:', $lag);
          if ( !defined $lag || $lag > $max_lag ) {
             $lagged_slaves[$i]->{lag} = $lag;
          }
          else {
             delete $lagged_slaves[$i];
          }
       }




    If I change the first line from

    my @lagged_slaves = map { {cxn=>$_, lag=>undef} } @$slaves;

    to

    my @lagged_slaves = ();

    the program immediately works and returns the expected results.
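
    To illustrate why the hack appears to help, here is a simplified Python model of the quoted loop. It is a sketch, not the tool's code: `wait_for_slaves`, `get_lag`, `max_lag` and `max_rounds` are stand-ins I have named for illustration (the real loop has no round limit, which is why it waits forever). A replica stays in the lagged list whenever its lag is undefined or above the threshold, so a lag lookup that always returns an undefined value never lets the loop finish, while an empty starting list skips the check entirely.

```python
from typing import Callable, List, Optional


def wait_for_slaves(slaves: List[str],
                    get_lag: Callable[[str], Optional[int]],
                    max_lag: int = 1,
                    max_rounds: int = 5) -> List[str]:
    """Return the replicas still treated as lagging after max_rounds polls."""
    lagged = list(slaves)
    rounds = 0
    while lagged and rounds < max_rounds:
        still_lagged = []
        for name in lagged:
            lag = get_lag(name)
            # Same condition as the Perl excerpt: undefined lag counts as lagging.
            if lag is None or lag > max_lag:
                still_lagged.append(name)
        lagged = still_lagged
        rounds += 1
    return lagged


if __name__ == "__main__":
    replica = "zasperdump.androgogic.local"
    print(wait_for_slaves([replica], lambda name: None))  # lag never defined
    print(wait_for_slaves([replica], lambda name: 0))     # lag of zero seconds
    print(wait_for_slaves([], lambda name: None))         # the emptied-list hack
```

    Note that emptying the list removes the lag check altogether rather than fixing the lag lookup, so the tool would no longer pause for a replica that is genuinely behind or stopped.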




    I'm not sure how the program determines the slave lag, but I suspect it's missing something and is therefore reporting "Replica zasperdump.androgogic.local is stopped. Waiting."
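
    One way to reason about this is sketched below, under a loudly stated assumption: that the lag is read from the Seconds_Behind_Master field of a plain `SHOW SLAVE STATUS`, and that a missing row or a NULL value is treated as "undefined". Neither point is confirmed in this thread. If that assumption holds, it would be worth checking what `SHOW SLAVE STATUS` without the 'master2' connection name returns on this multi-source setup, since the status shown earlier was obtained with the name. The helper `lag_from_status` is hypothetical and only parses `\G`-style text; it does not talk to a server.

```python
from typing import Optional


def lag_from_status(status_text: str) -> Optional[int]:
    """Extract Seconds_Behind_Master from \\G-style output, or None if absent/NULL."""
    for line in status_text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key.strip() == "Seconds_Behind_Master":
            value = value.strip()
            # NULL (or anything non-numeric) is reported as an undefined lag.
            return int(value) if value.isdigit() else None
    # No such field at all, e.g. an empty result set.
    return None


if __name__ == "__main__":
    named = "Slave_IO_Running: Yes\nSlave_SQL_Running: Yes\nSeconds_Behind_Master: 0"
    print(lag_from_status(named))  # status text containing the field
    print(lag_from_status(""))     # hypothetical empty result
```

    An undefined result from a lookup like this is exactly the case that keeps a replica in the lagged list in the loop quoted earlier.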




    Any assistance you can provide to get pt-table-checksum working properly on my setup without the code hack would be deeply appreciated.




    Thanks in advance.

  • #2
    Hi,

    I am running into the same issue. Have you found a fix for this yet?

    Thanks.
