pt-table-checksum questions about algorithm


    Hi,

    Please excuse my limited Perl knowledge. I was not able to read and understand the pt-table-checksum code thoroughly, so my questions may be quite simple. Here they are.

    1. It issues the $ub_sql query to get the {upper} boundary of the current chunk and the {next_lower} boundary of the next chunk, if there is one. Does it check that replicas have no rows between {upper} and {next_lower}?

    2. Issuing $ub_sql and then $checksum_dml reads part of the same data twice (a LIMIT with an offset has to scan past that many rows). Since the master it runs on is often the main production server and is frequently IO-bound, this is resource-intensive, time-consuming, or both. Why can't it issue only $checksum_dml, using the LIMIT from $ub_sql instead of WHERE boundaries, log the maximum boundary to the percona.checksums table, and check that it is the same across all replicas? This would replace the O(N) cost of running $ub_sql with the O(1) cost of a single-row select from the percona.checksums table.

    I mean a query like this (for simplicity I assume a single-column index):

    REPLACE INTO `percona`.`checksums` (db, tbl, chunk, chunk_index, lower_boundary, upper_boundary, this_cnt, this_crc)
    SELECT 'schema', 'table', 'chunk', 'chunk_index', 'lower_boundary',
    MAX(`index_field`) AS upper_boundary,
    COUNT(*) AS cnt,
    COALESCE(LOWER(CONV(BIT_XOR(CAST(CRC32(CONCAT_WS(' #', `field1`, ..., `fieldN`, CONCAT(ISNULL(`field1`), ..., ISNULL(`fieldN`)))) AS UNSIGNED)), 10, 16)), 0) AS crc
    FROM (
      -- LIMIT has to apply before the aggregates, so the chunk is selected in a derived table;
      -- a LIMIT on the outer query would only limit the single aggregated row.
      SELECT * FROM `schema`.`table` FORCE INDEX(`chunk_index`)
      WHERE `index_field` >= :upper_boundary_of_previous_chunk:
      ORDER BY `index_field`
      LIMIT ?
    ) AS chunk_rows
    /*checksum chunk*/

    At least one row, the one that satisfies `index_field` = :upper_boundary_of_previous_chunk:, will be read twice this way. This doesn't spoil the checksum results, but it is unnecessary overhead. I'm quite sure this can be avoided with something like MAX(CASE ... END), but I don't want to go that far now.
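    To make the idea concrete, here is a minimal Python sketch of single-pass chunking, run against an in-memory SQLite table rather than MySQL. It is not how pt-table-checksum works; it only illustrates the proposal. The table `t`, its columns `id` and `val`, and the helpers `row_crc`, `bit_xor`, `make_demo_conn` and `checksum_chunks` are all made up for the example. It assumes a unique single-column index, which is what allows a strict `>` on the previous upper boundary so that no row is read twice.

```python
import sqlite3
import zlib


def row_crc(row_id, val):
    """Per-row CRC32, standing in for MySQL's CRC32(CONCAT_WS(...))."""
    return zlib.crc32(f"{row_id}#{val}".encode())


class BitXor:
    """Hypothetical bit_xor aggregate; SQLite has no built-in BIT_XOR."""

    def __init__(self):
        self.acc = 0

    def step(self, value):
        self.acc ^= value

    def finalize(self):
        return self.acc


def make_demo_conn(ids):
    """Build an in-memory table t(id, val) with the given unique ids."""
    conn = sqlite3.connect(":memory:")
    conn.create_function("row_crc", 2, row_crc)
    conn.create_aggregate("bit_xor", 1, BitXor)
    conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", [(i, f"v{i}") for i in ids])
    return conn


def checksum_chunks(conn, chunk_size):
    """One query per chunk: it checksums the rows and reports the upper boundary."""
    chunks = []
    lower = None  # None means "start of table"
    while True:
        where, params = ("", ()) if lower is None else ("WHERE id > ?", (lower,))
        upper, cnt, crc = conn.execute(
            f"""
            SELECT MAX(id), COUNT(*), COALESCE(bit_xor(crc), 0)
            FROM (SELECT id, row_crc(id, val) AS crc
                  FROM t {where}
                  ORDER BY id
                  LIMIT ?)
            """,
            params + (chunk_size,),
        ).fetchone()
        if cnt == 0:
            break  # no rows left past the previous upper boundary
        chunks.append((lower, upper, cnt, crc))
        lower = upper  # the next chunk starts strictly after this boundary
    return chunks


if __name__ == "__main__":
    demo = make_demo_conn([1, 2, 3, 5, 8, 13, 21, 34, 55, 89])
    for chunk in checksum_chunks(demo, chunk_size=4):
        print(chunk)
```

    Each stored (lower, upper] pair describes a contiguous range, so a replica-side comparison over the same range would also cover any rows that sit between two chunks. Whether this fits the way the checksum statements are replicated is a separate question that the sketch does not address.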

    Thanks for the answers in advance.
    Last edited by neuen; 03-22-2014, 04:17 PM.