
Next step in debugging replication error? (HA_ERR_KEY_NOT_FOUND)


  • #1

    I have MySQL 5.5.19 on both master and slave in a simple master->slave replication setup using MIXED binlog mode. The slave has nothing that writes to the replicated InnoDB database. The isolation level is the default, REPEATABLE-READ.

    Every once in a while I get error code 1032, HA_ERR_KEY_NOT_FOUND, on the slave for a DELETE operation. After digging into the relay log at the problematic log position, I can verify that the record in question was indeed non-existent in the table. I can also verify that the record had an INSERT query earlier in the binlog; however, I don't know how to tell from the relay log alone whether that INSERT actually took place. Because of that, I also cannot tell whether the 1032 error happened because the record somehow got deleted earlier or was never inserted in the first place.
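    For anyone debugging the same thing, here is a sketch of how the relay log can be decoded around the failing position (the file name and position below are placeholders, not from this setup):

    ```
    # -vv makes mysqlbinlog print commented pseudo-SQL for row-based
    # events, so the affected row values become readable.
    mysqlbinlog -vv --start-position=107 mysqld-relay-bin.000042 | less
    ```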

    I have tried re-syncing the master and slave with a fresh dump, lock, log reset, start position, etc. I also checked the problematic table and found the keys and rows are exactly the same on master and slave after the re-sync. However, the error still pops up after some hours of normal operation. The servers were up the whole time with no other errors or shutdowns.

    Is there any way to track down how this might have happened? Also, is there a way to tell whether a particular replicated event was done in ROW or STATEMENT mode?
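    On the ROW-vs-STATEMENT question, the event type in the log answers it: a statement-based change shows up as a single Query event containing the original SQL, while a row-based change shows up as a Table_map event followed by Write_rows/Update_rows/Delete_rows events. A sketch, assuming MySQL 5.5 (file name and offset are placeholders):

    ```
    -- List event types in a relay log without decoding the whole file
    -- (SHOW RELAYLOG EVENTS is available as of MySQL 5.5).
    SHOW RELAYLOG EVENTS IN 'mysqld-relay-bin.000042' FROM 107 LIMIT 20;
    -- Event_type = 'Query'                      => statement-based
    -- Event_type = 'Table_map' + 'Delete_rows'  => row-based
    ```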

    This is a fairly new setup, so I don't rule out having less-than-optimal settings in my.cnf... What settings could cause replication to drop or miss events?

    Thank you in advance.

  • #2
    If the row is gone, it's most likely that something external to replication is deleting it. I'd enable binary logging on the replica, and if possible (if this replica isn't also a master) make sure that log-slave-updates is NOT enabled.

    Now any updates in the replica's binary log are due to changes that took place on the replica directly, which is usually a mistake of some kind (app connects to the wrong place, for example). You can also try to avoid this with @@read_only on the replica, but that only works if you haven't given normal users the SUPER privilege, which I actually see a lot in practice even though it's not a good idea.
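    A minimal sketch of that setup on the replica (option names as in MySQL 5.5; the log basename is made up):

    ```
    # my.cnf on the replica
    [mysqld]
    log-bin   = replica-bin   # record changes applied directly to the replica
    # log-slave-updates left OFF (the default): replicated changes are
    # NOT written to this binary log, only direct writes are
    read_only = 1             # reject writes from accounts without SUPER
    ```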

    • #3
      Thanks for the reply; that's a useful trick to know when debugging replication problems. I think I have an idea of what the problem might be...

      I have replicate-do-db set to replicate this DB only. There is a maintenance script that starts by USEing another database but also makes cross-database changes to this DB on the master. Looking at what the script does, none of the conditions that trigger the STATEMENT-to-ROW logging switch seem to apply... I will modify the script and see if this still happens.
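      For readers hitting the same trap: with statement-based events, replicate-do-db filters on the *default* database (the one selected with USE), not on the database named in the statement, so a pattern like this gets skipped on the slave even though it changes the replicated DB (database and table names are made up):

      ```
      USE maintenance_db;                        -- default DB: maintenance_db
      DELETE FROM replicated_db.t WHERE id = 42; -- changes the replicated DB
      -- With replicate-do-db = replicated_db, the slave checks the default
      -- database (maintenance_db) and skips this statement entirely.
      ```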

      • #4
        xaprb wrote on Fri, 17 February 2012 16:09
        If the row is gone, it's most likely that something external to replication is deleting it. I'd enable binary logging on the replica, and if possible (if this replica isn't also a master) make sure that log-slave-updates is NOT enabled.

        Now any updates in the replica's binary log are due to changes that took place on the replica directly, which is usually a mistake of some kind (app connects to the wrong place, for example). You can also try to avoid this with @@read_only on the replica, but that only works if you haven't given normal users the SUPER privilege, which I actually see a lot in practice even though it's not a good idea.
        Even if he enables log-slave-updates, he can see in the slave's binary logs where the statement came from: every server has a unique ID, so he can check the server ID recorded for that statement in the binlog.
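        To illustrate, every event header printed by mysqlbinlog includes the originating server's ID, so a line like the following (values are illustrative) shows which server issued the statement:

        ```
        #120217 16:09:00 server id 2  end_log_pos 358  Query  thread_id=12 ...
        # "server id 2": the statement originated on the server with server-id=2
        ```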

        • #5
          To wrap up this topic: the maintenance script was the culprit. After changing the script's connection to start in the replicated DB, the error has gone away and replication has been running normally since. Thanks for all your help.
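          In other words, the fix amounts to making the maintenance script select the replicated database as its default before making changes (names are placeholders):

          ```
          USE replicated_db;           -- default DB now matches replicate-do-db
          DELETE FROM t WHERE id = 42; -- replicates as expected
          ```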
