Emergency

Transaction records missing on master but present on slave? Race condition?

Collapse
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Transaction records missing on master but present on slave? Race condition?

    My company has been running Percona Server 5.6.13-61.0 for the last year or so. The database handles order transactions, and is replicated to another slave server (same exact version) offsite.

    Recently one of our clients reported that an order seemingly disappeared. We randomly would get reports like this -- maybe 1 a month, so we looked into it deeper. By chance, we discovered that the records for the order existed on the slave, but were missing on the master! We looked for other records on the slave that were missing on the master, and indeed there were several over the last year.

    Looking into it even deeper still, we tried examining what might be special about the orders missing on the master. What we discovered was that for each one of them, they had another unrelated transaction (which do show up on the master) timestamped at the exact same time. For example say you have two users on a site ordering and they hit checkout at the very exact same time, that kind of scenario. Both get committed to the slave but only one makes it to master.

    How is it possible for this scenario to happen? What this seems like to me is some sort of race condition bug? Is this an issue known to be a problem in our version of Percona?

  • #2
    With mysql replication, slave servers receive changes via reading and replaying master server's binlog which records changes that have been committed on the master.

    1. Please check if there is any program that writes directly to the slave. If log_bin is turned on on the slaves, check server_id in binlog. Or set read_only on slave and see if any application reports error.
    2. If using statement-based replication, check master's error log for "Unsafe statement". Those are DMLs that are not deterministic and could potential result in different changes on master and slaves.
    3. Compare table schema between master/slaves, especially PK, UK, FK cascade options, and Triggers.
    4. Check for any jobs on master that set session sql_log_bin = 0, which could potential change/delete order records.

    Comment


    • #3

      1.) I have verified no program is writing directly to slave. The master is the only one connected to the webserver that customers order from and there is no way for the slave to have been given that transaction directly.

      2.) We already scanned the MySQL error log for any entry at the time of the transaction, and there were non at all at that time.

      3.) I compared the schemas and they are identical. The only thing I noted was there is a table that is part of the transaction that does not have any keys defined on it and a delete is done on it. But that table is identical between master and slave.

      4.) There are no sql_log_bin = 0 statements.

      The fact that every missing order we found on the slave that is absent on the master has another transaction at the exach hh:mm:ss I believe is extremely significant. That can't just be by chance?

      Comment


      • #4
        Sorry I don't have much help to offer, but I would be very interested in learning more about the issue you are describing.

        Disclaimer: I'm not with Percona, and nowhere near as good as they are.

        I would look first at following:
        Is the slave server set with log_slave_updates? Would you be able to compare the binary logs of the master and server to examine what was written to the binary log on both for a given orderID? If it's on the slave, it must have been on the master at some point.

        ​Are you using statement or row based replication?

        What type of unique identifier does your orders table use? If it's autoincremented, is there a gap on the master for the missing order?

        Maybe some more details about what's happening in that transaction when an order is written would help.

        Best of luck figuring this one out!


        Comment


        • #5
          The slave is set with with log_slave_updates

          I am using MIXED replication

          The unique identifier is a non-autoincrement integer

          It just happened again this weekend. With the exact same fingerprint -- two transactions submitted at the exach hh:mm:ss to the master, and only one is recorded on the master, the other makes its way to the slave. And I am scrambling to figure it out, but the point is THIS SHOULD NOT BE POSSIBLE!

          The thing is, I don't think this has any real relationship to replication at all. Its just that replication was what helped us verify something is very wrong. These transactions simply vanish on the master.

          Comment


          • #6
            This continues to happen -- exact same pattern of transactions disappearing on the master that appear on the slave!!! It only happens when there are two transactions to the same database committing at exact same time, usually during peak times when the server is under load.

            I am willing to try anything, including updating to the latest version of Percona. If i updated from 5.6.13-61.0 to 5.6.21-70.0, would i need to dump and reload the databases, or is it as simple as updating the binaries?

            Comment


            • #7
              Originally posted by cmollet View Post
              Would you be able to compare the binary logs of the master and server to examine what was written to the binary log on both for a given orderID? If it's on the slave, it must have been on the master at some point.
              What type of unique identifier does your orders table use? If it's autoincremented, is there a gap on the master for the missing order?
              I just did the last things you asked here. I used mysqlbinlog to analyze the log on the master. The insert statements (and commit) of the transaction are all there, as on the slave. But there is no trace of the records that should have been inserted in in the master database.

              I did more analysis and one of the tables in the transaction does have an auto-increment primary key. And THERE IS A GAP on the master. The record missing on the master that exists on the slave should have that missing number in the sequence.



              Comment


              • #8
                Ok, I think I have found another clue.

                I did a SHOW ENGINE INNODB STATUS, and on that there is a "LATEST DETECTED DEADLOCK" section and it lists a couple statements i can tell are from a missing transaction on the master as "WAITING FOR THIS LOCK TO BE GRANTED".

                Then the last line of that section says "WE ROLL BACK TRANSACTION".

                There are two huge problems here:

                1.) records for that transaction, which I presume was rolled back on the master, exist on the slave but not the master. THIS SHOULD NOT BE POSSIBLE??
                2.) no error was sent to the application layer -- the master acted as if the COMMIT worked.
                Last edited by ronaldjeremy; 11-09-2014, 08:04 AM.

                Comment


                • #9
                  Hi! Sorry for upping this old thread, but recently we encountered similar issues so it seems that there are some really critical bug in InnoDB.

                  Please see SQLFiddle link in attachment with db schema.
                  We enabled general log in order to investigate anomaly.

                  Code:
                   
                   2018-02-08T16:00:29.391039Z	107061357 Query	START TRANSACTION  2018-02-08T16:00:29.394849Z	107061357 Execute	select * from `currencies` where `currencies`.`id` = 1 limit 1 lock in share mode  2018-02-08T16:00:29.396052Z	107061357 Execute	select * from `operations` where `user_id` = 1155 and `currency_id` = 1 and `op_hash` = 'XXXCCC' limit 1 for update  2018-02-08T16:00:29.784439Z	107061357 Execute	select * from `accounts` where (`user_id` = 1155 and `currency_id` = 1) limit 1 for update  2018-02-08T16:00:36.737472Z	107061357 Execute	update `accounts` set `balance` = '100' where `id` = 100500  2018-02-08T16:00:36.738622Z	107061357 Execute	select * from `stats` where (`user_id` = 1155 and `currency_id` is null and `date` = '2018-02-08') limit 1 for update # Select if exists and update, or insert otherwise  2018-02-08T16:00:36.740155Z	107061357 Execute	insert into `stats` (`user_id`, `currency_id`, `date`, `op_count`) values (303155, NULL, '2018-02-08', 1)  2018-02-08T16:00:38.004752Z	107061357 Execute	insert into `operations` (`user_id`, `currency_id`, `op_hash`, `amount`) values (303155, 2, 'XXXCCC', '100')  2018-02-08T16:00:38.128726Z	107061357 Query	COMMIT
                  Let me explain what's wrong with this transaction.
                  At the moment this transaction run there was record in stats table that correspondent to criteria in select on line 6, but for some reason this statement return 0 rows that led to insert on line 7 (according to application business logic). So that's first anomaly.
                  And second one is that update statement on line 5 was not applied on data, and it also not present neither in binlog on master no in general log on slave
                  Here is the binlog
                  Code:
                   
                   #180208 16:00:36 server id 2  end_log_pos 31456966 CRC32 0xd32e6f52 	Intvar SET INSERT_ID=1982809/*!*/; #180208 16:00:36 server id 2  end_log_pos 31457177 CRC32 0x87dd5690 	Query	thread_id=107061357	exec_time=1	error_code=0 SET TIMESTAMP=1518105636/*!*/; insert into `stats` (`user_id`, `currency_id`, `date`, `op_count`) values (1155, NULL, '2018-02-08', 1) /*!*/; # at 31457177 #180208 16:00:36 server id 2  end_log_pos 31457208 CRC32 0x11b139be 	Xid = 8755238199 COMMIT/*!*/; # at 31457208 #180208 16:00:38 server id 2  end_log_pos 31512632 CRC32 0x321e6aa7 	Query	thread_id=107061357	exec_time=0	error_code=0 SET TIMESTAMP=1518105638/*!*/; BEGIN /*!*/; # at 31512612 #180208 16:00:38 server id 2  end_log_pos 31512664 CRC32 0xe76dd851 	Intvar SET INSERT_ID=8113929/*!*/; #180208 16:00:38 server id 2  end_log_pos 31513127 CRC32 0xa0eb9a2b 	Query	thread_id=107061357	exec_time=0	error_code=0 SET TIMESTAMP=1518105648/*!*/; insert into `operations` (`user_id`, `currency_id`, `op_hash`, `amount`) values (1155, 1, 'XXXCCC', '100') /*!*/; # at 31513117
                  What we have found later that there was deadlock related to `stats` table
                  Here is INNODB STATUS
                  Code:
                   
                   ------------------------ LATEST DETECTED DEADLOCK ------------------------ 2018-02-08 16:44:08 0x7f1bf5e7b700 *** (1) TRANSACTION: TRANSACTION 443740480, ACTIVE 0 sec starting index read mysql tables in use 1, locked 1 LOCK WAIT 14 lock struct(s), heap size 1136, 8 row lock(s), undo log entries 4 MySQL thread id 107582382, OS thread handle 139758067050240, query id 8765564614 172.17.0.2 DB Sending data select * from `stats` where (`user_id` = 5116 and `currency_id` is null and `date` = '2018-02-08') limit 1 for update *** (1) WAITING FOR THIS LOCK TO BE GRANTED: RECORD LOCKS space id 149 page no 15866 n bits 736 index stats_date_user_id_currency_id_unique of table `DB`.`stats` trx id 443740480 lock_mode X waiting *** (2) TRANSACTION: TRANSACTION 443740461, ACTIVE 45 sec inserting mysql tables in use 1, locked 1 41 lock struct(s), heap size 8400, 45 row lock(s), undo log entries 31 MySQL thread id 107516770, OS thread handle 139758066448128, query id 8765564657 172.17.0.2 DB update insert into `operations` (`user_id`, `currency_id`, `op_hash`, `amount`) values (5116, 1, 'CCZXCZX', '100500') *** (2) HOLDS THE LOCK(S): RECORD LOCKS space id 149 page no 15866 n bits 736 index stats_date_user_id_currency_id_unique of table `DB`.`stats` trx id 443740461 lock_mode X *** (2) WAITING FOR THIS LOCK TO BE GRANTED: RECORD LOCKS space id 221 page no 2910 n bits 120 index PRIMARY of table `DB`.`users` trx id 443740461 lock mode S locks rec but not gap waiting *** WE ROLL BACK TRANSACTION (1)
                  Please note that schema has been simplified specially for this discussion.

                  Thanks for any help in advance!
                  Application for testing and sharing SQL queries.

                  Comment


                  • #10
                    Hello toon1
                    I have contacted some of the tech team here, they don't think you have encountered a bug on this occasion.
                    There are a lot of reasons you could be getting this behaviour, and it's very difficult to diagnose from here.
                    However, their suggestion is that perhaps your app isn't checking return codes on a lock failure/deadlock. Or it might be misinterpreting a error meaning.
                    Perhaps if you can investigate that first?

                    Comment

                    Working...
                    X