innodb_autoinc_lock_mode with SBR and multi-row inserts

  • innodb_autoinc_lock_mode with SBR and multi-row inserts

    I've been trying to hunt down an issue for a few days and believe I have found the cause. Here is the setup for the particular cluster I have been having issues with:

    4 Total Servers
    - 2 Masters in Active/Passive M-M
    - 2 slaves (each off of 1 master)
    - All running Percona MySQL 5.1.54
    - Statement Based Replication
    - All writes go to single master - replicated to passive master and primary slave

    Documentation says that when using SBR, if the number of rows to be inserted is known in advance and the inserts are all the same type, innodb_autoinc_lock_mode 0 or 1 should be safe. I have been using mode 1 (the default) and have seen my slaves rapidly get out of sync (when I skip a few errors on the slave) or die on a duplicate primary key. I verified that the only user connecting was the mysql system user (the replication thread) and found that the PKs differed between the master and the slave.

    The query causing the issue is a multi-row insert where no PKs are specified, so I would assume it does not count as a mixed-mode insert (though this could be the problem?).

    Here is some sample data (with column names changed):
    Master Data

    pk, fk_1, fk_2, fk_3, att_1, att_2, fk_4, att_3, att_4
    885357437, 105516509, 186757, 230399771, '0', 'new', 0, 2011-02-21 11:57:15, 1298311035
    885357431, 2453251, 186757, 230399771, '1', 'sent', 0, 2011-02-21 11:57:15, 1298311035

    Master Binlog Events:

    'mysql-bin.000047', 214491072, 'Query', 5001, 214491155, 'BEGIN'
    'mysql-bin.000047', 214491155, 'Intvar', 5001, 214491183, 'INSERT_ID=885357431'
    'mysql-bin.000047', 214491183, 'Query', 5001, 214491470, 'use `my_db`; INSERT INTO `table_1` (fk_1, fk_2, fk_3, att_1, att_2, att_4) VALUES ('2453251', 186757, 230399771, 1, 'sent', 1298311035),('105516509', 186757, 230399771, 0, 'new', 1298311035)'
    'mysql-bin.000047', 214491470, 'Xid', 5001, 214491497, 'COMMIT /* xid=12719492 */'


    'mysql-bin.000047', 214494305, 'Query', 5001, 214494388, 'BEGIN'
    'mysql-bin.000047', 214494388, 'Intvar', 5001, 214494416, 'INSERT_ID=885357433'
    'mysql-bin.000047', 214494416, 'Query', 5001, 214494704, 'use `my_db`; INSERT INTO `table_1` (fk_1, fk_2, fk_3, att_1, att_2, att_4) VALUES ('105620951', 186757, 230399773, 1, 'sent', 1298311040),('38049876', 186757, 230399773, 0, 'new', 1298311040)'
    'mysql-bin.000047', 214494704, 'Xid', 5001, 214494731, 'COMMIT /* xid=12719926 */'

    Slave Data:

    885357433, 105516509, 186757, 230399771, '0', 'new', 0, 2011-02-21 11:57:15, 1298311035
    885357431, 2453251, 186757, 230399771, '1', 'sent', 0, 2011-02-21 11:57:15, 1298311035

    As you can see, the slave received the event with 2 rows and performed the insert with 2 sequential IDs (two apart, as expected with the M-M auto_increment settings). The master, however, had not allocated sequential values, and a later statement was logged with an INSERT_ID the slave had already used, which killed replication with a duplicate-key error.
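    The mechanics can be sketched with a toy replay in Python (an illustration only, not MySQL internals; it assumes auto_increment_increment=2, which matches the 2-apart PKs produced by the active/passive M-M setup):

    ```python
    # Toy replay of the binlog sample above. Assumption: auto_increment_increment=2
    # (typical for active/passive M-M), matching the 2-apart PKs in the slave data.

    INCREMENT = 2

    def slave_replay(insert_id, row_count, increment=INCREMENT):
        # Under SBR a multi-row insert is logged with a single Intvar event
        # (INSERT_ID); the slave assigns consecutive values starting there.
        return [insert_id + i * increment for i in range(row_count)]

    # What the master actually stored (from the sample data): a concurrent
    # thread grabbed values between the two rows, so they are not consecutive.
    master_rows = [885357431, 885357437]

    # What the slave stores when it replays INSERT_ID=885357431 with 2 rows.
    slave_rows = slave_replay(885357431, 2)  # [885357431, 885357433]

    # The master's next statement was logged with INSERT_ID=885357433 (second
    # binlog event above) -- a value the slave has already consumed.
    next_insert_id = 885357433
    collision = next_insert_id in slave_rows  # True -> duplicate-key error
    ```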

    After changing the mode from 1 to 0 ("traditional", which uses a table-level AUTO-INC lock held for the duration of the statement), the issue has vanished. I'm wondering whether this is a Percona/MySQL/XtraDB bug, or whether my inserts are technically considered mixed-mode (in which case the documentation is correct in saying they are not safe with mode 1).
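    For anyone wanting to make the same change: innodb_autoinc_lock_mode is a startup-only variable (it cannot be set at runtime), so it has to go in the config file and takes effect after a restart. A minimal my.cnf sketch:

    ```ini
    # mode 0 = "traditional": table-level AUTO-INC lock held for the
    # duration of each INSERT statement
    [mysqld]
    innodb_autoinc_lock_mode = 0
    ```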

    Any thoughts or comments would be appreciated!


  • #2

    Do you have a repeatable test case that can be tried easily?

    We have not changed any replication code in Percona Server that should cause such a bug to appear, but anything is possible when it comes to software engineering. I suggest you put together a test case and check whether this is a Percona Server issue; if it is, we'll fix it.


    • #3

      Thanks for the quick reply. Currently, I don't have any particular SQL that can replicate the exact issue. The problem generally appears with multiple threads writing to the same table at the same time. In our production environment, the table in question contains roughly 100 million rows and we see an average of 10-15 writes per second with spikes approaching 1000 per second.

      I'll try to put together a little script/app that can simulate multiple threads writing to the same table structure at random, but overlapping times with both values for that option and post it here.


      • #4
        Yup. I know such race-condition bugs can be a pain to catch. That is exactly why creating a test case is so important: if you can't repeat the problem easily, it is hard for developers to fix.


        • #5
          To this point, I have been unable to replicate this issue outside of my production environment.

          The only difference between my test environment and production (aside from lots of RAM and disk space) would be a trigger on the table that was causing the issues. I'm planning on adding some trigger logic to my simulation next.

          I've gotten sidetracked over the last week with other projects, but hope to have some time to revisit this shortly.

          I am pleased to report, however, that since switching to the "traditional" autoinc_lock_mode, we have been running flawlessly on 5.1.54 for almost 2 full weeks.


          • #6
            I suggest that you try to reproduce this with some statements that are intentionally interleaved, such as

            insert into t... values (.... sleep(10) ....), (..... sleep(10) .......)

            That could give you a chance to insert another row into the table between the sleeps. I'm not sure this will work, but it seems worth trying. I agree with Peter that this is likely a bug in MySQL itself rather than something we changed; I have heard rumors of similar situations from others.


            • #7
              I meant to add, the rumors include a Percona client who ran into this on standard MySQL. (They diagnosed and reported it to me, but I didn't confirm it.)


              • #8
                Baron, thanks for the suggestion. It turns out that the missing piece to my simulation was actually having a trigger on the table after insert.

                I have successfully been able to reproduce this on a single box. I'm going to verify that it does indeed break replication as I was seeing in my production environment. Also, I'm going to verify that the behavior is corrected when changing the autoinc mode.

                Once I've verified that I can always trigger this result, I'll clean up my sample scripts/SQL and post them here. Hopefully I can replicate this with an empty table so I don't need to send a full 10-million-row table dump.

                Anyway, there is progress and light at the end of the tunnel in my attempts to replicate this.

                As far as submitting a sample case, I have a simple multi-threaded python script that simulates the inserts and some sql to set up the db. Will that work or do you need anything else as well?


                • #9
                  Well, it turns out that along with those two options, the key is also the trigger on the table.

                  I have greatly simplified my sample schema to a single table with just a couple convenience indexes (for tracking the issue).

                  The attached python script basically creates the table and, if specified, creates the triggers on the table as well. It then generates a user selected number of inserts, each with a random number of rows (capped by a user defined max) for a user defined number of threads.
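                  This is not the attached script itself, but the driver it describes might be structured roughly like this (a sketch; `execute` stands in for whatever DB call the real script makes, and the column list mirrors the table_1 sample from the first post):

                  ```python
                  import random
                  import threading

                  def make_insert(max_rows, rng):
                      # Build one multi-row INSERT with a random row count (1..max_rows).
                      # Column list mirrors the table_1 sample; values are dummies.
                      n = rng.randint(1, max_rows)
                      rows = ", ".join(
                          "({0}, 186757, 230399771, 0, 'new', 1298311035)".format(rng.randint(1, 10**6))
                          for _ in range(n)
                      )
                      return "INSERT INTO table_1 (fk_1, fk_2, fk_3, att_1, att_2, att_4) VALUES " + rows

                  def run_threads(n_threads, inserts_per_thread, max_rows, execute):
                      # Fire inserts concurrently; execute(sql) is supplied by the caller
                      # (in the real script, a MySQL connection per thread).
                      def worker(seed):
                          rng = random.Random(seed)
                          for _ in range(inserts_per_thread):
                              execute(make_insert(max_rows, rng))
                      threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
                      for t in threads:
                          t.start()
                      for t in threads:
                          t.join()
                  ```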

                  The trigger is simple: it just does a sleep after each row is inserted. This simulates a trigger that actually takes some time, as happens when working on tables with tens of millions of rows or more.
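                  A trigger of the kind described might look like this (purely illustrative; the trigger name and sleep duration are my own, not taken from the attached script):

                  ```sql
                  -- AFTER INSERT trigger that sleeps per inserted row, simulating
                  -- real trigger work on a very large table (names are illustrative)
                  DELIMITER //
                  CREATE TRIGGER table_1_after_insert AFTER INSERT ON table_1
                  FOR EACH ROW
                  BEGIN
                    DO SLEEP(0.1);
                  END//
                  DELIMITER ;
                  ```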

                  More details on the simulation:

                  Single master and single slave (both running Percona Server 5.1.54)
                  The script connects to and runs queries against the master only

                  The following conditions DO NOT result in a slave error:
                  1. SBR, autoinc = 0, 1 or more threads (with or without trigger)
                  2. SBR, autoinc = 1, 1 thread (with trigger)
                  3. SBR, autoinc = 1, multiple threads (without trigger)

                  Here is the condition that has consistently caused a slave error in my testing:

                  SBR, innodb_autoinc_lock_mode = 1, Multiple threads, Trigger on table

                  Here is a sample usage of the script and there are some other notes in the script comments:

                  python sequential_ids.py --host=localhost --user=testuser --password=testpass --threads=50 --inserts-per-thread=20

                  python sequential_ids.py --help will display the full list of options.