GET 24/7 LIVE HELP NOW

Announcement

Announcement Module
Collapse
No announcement yet.

Replication hangs after mysqldump / innobackupex from master

Page Title Module
Move Remove Collapse
X
Conversation Detail Module
Collapse
  • Filter
  • Time
  • Show
Clear All
new posts

  • Replication hangs after mysqldump / innobackupex from master

    Hi,

    currently we have some problems with the replication of the percona server (Server version: 5.5.27-28.1-log Percona Server (GPL), Release 28.1).
    The problem is that the slave stands still some seconds after the innobackupex backup was copied and the change master & start slave was executed, but only the Exec_Master_Log_Pos hangs, Relay_Log_Space is still increasing and in the file system the relay logs are increasing. The configuration (my.cnf) contains exactly the same configuration as the master, we tried also to change the binlog_format from RAW to MIXED but nothing changed (of course we did the reset master thing after the changes).

    We also tried a mysqldump import to the slave but nothing seems to solve the problem.

    While the slave is started the mysqld process consumes 100% of one CPU thread, the IO-load itself is low and the server runs with a high performance SSD RAID 5.

    Our configuration is as the following:
    key_buffer = 4096M
    max_allowed_packet = 128M
    read_buffer_size = 8M
    thread_stack = 512K
    table_cache = 48K
    myisam_sort_buffer = 70M
    join_buffer_size = 1024M
    tmp_table_size = 2048M
    max_heap_table_size = 50M
    max_connections = 300
    innodb_adaptive_flushing_method = keep_average
    myisam-recover = BACKUP
    query_cache_limit = 64M
    query_cache_size = 100M
    query_cache_type = 1
    server-id = 805
    log_bin = /var/lib/mysql/percona/logs/master_805
    expire_logs_days = 1
    max_binlog_size = 100M
    binlog_format = MIXED
    innodb_data_home_dir = /var/lib/mysql/percona/innodb/
    innodb_data_file_path = ibdata1:4096M:autoextend
    innodb_autoextend_increment = 4096M
    innodb_log_group_home_dir = /var/lib/mysql/percona/logs/
    innodb_buffer_pool_size = 8192M
    innodb_additional_mem_pool_size = 1024M
    innodb_log_file_size = 1024M
    innodb_log_buffer_size = 20M
    innodb_flush_log_at_trx_commit = 0
    innodb_lock_wait_timeout = 9600
    transaction-isolation = READ-COMMITTED
    innodb_thread_concurrency = 12
    innodb_file_per_table
    sort_buffer_size = 1024M
    net_write_timeout = 3600

    We're currently a bit perplexed and don't know how to fix this issue, so hopefully you expert guys have an idea what we can do to solve that problem.

    Any help would be really welcome!

    Thanks,

    Sebastian

  • #2
    Hi,

    Can you add skip-slave-start on your config, restart MySQL and execute change master to then start slave. See if it works?
    Our documentation has a lot of answers about common questions on Percona software, have you checked there before posting that question here? http://www.percona.com/forums/core/i...lies/smile.png

    Join us at the annual Percona Live MySQL Users Conference - http://www.percona.com/live/mysql-conference-2014/

    Comment


    • #3
      Hi,

      no, unfortunately this doesn't work, the Exec_Master_Log_Pos stands still and the Relay_Log increases + the Seconds Behind Master increases all the time. I tried also a reset slave all; and change master to the last position as well as a fresh innobackupex copy but the result is still the same.

      Additionally i want to add that we have round about 200 different databases at the master host, we don't restrict the database replication with do-db / ignore-db settings, etc.. Could the amount of databases produce this problem?

      Best,

      Sebastian

      Comment


      • #4
        Sebastian,

        Is it possible to post a SHOW ENGINE INNODB STATUS and SHOW PROCESSLIST here when it hangs?

        If you have the debuginfo package and GDB installed, a backtrace can also help, here is how you can get it:

        gdb --batch --quiet -ex 'set pagination off' -ex 'thread apply all bt full' -ex 'quit' -p $(pidof mysqld) | tee /root/backtrace.out
        Our documentation has a lot of answers about common questions on Percona software, have you checked there before posting that question here? http://www.percona.com/forums/core/i...lies/smile.png

        Join us at the annual Percona Live MySQL Users Conference - http://www.percona.com/live/mysql-conference-2014/

        Comment


        • #5
          Hi Revin,

          it seems that after the change the slave is running but very slow, show slave status moves forward even for the Exec_Master_Log_Pos and Relay_Master_Log_File but far away from beeing 0:

          Master_Log_File: master_1014.001441
          Read_Master_Log_Pos: 91679029
          Relay_Log_File: master5-relay-bin.000052
          Relay_Log_Pos: 18076535
          Relay_Master_Log_File: master_1014.001373
          Exec_Master_Log_Pos: 18076387
          Relay_Log_Space: 7537056782
          Seconds_Behind_Master: 49257

          The CPU load is still at 100%.

          Regarding your request, SHOW PROCESSLIST:
          SHOW PROCESSLIST:
          mysql> show full processlist;
          +-----+-------------+-----------+------+---------+-------+-- --------------------------------+-----------------------+--- --------+---------------+-----------+
          | Id | User | Host | db | Command | Time | State | Info | Rows_sent | Rows_examined | Rows_read |
          +-----+-------------+-----------+------+---------+-------+-- --------------------------------+-----------------------+--- --------+---------------+-----------+
          | 2 | system user | | NULL | Connect | 46001 | Waiting for master to send event | NULL | 0 | 0 | 1 |
          | 3 | system user | | NULL | Connect | 50256 | Reading event from the relay log | NULL | 0 | 0 | 1 |
          | 613 | root | localhost | NULL | Query | 0 | NULL | show full processlist | 0 | 0 | 1 |
          +-----+-------------+-----------+------+---------+-------+-- --------------------------------+-----------------------+--- --------+---------------+-----------+
          3 rows in set (0.00 sec)


          SHOW ENGINE INNODB STATUS:
          =====================================
          121017 8:41:25 INNODB MONITOR OUTPUT
          =====================================
          Per second averages calculated from the last 10 seconds
          -----------------
          BACKGROUND THREAD
          -----------------
          srv_master_thread loops: 96270 1_second, 96256 sleeps, 868 10_second, 21549 background, 21549 flush
          srv_master_thread log flush and writes: 22529
          ----------
          SEMAPHORES
          ----------
          OS WAIT ARRAY INFO: reservation count 83479, signal count 315976
          Mutex spin waits 251429, rounds 1472546, OS waits 45871
          RW-shared spins 145298, rounds 1667835, OS waits 34246
          RW-excl spins 19071, rounds 498338, OS waits 3151
          Spin rounds per wait: 5.86 mutex, 11.48 RW-shared, 26.13 RW-excl
          --------
          FILE I/O
          --------
          I/O thread 0 state: waiting for completed aio requests (insert buffer thread)
          I/O thread 1 state: waiting for completed aio requests (log thread)
          I/O thread 2 state: waiting for completed aio requests (read thread)
          I/O thread 3 state: waiting for completed aio requests (read thread)
          I/O thread 4 state: waiting for completed aio requests (read thread)
          I/O thread 5 state: waiting for completed aio requests (read thread)
          I/O thread 6 state: waiting for completed aio requests (write thread)
          I/O thread 7 state: waiting for completed aio requests (write thread)
          I/O thread 8 state: waiting for completed aio requests (write thread)
          I/O thread 9 state: waiting for completed aio requests (write thread)
          Pending normal aio reads: 0 [0, 0, 0, 0] , aio writes: 0 [0, 0, 0, 0] ,
          ibuf aio reads: 0, log i/o's: 0, sync i/o's: 0
          Pending flushes (fsync) log: 0; buffer pool: 0
          66388 OS file reads, 670941 OS file writes, 100620 OS fsyncs
          0.00 reads/s, 0 avg bytes/read, 3.40 writes/s, 1.00 fsyncs/s
          -------------------------------------
          INSERT BUFFER AND ADAPTIVE HASH INDEX
          -------------------------------------
          Ibuf: size 1, free list len 17, seg size 19, 4315 merges
          merged operations:
          insert 5162, delete mark 203, delete 10
          discarded operations:
          insert 0, delete mark 0, delete 0
          Hash table size 17700827, node heap has 838 buffer(s)
          770660.73 hash searches/s, 45843.42 non-hash searches/s
          ---
          LOG
          ---
          Log sequence number 200025236651
          Log flushed up to 200025233218
          Last checkpoint at 200025233218
          Max checkpoint age 1738042860
          Checkpoint age target 1683729021
          Modified age 3433
          Checkpoint age 3433
          0 pending log writes, 0 pending chkp writes
          38229 log i/o's done, 0.40 log i/o's/second
          ----------------------
          BUFFER POOL AND MEMORY
          ----------------------
          Total memory allocated 8824815616; in additional pool allocated 0
          Internal hash tables (constant factor + variable factor)
          Adaptive hash index 155340624 (141606616 + 13734008)
          Page hash 8851208 (buffer pool 0 only)
          Dictionary cache 43545566 (35403184 + 8142382)
          File system 12822872 (82672 + 12740200)
          Lock system 21724472 (21250568 + 473904)
          Recovery system 0 (0 + 0)
          Dictionary memory allocated 8142382
          Buffer pool size 524287
          Buffer pool size, bytes 8589918208
          Free buffers 393954
          Database pages 129467
          Old database pages 47773
          Modified db pages 12
          Pending reads 0
          Pending writes: LRU 0, flush list 0, single page 0
          Pages made young 150, not young 0
          0.00 youngs/s, 0.00 non-youngs/s
          Pages read 65133, created 64334, written 609060
          0.00 reads/s, 0.00 creates/s, 2.80 writes/s
          Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
          Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
          LRU len: 129467, unzip_LRU len: 0
          I/O sum[0]:cur[0], unzip sum[0]:cur[0]
          --------------
          ROW OPERATIONS
          --------------
          0 queries inside InnoDB, 0 queries in queue
          1 read views open inside InnoDB
          ---OLDEST VIEW---
          Normal read view
          Read view low limit trx n 1E9B023
          Read view up limit trx id 1E9B022
          Read view low limit trx id 1E9B023
          Read view individually stored trx ids:
          Read view trx id 1E9B022
          -----------------
          Main thread process no. 30888, id 140470070490880, state: sleeping
          Number of rows inserted 1525133, updated 400036, deleted 832274, read 19326277557
          0.00 inserts/s, 0.00 updates/s, 8.70 deletes/s, 385479.45 reads/s
          ------------
          TRANSACTIONS
          ------------
          Trx id counter 1E9B023
          Purge done for trx's n < 1E9B022 undo n < 0
          History list length 826
          LIST OF TRANSACTIONS FOR EACH SESSION:
          ---TRANSACTION 0, not started
          MySQL thread id 613, OS thread handle 0x7fc4e9da5700, query id 1409762 localhost root
          show engine innodb status
          ---TRANSACTION 1E9B022, ACTIVE 551 sec fetching rows
          mysql tables in use 1, locked 1
          5142 lock struct(s), heap size 473528, 107040 row lock(s), undo log entries 9043
          MySQL thread id 3, OS thread handle 0x7fc4e9ca3700, query id 1409742 Reading event from the relay log
          TABLE LOCK table `main`.`log_analyzer_parsed` trx id 1E9B022 lock mode IX
          RECORD LOCKS space id 29270 page no 5566 n bits 936 index `PRODUCT_ID` of table `master`.`log_analyzer_parsed` trx id 1E9B022 lock_mode X locks rec but not gap
          RECORD LOCKS space id 29270 page no 560 n bits 96 index `GEN_CLUST_INDEX` of table `master`.`log_analyzer_parsed` trx id 1E9B022 lock_mode X locks rec but not gap
          RECORD LOCKS space id 29270 page no 5535 n bits 776 index `PRODUCT_ID` of table `master`.`log_analyzer_parsed` trx id 1E9B022 lock_mode X locks rec but not gap
          RECORD LOCKS space id 29270 page no 698 n bits 96 index `GEN_CLUST_INDEX` of table `master`.`log_analyzer_parsed` trx id 1E9B022 lock_mode X locks rec but not gap
          RECORD LOCKS space id 29270 page no 723 n bits 96 index `GEN_CLUST_INDEX` of table `master`.`log_analyzer_parsed` trx id 1E9B022 lock_mode X locks rec but not gap
          RECORD LOCKS space id 29270 page no 825 n bits 96 index `GEN_CLUST_INDEX` of table `master`.`log_analyzer_parsed` trx id 1E9B022 lock_mode X locks rec but not gap
          RECORD LOCKS space id 29270 page no 916 n bits 96 index `GEN_CLUST_INDEX` of table `master`.`log_analyzer_parsed` trx id 1E9B022 lock_mode X locks rec but not gap
          RECORD LOCKS space id 29270 page no 1023 n bits 96 index `GEN_CLUST_INDEX` of table `master`.`log_analyzer_parsed` trx id 1E9B022 lock_mode X locks rec but not gap
          RECORD LOCKS space id 29270 page no 1164 n bits 96 index `GEN_CLUST_INDEX` of table `master`.`log_analyzer_parsed` trx id 1E9B022 lock_mode X locks rec but not gap
          TOO MANY LOCKS PRINTED FOR THIS TRX: SUPPRESSING FURTHER PRINTS
          ----------------------------
          END OF INNODB MONITOR OUTPUT
          ============================


          I'll add the GDB things asap.

          Best,

          Sebastian

          Comment


          • #6
            ---TRANSACTION 1E9B022, ACTIVE 551 sec fetching rows
            mysql tables in use 1, locked 1
            5142 lock struct(s), heap size 473528, 107040 row lock(s), undo log entries 9043
            MySQL thread id 3, OS thread handle 0x7fc4e9ca3700, query id 1409742 Reading event from the relay log

            There you have it a long running event thus the exec_master_log_pos appears to be stuck. Thread id 3 is the sql thread in the processlist. You should let this finish being row based and if the event is quite large it can take some time. You can also check the relay log or binlog from the master with mysqlbinlog to see what events these are.
            Our documentation has a lot of answers about common questions on Percona software, have you checked there before posting that question here? http://www.percona.com/forums/core/i...lies/smile.png

            Join us at the annual Percona Live MySQL Users Conference - http://www.percona.com/live/mysql-conference-2014/

            Comment


            • #7
              revin wrote on Tue, 16 October 2012 22:19
              Sebastian,

              Is it possible to post a SHOW ENGINE INNODB STATUS and SHOW PROCESSLIST here when it hangs?

              If you have the debuginfo package and GDB installed, a backtrace can also help, here is how you can get it:

              gdb --batch --quiet -ex 'set pagination off' -ex 'thread apply all bt full' -ex 'quit' -p $(pidof mysqld) | tee /root/backtrace.out

              Comment


              • #8
                Revin,

                we checked the things and found the corresponding table which is usually updated from a lot of different threads at the master where it is no problem. The question is now, how can we solve the huge delay during the replication? The hardware setup is exactly the same as the master server and the master is working really fine and fast and as we thought that the slave, especially in RAW mode, should be much faster to handle updates.

                Best,

                Sebastian

                Comment


                • #9
                  Sebastian,

                  So has the long running event I previously pointed out finished and now the slave is just lagging behind? You can try setting sync_binlog = 0 on the slave to see if this helps to reduce IO per event committed.
                  Our documentation has a lot of answers about common questions on Percona software, have you checked there before posting that question here? http://www.percona.com/forums/core/i...lies/smile.png

                  Join us at the annual Percona Live MySQL Users Conference - http://www.percona.com/live/mysql-conference-2014/

                  Comment


                  • #10
                    This problem illustrates a big downside to native MySQL replication; the fact that it is single threaded, and that events must be processed one by one. So while your master may be doing fine running the work load using many threads, your slave is having trouble keeping up since all of the events become serialized into one thread.

                    If this behavior is common with your workload, then you may want to look into a third party replication solution like Tungsten Replicator, which helps in parallelizing event processing on the slave.

                    Comment


                    • #11
                      In our case the long running 'Reading event from the relay log' is locking and locking rows on a table that we know eventually gets dropped. As opposed to waiting a few days for it to finish locking all rows, can it be killed? When I 'stop slave sql_thread' and then start it, the transaction starts the locks from 1 again... Would it be safe to set-global-sql-slave-skip-counter one ahead and then start?

                      Comment


                      • #12
                        Increasing the sql_slave_skip_counter variable is fine as far as the system is concerned; the real question is whether or not it is okay for your slave database to be out of sync with your master. If that is acceptable to you, then what you suggest will solve your current issue (but may cause issues down the road due to data differences).

                        Ultimately you would want to look into the query that is doing the locking, and see if there are any optimizations / tuning that can be done to prevent the delay, such as indexing on the table in question.

                        Comment

                        Working...
                        X