High IO observed on one half of master-master setup


  • High IO observed on one half of master-master setup

    Hi All,

    We have two data centers, an XtraDB 5.1 server in each, with a master-master replication between them.

    Load is spread dynamically, sending an increased load to the datacenter which displays the lowest latency in the fronting web application. Load is high, but plenty of hardware has been thrown at it and general performance is good.

    One datacenter (let's call it DC1) receives the majority of the traffic -- averaging about 20% more than the other (DC2) due to the dynamic load splitting.

    However, I consistently observe a higher IO load on the DC2 server. This is mirrored in Innodb_buffer_pool_write_requests, which averages about 20% higher than on DC1. During absolute peak periods this is starting to cause issues: pages aren't flushed fast enough, they exceed their age limit, the aggressive flushing algorithm kicks in, and write IO gets a bit out of control.
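    (For anyone wanting to reproduce these measurements -- a minimal way to sample the counters involved; the variable names are standard MySQL/XtraDB status names, the rest is a sketch:)

    ```sql
    -- Sample the write-request counter and flushing pressure periodically:
    SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_write_requests';
    SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_dirty';
    -- Checkpoint age (how close dirty pages are to their age limit)
    -- is visible in the LOG section of:
    SHOW ENGINE INNODB STATUS\G
    ```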

    We have some options to speed up our IO (better fs configuration, some changed MySQL config, some application issues), which we are currently pursuing.

    However, I'd like to know: is this a normal pattern? Could replicated writes somehow cause more buffer pool writes than direct writes? Has anyone seen this symptom before, or have any ideas about it?

    Thanks in advance,

    Luke.

    Edit: Should add that this is fully statement-based replication.

  • #2
    Where are the slave relay logs placed on the machines?
    Are they placed on the same volume as the innodb table space?

    I can't say why you are seeing higher Innodb_buffer_pool_write_requests on the "slave".

    But the IO load could be higher on the "slave" than on the "master", since the "master" only has to:
    Write the data to the binary log file.
    Write the data to InnoDB.

    While the slave needs to:
    Write the data to the relay log file. (IO_THREAD fetches from master)
    Read the data from the relay log file. (SQL_THREAD reads statements)
    Write it to InnoDB. (SQL_THREAD executes statements)


    So in total you might get more IO happening on the "slave" than on the "master".
    But I don't know how that translates to the InnoDB writes, since I think they should be the same on the two servers.

    Hope you find an answer and can post it here for us to learn.
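    If the relay logs do share a volume with the InnoDB tablespace, they could be moved with a my.cnf fragment along these lines (the paths here are made up for illustration):

    ```ini
    # Hypothetical fragment: put the relay logs on their own volume,
    # away from the InnoDB tablespace and binary logs.
    [mysqld]
    relay-log       = /relay-volume/relay-bin
    relay-log-index = /relay-volume/relay-bin.index
    ```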



    • #3
      The binary and relay logs are (unfortunately) on the same filesystem. However, as the servers are multi-master, the difference between them shouldn't be as stark as what I'm observing. Also, the IO discrepancy is easily accounted for by the different Innodb_buffer_pool_write_requests -- which, as you said, should not differ between the two servers.

      I'm trying to organise a few hours of strict 50/50 load balancing, rather than the current dynamic setup. Hopefully that will show whether the difference is related to the frontend load or not.

      L.



      • #4
        Quick update. I started logging and graphing more metrics: COM_INSERT, COM_UPDATE, COM_DELETE, COM_SELECT, COM_COMMIT, COM_ROLLBACK.
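        (The sampling itself is just a status query -- something like this, run on each server at regular intervals, diffing successive samples to get per-interval rates; counter names are as reported by SHOW GLOBAL STATUS:)

        ```sql
        -- Snapshot the statement counters being graphed.
        SHOW GLOBAL STATUS WHERE Variable_name IN
          ('Com_insert','Com_update','Com_delete',
           'Com_select','Com_commit','Com_rollback');
        ```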

        It is now obvious that the discrepancy is caused by one server running many more DELETE statements than the other. All other counters are identical, except COM_COMMIT and COM_ROLLBACK, which both track the external load applied to each server.

        I still have no idea why these delete statements exist, but hopefully I'm nearly there. It seems that these extra delete statements are not in any binlog, anywhere: they are only showing up as an incremented COM_DELETE on one server but not the other. Plus a whole lot more I/O.

        We have no "binlog-do-db" or similar filtering, and I've checked that all constraints, triggers and procedures are identical between the servers.
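        (For the trigger/procedure comparison, one way to diff the two servers is to dump the definitions from information_schema on each and compare the output -- a sketch:)

        ```sql
        -- Run on both servers and diff the results.
        SELECT TRIGGER_SCHEMA, TRIGGER_NAME, EVENT_MANIPULATION, ACTION_STATEMENT
        FROM information_schema.TRIGGERS
        ORDER BY TRIGGER_SCHEMA, TRIGGER_NAME;
        SELECT ROUTINE_SCHEMA, ROUTINE_NAME, ROUTINE_DEFINITION
        FROM information_schema.ROUTINES
        ORDER BY ROUTINE_SCHEMA, ROUTINE_NAME;
        ```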

        Any ideas what kind of DELETE statements would not be binlogged?



        • #5
          Final update -- I found the issue.

          A data maintenance script had gone rogue, running with SQL_LOG_BIN=0 set in its session.

          So: no mysterious replication behaviour after all.
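          For anyone hitting the same thing, this is the mechanism involved (the table name and WHERE clause here are hypothetical):

          ```sql
          -- A session with the SUPER privilege can disable binary logging
          -- for itself, so its statements never replicate:
          SET SESSION sql_log_bin = 0;
          DELETE FROM maintenance_table WHERE expired = 1;  -- runs locally only
          SET SESSION sql_log_bin = 1;
          ```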
