Causes and Workarounds for Slave Performance Too Slow with Row-Based Events

Recently I worked on a customer issue that I would describe as "slave performance too slow". During a quick analysis, I found that the replication slave SQL thread could not keep up while processing row-based events from the master's binary log.

For example:
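
  -- Illustrative only: the relevant fields of SHOW SLAVE STATUS on the
  -- replica. Both replication threads are running, yet the lag keeps growing.
  SHOW SLAVE STATUS\G
  --          Slave_IO_Running: Yes
  --         Slave_SQL_Running: Yes
  --     Seconds_Behind_Master: 3782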

While this is happening, the processlist state for the SQL thread is typically Reading event from the relay log or System lock, though other states are possible as well. In my case:
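
  -- Illustrative only (columns trimmed): the SQL thread runs as "system user"
  -- and stays in the "Reading event from the relay log" state.
  SHOW PROCESSLIST;
  -- | Id | User        | Command | Time | State                            |
  -- |  2 | system user | Connect | 3782 | Reading event from the relay log |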

What causes that?

Let's take a look at what could cause such behavior and what we need to pay attention to. When the SQL thread applies a change from a row-based event, it has to locate the exact row that was updated. With a primary key, this is trivial: only one row can match a given primary key value.

However, if there is no primary key on the table on the replication slave side, the SQL thread must search the entire table to locate the row to update or delete, and it repeats that search for each updated row. This search is both resource intensive (CPU usage can reach 100%) and slow, causing the slave to fall behind.
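
To make this concrete, here is a hypothetical statement (the table and column names are made up for illustration) that hits this code path when the table has no primary or unique key:

  -- On the master: one statement modifies many rows in a table without
  -- a primary (or unique) key.
  UPDATE orders_archive
     SET status = 'archived'
   WHERE created < '2018-01-01';

  -- With row-based replication, the slave receives one row image per
  -- modified row and, lacking a usable index, scans the whole table to
  -- find each of those rows before applying the change.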

For InnoDB tables without a primary key, the "hidden" key used for the clustered index cannot be used to avoid searching the entire table for the rows to update or delete. Keep in mind that this "hidden" key is unique only within each MySQL instance, so the replication master and replication slave generally do not have the same "hidden" key value for the same row.

What can we do to solve that?

The best solution is to ensure that all tables have a primary key. This not only ensures that the SQL thread can easily locate rows to update or delete, but it is also considered a best practice since it guarantees that all rows are unique.

If there is no way to add a natural primary key to the table, a potential solution is to add an auto-increment unsigned integer column as the primary key.
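
As a minimal sketch, assuming a hypothetical table mydb.mytable, such a surrogate key could be added like this (on a large table the rebuild can take a while, so an online schema change tool may be preferable):

  -- Adds a surrogate auto-increment primary key as the first column.
  ALTER TABLE mydb.mytable
    ADD COLUMN id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST;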

The query below helps you to locate tables without a primary key:
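
  -- One possible formulation: list base tables (excluding the system
  -- schemas) that have no PRIMARY KEY constraint.
  SELECT t.TABLE_SCHEMA, t.TABLE_NAME
    FROM information_schema.TABLES t
    LEFT JOIN information_schema.TABLE_CONSTRAINTS c
           ON c.TABLE_SCHEMA    = t.TABLE_SCHEMA
          AND c.TABLE_NAME      = t.TABLE_NAME
          AND c.CONSTRAINT_TYPE = 'PRIMARY KEY'
   WHERE t.TABLE_TYPE = 'BASE TABLE'
     AND t.TABLE_SCHEMA NOT IN ('mysql', 'information_schema',
                                'performance_schema', 'sys')
     AND c.CONSTRAINT_NAME IS NULL;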

Please note that InnoDB always requires a unique NOT NULL key for every table to use as the clustered index; if none is defined explicitly, a hidden one is created. So adding an explicit "dummy" column as suggested above will not add to the overall storage requirements, as it merely replaces the hidden key.

It is not always possible to add a primary key to the table immediately: for example, the table may have many relations on the application side or belong to a legacy system, resources may be lacking, or the application's behavior after the change may be unknown and require testing.

In this case, a short-term solution is to change the search algorithm used by the replication slave to locate the rows changed by row-based events.

The search algorithm is set using the slave_rows_search_algorithms option, which is available in MySQL 5.6 and later. The default value is to use an index scan if possible, otherwise a table scan.

https://dev.mysql.com/doc/refman/5.7/en/replication-options-slave.html#option_mysqld_slave-rows-search-algorithms

However, for tables without a primary key, a hash scan can be used instead; this causes the SQL thread to temporarily cache hashes to reduce the overhead of searching the whole table. The value of slave_rows_search_algorithms can be changed dynamically using:
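
  -- The option accepts any combination of TABLE_SCAN, INDEX_SCAN and
  -- HASH_SCAN; changing it globally requires a privileged account.
  SET GLOBAL slave_rows_search_algorithms = 'INDEX_SCAN,HASH_SCAN';

  -- Verify the current setting:
  SHOW GLOBAL VARIABLES LIKE 'slave_rows_search_algorithms';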


Just to note: INDEX_SCAN,HASH_SCAN is the default value in MySQL 8.0.

One thing to be aware of when using hash scans is that the hashes are only reused within one row-based event. (Each row-based event may contain changes to several rows in the same table, originating from the same SQL statement.)

The binlog_row_event_max_size option on the replication master controls the maximum size of a row-based event. The default max event size is 8kB. This means that switching to hash scans only improves the performance of the SQL thread when:

  1. Several rows fit into one row-based event. It may help to increase the value of binlog_row_event_max_size on the replication master if you perform updates or deletes on large rows (e.g., with blob or text data). Note that binlog_row_event_max_size can only be set in the MySQL configuration file, and changing this value requires a restart (see the configuration sketch below).
  2. One statement changes several rows.

https://dev.mysql.com/doc/refman/5.7/en/replication-options-binary-log.html#option_mysqld_binlog-row-event-max-size
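
As a sketch, the change on the replication master could look like the following; 32768 bytes is only an illustrative value, not a recommendation, and the server must be restarted for the new value to take effect:

  [mysqld]
  # Maximum size of a row-based binlog event, in bytes (the default is 8192).
  binlog_row_event_max_size = 32768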

Conclusion

Even if enabling hash scans improves performance enough for the replication slave to keep up, the permanent solution is to add an explicit primary key to each table. This should be a general rule of thumb in schema design, as it avoids or minimizes many issues, including the slow slave performance described in this post.

Next, I am going to investigate how we can find out the exact thread state using Performance Schema in order to make issue identification less of a guessing game.

Comments (2)

  • Alexander Pachev

    I have found it easier to find problem threads by attaching to MySQL with gdb and backtracing all threads (the poor man's profiler method). Then you get to see what the mysqld process is actually doing. It is particularly helpful in the case of a hard deadlock when mysqld is unresponsive to commands or even to connection attempts. By examining the registers you can figure out who is holding the mutex that some thread is trying to get and is stuck on, or quickly see that writes cannot progress because the disk is full, or see that the process is stuck in I/O that never terminates, etc.

    May 3, 2018 at 7:44 pm
  • YK

    We had a similar problem with a third-party tool that had a lot of tables without a primary key.
    We worked around it by adding an arbitrary index on one of the table columns.

    May 5, 2018 at 5:32 pm
