Causes and Workarounds for Slave Performance Too Slow with Row-Based Events

Recently I worked on one customer issue that I would describe as “slave performance too slow”. During a quick analysis, I’ve found that the replication slave SQL thread cannot keep up while processing row-based events from the master’s binary log.

For example:

mysql> SHOW SLAVE STATUSG
*************************** 1. row ***************************
                          ...
              Master_Log_File: binlog.0000185
          Read_Master_Log_Pos: 86698585
                          ...
        Relay_Master_Log_File: binlog.0000185
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
                          ...
          Exec_Master_Log_Pos: 380
              Relay_Log_Space: 85699128
                          ...
                  Master_UUID: 98974e7f-2fbc-18e9-72cd-07003817585c
                          ...
           Retrieved_Gtid_Set: 98974e7f-2fbc-18e9-72cd-07003817585c:1055-1057
            Executed_Gtid_Set: 7f42e2c5-3fbc-16e7-7fb8-05003715789a:1-2,
98974e7f-2fbc-18e9-72cd-07003817585c:1-1056
                          ...

mysql> SHOW SLAVE STATUSG

*************************** 1. row ***************************

...

Master_Log_File: binlog.0000185

Read_Master_Log_Pos: 86698585

...

Relay_Master_Log_File: binlog.0000185

Slave_IO_Running: Yes

Slave_SQL_Running: Yes

...

Exec_Master_Log_Pos: 380

Relay_Log_Space: 85699128

...

Master_UUID: 98974e7f-2fbc-18e9-72cd-07003817585c

...

Retrieved_Gtid_Set: 98974e7f-2fbc-18e9-72cd-07003817585c:1055-1057

Executed_Gtid_Set: 7f42e2c5-3fbc-16e7-7fb8-05003715789a:1-2,

98974e7f-2fbc-18e9-72cd-07003817585c:1-1056

...

The processlist state for the SQL thread can be one of the following: Reading event from the relay log, or System lock, or potentially some other state. In my case:

mysql> SHOW PROCESSLIST;
+----+-----------------+-----------------+------+---------+------+----------------------------------+------------------+
| Id | User            | Host            | db   | Command | Time | State                            | Info             |
+----+-----------------+-----------------+------+---------+------+----------------------------------+------------------+
...
|  4 | system user     |                 | NULL | Connect |  268 | Reading event from the relay log | NULL             |
...
+----+-----------------+-----------------+------+---------+------+----------------------------------+------------------+

mysql> SHOW PROCESSLIST;

+----+-----------------+-----------------+------+---------+------+----------------------------------+------------------+

+----+-----------------+-----------------+------+---------+------+----------------------------------+------------------+

...

...

+----+-----------------+-----------------+------+---------+------+----------------------------------+------------------+

What causes that?

Let’s take a look what could potentially cause such behavior and what we need to pay attention to. When the SQL thread applies the change from a row-based event, it has to locate the exact row that was updated. With a primary key, this is trivial as only one row can possibly have the same value for the primary key.

However, if there is no primary key on the table on the replication slave side, the SQL thread must search the entire table to locate the row to update or delete. It repeats the search for each updated row. This search is both very resource usage intensive (CPU usage can be up to 100%) and slow causing the slave to fall behind.

For InnoDB tables, the “hidden” key used for the clustered index for tables without a primary key cannot be used to avoid searching the entire table for the rows to update or delete. We need to keep in mind that the “hidden” key is unique only to each MySQL instance, so the replication master and replication slave generally don’t have the same values for the “hidden” key for the same row.

What can we do to solve that?

The best solution is to ensure that all tables have a primary key. This not only ensures the SQL thread can easily locate rows to update or delete, but it is also considered as a best practice since it ensures all rows are unique.

If there is no way to logically add a natural primary key for the table, a potential solution is to add an auto-increment unsigned integer column as the primary key.

The query below helps you to locate tables without a primary key:

SELECT tables.table_schema, tables.table_name, tables.table_rows
      FROM information_schema.tables
      LEFT JOIN (
        SELECT table_schema, table_name
        FROM information_schema.statistics
        GROUP BY table_schema, table_name, index_name
        HAVING
          SUM(
            CASE WHEN non_unique = 0 AND nullable != 'YES' THEN 1 ELSE 0 END
          ) = COUNT(*)
      ) puks
      ON tables.table_schema = puks.table_schema AND tables.table_name = puks.table_name
      WHERE puks.table_name IS NULL
        AND tables.table_schema NOT IN ('mysql', 'information_schema', 'performance_schema', 'sys')
        AND tables.table_type = 'BASE TABLE' AND engine='InnoDB';

SELECT tables.table_schema, tables.table_name, tables.table_rows

FROM information_schema.tables

LEFT JOIN (

SELECT table_schema, table_name

FROM information_schema.statistics

GROUP BY table_schema, table_name, index_name

HAVING

SUM(

CASE WHEN non_unique = 0 AND nullable != 'YES' THEN 1 ELSE 0 END

) = COUNT(*)

) puks

ON tables.table_schema = puks.table_schema AND tables.table_name = puks.table_name

WHERE puks.table_name IS NULL

AND tables.table_schema NOT IN ('mysql', 'information_schema', 'performance_schema', 'sys')

AND tables.table_type = 'BASE TABLE' AND engine='InnoDB';

Please note that for InnoDB, there must always be a unique NOT NULL key for all tables. It is required for the clustered index. So adding an explicit “dummy” column as suggested above will not add to the overall storage requirements as it will merely replace the hidden key.

It’s not always possible to add a primary key to the table immediately if, for example, there are many relations on the application side/legacy system, lack of resources, unknown application behavior after the change which required testing, etc.

In this case, a short-term solution is to change the search algorithm used by the replication slave to locate the rows changed by row-based events.

The search algorithm is set using the slave_rows_search_algorithms option which is available in MySQL 5.6 and later. The default value is to use an index scan if possible, otherwise a table scan.

https://dev.mysql.com/doc/refman/5.7/en/replication-options-slave.html#option_mysqld_slave-rows-search-algorithms

However, for tables without a primary key using a hash scan, which causes the SQL thread to temporarily cache hashes to reduce the overhead of searching the whole table. The value of slave_rows_search_algorithms can be changed dynamically using:

mysql> SET GLOBAL slave_rows_search_algorithms = 'INDEX_SCAN,HASH_SCAN';

1	mysql> SET GLOBAL slave_rows_search_algorithms = 'INDEX_SCAN,HASH_SCAN';

Just to note INDEX_SCAN,HASH_SCAN is the default value in MySQL 8.0.

One thing to be aware of when using hash scans that the hashes are only reused within one row-based event. (Each row-based event may have changes to several rows in the same table originating from the same SQL statement).

The binlog_row_event_max_size option on the replication master controls the maximum size of a row-based event. The default max event size is 8kB. This means that switching to hash scans only improves the performance of the SQL thread when:

Several rows fit into one row based event. It may help to increase the value of binlog_row_event_max_size on the replication master, if you perform updates or deletes on large rows (e.g., with blob or text data). You can only set the binlog_row_event_max_size in the MySQL configuration file, and resetting this value requires a restart.
One statement changes several rows.

https://dev.mysql.com/doc/refman/5.7/en/replication-options-binary-log.html#option_mysqld_binlog-row-event-max-size

Conclusion

Even if enabling hash scans improves the performance enough for the replication slave to keep up, the permanent solution is to add an explicit primary key to each table. This should be the general rule of thumb in the schema design in order avoid and/or minimize many issues like slave performance too slow (as described in this post).

Next, I am going to investigate how we can find out the exact thread state using Performance Schema in order to make issue identification less of a guessing game.

2 Comments

Oldest

Newest Most Voted

Inline Feedbacks

View all comments

Alexander Pachev

7 years ago

I have found it easier to find problem threads by attaching to MySQL with gdb and backtracing all threads (poor man profiler method). Then you get to see what mysqld process is actually doing. It is particularly helpful in the case of a hard deadlock when mysqld is unresponsive to commands or even attempts to connect. By examining the registers you can figure out who is holding the mutex that some thread is trying to get and is stuck on, or quickly see that writes cannot progress because the disk is full, or see that the process is stuck in I/O that never terminates, etc.

7 years ago

We had a similar problem with third party tool that had a lot of tables without primary key.
We workaround it by adding arbitrary index on one of the table columns.

MySQL 5.7
Support

Compare Percona to Leading Database Solutions

Software
Downloads

Valkey Contribution

Product Documentation

Resource Hub

Why Percona for MongoDB?

Why Percona for PostgreSQL?

Percona Blog

Percona Community Hub

Percona Events Hub

About Percona

Percona in the News

Our Customers

Our Partners

Careers

Contact Us

Causes and Workarounds for Slave Performance Too Slow with Row-Based Events

What causes that?

What can we do to solve that?

Conclusion

Related Blog Articles

RECOMMENDED ARTICLES

Importance of Tuning Checkpoint in PostgreSQL

Rebuilding a Replica with MyDumper

Databases, Data Lakes, And Encryption

MOST POPULAR ARTICLES

Deploy Django on Kubernetes With Percona Operator for PostgreSQL

MySQL Performance Tuning: Maximizing Database Efficiency and Speed

The Ultimate Guide to Open Source Databases

MySQL 5.7 Support

Compare Percona to Leading Database Solutions

Software Downloads

Valkey Contribution

Product Documentation

Resource Hub

Why Percona for MongoDB?

Why Percona for PostgreSQL?

Percona Blog

Percona Community Hub

Percona Events Hub

About Percona

Percona in the News

Our Customers

Our Partners

Careers

Contact Us

Causes and Workarounds for Slave Performance Too Slow with Row-Based Events

What causes that?

What can we do to solve that?

Conclusion

Share This Post!

Stay up to date with the Percona Blog

Related Blog Articles

RECOMMENDED ARTICLES

Importance of Tuning Checkpoint in PostgreSQL

Rebuilding a Replica with MyDumper

Databases, Data Lakes, And Encryption

MOST POPULAR ARTICLES

Deploy Django on Kubernetes With Percona Operator for PostgreSQL

MySQL Performance Tuning: Maximizing Database Efficiency and Speed

The Ultimate Guide to Open Source Databases

MySQL 5.7
Support

Software
Downloads