Recently I worked on a customer issue that I would describe as “slave performance too slow”. A quick analysis showed that the replication slave’s SQL thread could not keep up while processing row-based events from the master’s binary log.
For example:
```
mysql> SHOW SLAVE STATUS\G
*************************** 1. row ***************************
                         ...
              Master_Log_File: binlog.0000185
          Read_Master_Log_Pos: 86698585
                         ...
        Relay_Master_Log_File: binlog.0000185
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
                         ...
          Exec_Master_Log_Pos: 380
              Relay_Log_Space: 85699128
                         ...
                  Master_UUID: 98974e7f-2fbc-18e9-72cd-07003817585c
                         ...
           Retrieved_Gtid_Set: 98974e7f-2fbc-18e9-72cd-07003817585c:1055-1057
            Executed_Gtid_Set: 7f42e2c5-3fbc-16e7-7fb8-05003715789a:1-2,
98974e7f-2fbc-18e9-72cd-07003817585c:1-1056
                         ...
```
The processlist state for the SQL thread can be Reading event from the relay log, System lock, or potentially some other state. In my case:
```
mysql> SHOW PROCESSLIST;
+----+-------------+------+------+---------+------+----------------------------------+------+
| Id | User        | Host | db   | Command | Time | State                            | Info |
+----+-------------+------+------+---------+------+----------------------------------+------+
...
|  4 | system user |      | NULL | Connect |  268 | Reading event from the relay log | NULL |
...
+----+-------------+------+------+---------+------+----------------------------------+------+
```
Let’s take a look at what can cause this behavior and what we need to pay attention to. When the SQL thread applies a change from a row-based event, it has to locate the exact row that was updated. With a primary key this is trivial: only one row can have a given primary key value.
However, if the table has no primary key on the replication slave side, the SQL thread must search the entire table to locate the row to update or delete, and it repeats that search for every changed row. This search is both resource-intensive (CPU usage can reach 100%) and slow, causing the slave to fall behind.
The best solution is to ensure that all tables have a primary key. This not only lets the SQL thread locate rows to update or delete efficiently, it is also considered a best practice since it guarantees that all rows are unique.
If there is no way to logically add a natural primary key to the table, a potential solution is to add an auto-increment unsigned integer column as the primary key.
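As a sketch, assuming a hypothetical table `mydb.orders` that lacks a primary key, such a surrogate key could be added like this (test on a copy first, since the ALTER rebuilds the table and may take a while on large tables):

```sql
-- Hypothetical table name; `id` is a surrogate key added purely so that
-- the slave's SQL thread can locate rows by a unique value.
ALTER TABLE mydb.orders
  ADD COLUMN id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  ADD PRIMARY KEY (id);
```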
The query below helps you to locate tables without a primary key:
```sql
SELECT tables.table_schema, tables.table_name, tables.table_rows
  FROM information_schema.tables
  LEFT JOIN (
         SELECT table_schema, table_name
           FROM information_schema.statistics
          GROUP BY table_schema, table_name, index_name
         HAVING SUM(
                  CASE WHEN non_unique = 0 AND nullable != 'YES' THEN 1 ELSE 0 END
                ) = COUNT(*)
       ) puks
    ON tables.table_schema = puks.table_schema
   AND tables.table_name = puks.table_name
 WHERE puks.table_name IS NULL
   AND tables.table_schema NOT IN ('mysql', 'information_schema', 'performance_schema', 'sys')
   AND tables.table_type = 'BASE TABLE'
   AND tables.engine = 'InnoDB';
```
Please note that InnoDB always requires a clustered index: if a table has neither a primary key nor a unique NOT NULL key, InnoDB creates a hidden clustered index internally. So adding an explicit “dummy” column as suggested above will not add to the overall storage requirements, as it merely replaces the hidden key.
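Assuming MySQL 5.6 or 5.7 (these data dictionary views were renamed in MySQL 8.0), the InnoDB system tables can serve as a cross-check, listing the tables that rely on the hidden clustered index:

```sql
-- Tables whose clustered index is the implicit GEN_CLUST_INDEX,
-- i.e. tables with neither a primary key nor a unique NOT NULL key.
SELECT t.name AS table_name
  FROM information_schema.INNODB_SYS_INDEXES i
  JOIN information_schema.INNODB_SYS_TABLES t
    ON i.table_id = t.table_id
 WHERE i.name = 'GEN_CLUST_INDEX';
```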
It’s not always possible to add a primary key to a table immediately: for example, the application or a legacy system may have many dependencies on the table, resources may be lacking, or the application’s behavior after the change may be unknown and require testing.
In this case, a short-term solution is to change the search algorithm used by the replication slave to locate the rows changed by row-based events.
The search algorithm is set using the slave_rows_search_algorithms option, which is available in MySQL 5.6 and later. The default is to use an index scan if possible, otherwise a table scan.
However, for tables without a primary key you can enable a hash scan, which makes the SQL thread temporarily cache hashes of the changed rows to reduce the overhead of searching the whole table. The value of slave_rows_search_algorithms can be changed dynamically:
```sql
mysql> SET GLOBAL slave_rows_search_algorithms = 'INDEX_SCAN,HASH_SCAN';
```
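A SET GLOBAL change does not survive a restart, so the current value is worth verifying and the setting worth persisting:

```sql
-- Confirm the SQL thread will now consider hash scans
mysql> SHOW GLOBAL VARIABLES LIKE 'slave_rows_search_algorithms';
```

To persist it across restarts, also add `slave_rows_search_algorithms = INDEX_SCAN,HASH_SCAN` under the `[mysqld]` section of the slave’s configuration file.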
One thing to be aware of when using hash scans is that the hashes are only reused within one row-based event. (Each row-based event may contain changes to several rows in the same table, originating from the same SQL statement.)
The binlog_row_event_max_size option on the replication master controls the maximum size of a row-based event; the default maximum event size is 8kB. This means that switching to hash scans only improves the performance of the SQL thread when a single event contains changes to multiple rows of the same table, that is, when statements change several rows at a time and those rows fit within the maximum event size.
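To see the event-size limit in effect on the master, run:

```sql
-- Default is 8192 bytes
mysql> SHOW GLOBAL VARIABLES LIKE 'binlog_row_event_max_size';
```

Note that unlike slave_rows_search_algorithms, this variable is not dynamic; changing it requires setting it in the master’s configuration file and restarting.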
Even if enabling hash scans improves performance enough for the replication slave to keep up, the permanent solution is to add an explicit primary key to each table. This should be the general rule of thumb in schema design, in order to avoid or at least minimize issues like the “slave performance too slow” problem described in this post.
Next, I am going to investigate how we can find out the exact thread state using Performance Schema in order to make issue identification less of a guessing game.