How to identify and cure MySQL replication slave lag

May 2, 2014

Author

Muhammad Irfan

Insight for DBAs

MySQL

Here on the Percona MySQL Support team, we often see issues where a customer is complaining about replication delays, and many times the problem ends up being tied to MySQL replication slave lag. This, of course, is nothing new for MySQL users, and we’ve had a few posts on this topic here on the MySQL Performance Blog over the years (a particularly popular one was “Managing Slave Lag with MySQL Replication” by Percona CEO Peter Zaitsev).

In today’s post, however, I will share some new ways of identifying delays in replication – including possible causes of lagging slaves – and how to cure this problem.

How to identify Replication Delay
MySQL replication works with two threads, IO_THREAD and SQL_THREAD. The IO_THREAD connects to the master, reads binary log events from the master as they come in, and copies them to a local log file called the relay log. The SQL_THREAD reads events from the relay log stored locally on the replication slave (the file written by the IO thread) and applies them as fast as possible. Whenever replication lags, it’s important to first discover whether the delay is in the slave IO_THREAD or the slave SQL_THREAD.

Normally, the I/O thread does not cause a huge replication delay, as it only reads the binary logs from the master. However, this depends on the network connectivity and latency between the servers, and the slave I/O thread can be slow because of high bandwidth usage. Usually, when the slave IO_THREAD is able to read binary logs quickly enough, it copies them over and relay logs pile up on the slave, which is one indication that the slave IO_THREAD is not the culprit of slave lag.

On the other hand, when the slave SQL_THREAD is the source of replication delays, it is probably because queries coming from the replication stream are taking too long to execute on the slave. This is sometimes due to different hardware between master and slave, different schema indexes, or a different workload. Moreover, the slave OLTP workload sometimes causes replication delays because of locking: for instance, a long-running read against a MyISAM table blocks the SQL thread, or a transaction against an InnoDB table creates an IX lock and blocks DDL in the SQL thread. Also, take into account that the slave is single threaded prior to MySQL 5.6, which is another reason for delays on the slave SQL_THREAD.

MySQL replication slave lag examples

Let me show you, via a master status/slave status example, how to identify whether the slave is lagging on the slave IO_THREAD or the slave SQL_THREAD.

mysql-master> SHOW MASTER STATUS;
+------------------+----------+--------------+------------------+------------------------------------------------+
| File             | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                              |
+------------------+----------+--------------+------------------+------------------------------------------------+
| mysql-bin.018196 | 15818564 |              |                  | bb11b389-d2a7-11e3-b82b-5cf3fcfc8f58:1-2331947 |
+------------------+----------+--------------+------------------+------------------------------------------------+

mysql-slave> SHOW SLAVE STATUS\G
*************************** 1. row ***************************
Slave_IO_State: Queueing master event to the relay log
Master_Host: master.example.com
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql-bin.018192
Read_Master_Log_Pos: 10050480
Relay_Log_File: mysql-relay-bin.001796
Relay_Log_Pos: 157090
Relay_Master_Log_File: mysql-bin.018192
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB: 
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 5395871
Relay_Log_Space: 10056139
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 230775
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 2
Master_UUID: bb11b389-d2a7-11e3-b82b-5cf3fcfc8f58:2-973166
Master_Info_File: /var/lib/mysql/i1/data/master.info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Reading event from the relay log
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set: bb11b389-d2a7-11e3-b82b-5cf3fcfc8f58:2-973166
Executed_Gtid_Set: bb11b389-d2a7-11e3-b82b-5cf3fcfc8f58:2-973166,
ea75c885-c2c5-11e3-b8ee-5cf3fcfc9640:1-1370
Auto_Position: 1

This clearly suggests that the slave IO_THREAD is lagging, and because of that the slave SQL_THREAD is lagging too, which yields replication delays. As you can see, the master is writing to mysql-bin.018196 (File from master status), while the slave IO_THREAD is still reading from mysql-bin.018192 (Master_Log_File from slave status), so the slave IO_THREAD is behind by 4 binlogs. Meanwhile, the slave SQL_THREAD is executing events from the same file, mysql-bin.018192 (Relay_Master_Log_File from slave status). This indicates that the slave SQL_THREAD is keeping up with the file the IO_THREAD is fetching, but it is still behind within that file, which can be observed from the difference between Read_Master_Log_Pos and Exec_Master_Log_Pos in the show slave status output.

In general, you can calculate slave SQL_THREAD lag as Read_Master_Log_Pos - Exec_Master_Log_Pos, as long as the Master_Log_File and Relay_Master_Log_File values from show slave status are the same. This will give you a rough idea of how fast the slave SQL_THREAD is applying events. As I mentioned above, when the slave IO_THREAD is lagging, as in this example, then of course the slave SQL_THREAD is behind too. A detailed description of the show slave status output fields is available in the MySQL Reference Manual.
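The two calculations described above can be sketched in shell. This is only an illustration under stated assumptions: the helper names status_field, io_lag_files and sql_lag_bytes are hypothetical (they are not part of MySQL or Percona Toolkit), and the sample values are copied from the example output earlier in this post.

```shell
# Hypothetical helpers for illustration; not part of MySQL or Percona Toolkit.

# Print the value of one field from "SHOW SLAVE STATUS\G" text read on stdin.
status_field() {
    awk -v key="$1:" '$1 == key { print $2 }'
}

# IO thread lag in binlog files: number of the master's current file minus
# the number of the file the IO thread is reading. Leading zeros are stripped
# so that a suffix such as 018196 is always treated as a decimal number.
io_lag_files() {
    awk -v m="${1##*.}" -v s="${2##*.}" \
        'BEGIN { sub(/^0+/, "", m); sub(/^0+/, "", s); print m - s }'
}

# SQL thread lag in bytes behind the IO thread. Prints "n/a" when the two
# threads are on different master binlogs, where positions are not comparable.
sql_lag_bytes() {
    io_file=$(printf '%s\n' "$1" | status_field Master_Log_File)
    sql_file=$(printf '%s\n' "$1" | status_field Relay_Master_Log_File)
    read_pos=$(printf '%s\n' "$1" | status_field Read_Master_Log_Pos)
    exec_pos=$(printf '%s\n' "$1" | status_field Exec_Master_Log_Pos)
    if [ "$io_file" = "$sql_file" ]; then
        echo $(( read_pos - exec_pos ))
    else
        echo "n/a"
    fi
}

# Values taken from the example slave status output in this post.
status='Master_Log_File: mysql-bin.018192
Read_Master_Log_Pos: 10050480
Relay_Master_Log_File: mysql-bin.018192
Exec_Master_Log_Pos: 5395871'

# IO thread: master is on mysql-bin.018196, slave reads mysql-bin.018192.
io_lag_files mysql-bin.018196 "$(printf '%s\n' "$status" | status_field Master_Log_File)"   # prints 4
# SQL thread: 10050480 - 5395871
sql_lag_bytes "$status"   # prints 4654609
```

On a live slave you would feed the real output instead of the sample text, for example by capturing it with mysql -e "SHOW SLAVE STATUS\G" into a variable first.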

Also, the Seconds_Behind_Master parameter shows a huge delay in seconds. However, this can be misleading, because it only measures the difference between the timestamp of the relay log event most recently executed and that of the relay log entry most recently downloaded by the IO_THREAD. If there are more binlogs on the master, the slave doesn’t figure them into the calculation of Seconds_Behind_Master. You can get a more accurate measure of slave lag using pt-heartbeat from Percona Toolkit. So, we learned how to check replication delays and whether they come from the slave IO_THREAD or the slave SQL_THREAD. Now, let me provide some tips and suggestions on what exactly is causing this delay.

Tips and Suggestions: What Causes Replication Delay and Possible Fixes
Usually, the slave IO_THREAD is behind because of a slow network between master and slave. Most of the time, enabling slave_compressed_protocol helps to mitigate slave IO_THREAD lag. Another suggestion is to disable binary logging on the slave, as it’s IO intensive too, unless you require it for point-in-time recovery.

To minimize slave SQL_THREAD lag, focus on query optimization. My recommendation is to enable the configuration option log_slow_slave_statements so that queries executed by the slave that take longer than long_query_time are logged to the slow log. To gather more information about query performance, I would also recommend setting the configuration option log_slow_verbosity to “full”.

This way we can see if there are queries executed by the slave SQL_THREAD that are taking a long time to complete. My previous post explains how to enable the slow query log for a specific time period with the options mentioned above. As a reminder, log_slow_slave_statements was first introduced as a variable in Percona Server 5.1 and is part of vanilla MySQL from version 5.6.11; before that, upstream MySQL Server offered log_slow_slave_statements only as a command line option. log_slow_verbosity is a Percona Server specific feature.
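As a rough sketch of the settings discussed above, the hypothetical helper below (replica_slow_log_sql is an illustrative name, not an existing tool) prints the statements so they can be reviewed before being piped to the mysql client. The long_query_time value is an assumption to tune for your workload, and the log_slow_verbosity line is only emitted on request because it exists in Percona Server only. The new values may not apply to the SQL thread until the slave threads are restarted.

```shell
# Hypothetical helper: print the statements instead of running them,
# so they can be reviewed first.
#   $1 = long_query_time in seconds (assumed value, tune for your workload)
#   $2 = "percona" to also emit the Percona Server-only log_slow_verbosity
replica_slow_log_sql() {
    cat <<SQL
SET GLOBAL slow_query_log = ON;
SET GLOBAL long_query_time = $1;
SET GLOBAL log_slow_slave_statements = ON;
SQL
    if [ "${2:-}" = "percona" ]; then
        echo "SET GLOBAL log_slow_verbosity = 'full';"
    fi
}

# Review the generated statements for a 1 second threshold on Percona Server.
replica_slow_log_sql 1 percona
```

To apply them on the slave you could run replica_slow_log_sql 1 percona | mysql, assuming the client credentials are already configured.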

Another reason for delay on the slave SQL_THREAD, if you use the row-based binlog format, is a table without a primary key or unique key: for DML on such a table, the slave scans all rows of the table, which causes replication delays. So make sure all your tables have a primary key or unique key. Check this bug report for details: http://bugs.mysql.com/bug.php?id=53375 You can use the query below on the slave to identify which tables are missing a primary or unique key.

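mysql> SELECT t.table_schema, t.table_name, t.engine
FROM information_schema.tables t INNER JOIN information_schema.columns c
ON t.table_schema = c.table_schema AND t.table_name = c.table_name
-- skip views and system schemas, which would otherwise show up in the result
WHERE t.table_type = 'BASE TABLE'
AND t.table_schema NOT IN ('mysql', 'information_schema', 'performance_schema')
GROUP BY t.table_schema, t.table_name, t.engine
-- keep only tables where no column is part of a primary or unique key
HAVING SUM(IF(c.column_key IN ('PRI','UNI'), 1, 0)) = 0;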

MySQL 5.6 brings one improvement for this case: slave_rows_search_algorithms comes to the rescue by letting the slave use an in-memory hash to find the rows.

Note that Seconds_Behind_Master is not updated while a huge RBR event is being read, so “lagging” may be related to just that: reading of the event has not completed yet. For example, in row-based replication, huge transactions may cause delay on the slave side. If you have a 10 million row table and you run “DELETE FROM table WHERE id < 5000000”, 5M rows will be sent to the slave, each row separately, which will be painfully slow. So, if you have to delete the oldest rows from a huge table from time to time, partitioning might be a good alternative for some workloads: instead of using DELETE, you DROP the old partition, and because that is a DDL operation, only the statement is replicated.

To explain it better, let’s suppose you have partition1 holding rows with IDs from 1 to 1000000, partition2 holding IDs from 1000001 to 2000000, and so on. Instead of deleting via the statement “DELETE FROM table WHERE ID<=1000000;” you can run “ALTER TABLE table DROP PARTITION partition1;”. For partition management operations, check the manual. Also check the wonderful post from my colleague Roman explaining possible grounds for replication delays.

pt-stalk is one of the finest tools in Percona Toolkit; it collects diagnostic data when problems occur. You can set up pt-stalk as follows so that whenever there is slave lag it logs diagnostic information, which can later be analyzed to see what exactly is causing the lag.

Here is how you can set up pt-stalk so that it captures diagnostic data when there is slave lag:

------- pt-plug.sh contents
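#!/bin/bash

# pt-stalk calls trg_plugin on every check and compares the printed number with --threshold.
trg_plugin() {
    # Check that mysqld answers before querying the replication status.
    mysqladmin $EXT_ARGV ping &> /dev/null
    mysqld_alive=$?

    if [[ $mysqld_alive == 0 ]]
    then
        # Print the current Seconds_Behind_Master value.
        seconds_behind_master=$(mysql $EXT_ARGV -e "show slave status" --vertical | grep Seconds_Behind_Master | awk '{print $2}')
        echo $seconds_behind_master
    else
        # mysqld is unreachable: print a small value so that no collection is triggered.
        echo 1
    fi
}
# Uncomment below to test that trg_plugin function works as expected
#trg_plugin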
-------

That's the pt-plug.sh file you need to create; then use it with pt-stalk as below:

$ /usr/bin/pt-stalk --function=/root/pt-plug.sh --variable=seconds_behind_master --threshold=300 --cycles=60 --notify-by-email=muhammad@example.com --log=/root/pt-stalk.log --pid=/root/pt-stalk.pid --daemonize

You can adjust the threshold, which is currently 300 seconds. Combined with --cycles, it means that if the seconds_behind_master value stays above 300 for 60 seconds or more, pt-stalk will start capturing data. Adding the --notify-by-email option sends a notification via email when pt-stalk captures data. Adjust the pt-stalk thresholds so that it triggers diagnostic data collection during the problem.
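To make the interplay of --threshold and --cycles more concrete, here is a simplified model of the trigger rule. It is not pt-stalk’s actual code: first_trigger is a hypothetical function, and it assumes that one sample is taken per check, that a sample counts only when it is strictly greater than the threshold, and that the counter resets as soon as a sample falls back to the threshold or below.

```shell
# Simplified model of the trigger rule; not pt-stalk's actual code.
# Reads one lag sample per line on stdin and prints the 1-based number of the
# sample at which collection would start, or "never".
#   $1 = threshold, $2 = cycles
first_trigger() {
    awk -v threshold="$1" -v cycles="$2" '
        # Count consecutive samples above the threshold; reset on any miss.
        { streak = ($1 + 0 > threshold) ? streak + 1 : 0 }
        streak >= cycles { print NR; found = 1; exit }
        END { if (!found) print "never" }
    '
}

# With threshold 300 and 3 cycles, the dip to 200 resets the counter,
# so collection starts only at the 7th sample.
printf '%s\n' 100 400 500 200 301 302 303 | first_trigger 300 3   # prints 7
```

Under the same assumptions, the command in this post (--threshold=300 --cycles=60) would need 60 consecutive samples above 300 before any data is captured.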

Conclusion
A lagging slave is a tricky problem but a common issue in MySQL replication. I’ve tried to cover most aspects of MySQL replication slave lag in this post. Please share in the comments section if you know of any other reasons for replication delay. Having trouble with replication in MySQL? Our webinar, “MySQL Replication Troubleshooting”, provides expert tips to help you identify and solve common replication errors.

15 Comments

Simon Mudd

12 years ago

No mention of performance_schema or mysql -sys? sys.schema_table_statistics will point to the tables with most latency, and they are probably those you should focus on. I keep meaning to write a blog post with some examples.

Doug

12 years ago

Excellent breakdown, Muhammad. Thank you.

Said Bakr

12 years ago

Importing a large MySQL dump of 152 MB and more than 3 million rows of 450 MB on the database is a nightmare on My Windows 7 64 bit computer with 4 GB RAM!

aftab

12 years ago

log_slow_slave_statements won’t help you log slow queries when using RBR.

Steve S

10 years ago

Your code for finding tables without primary keys has an extra space in “information_schema .columns”. Also you said to disable binary logging on slave as it’s IO intensive, but on a slave it doesn’t write to the binary log unless log-slave-updates is turned on.

Dmitry Balabka

10 years ago

Question according to binlog format. You did not specify multiple primary keys in query.
HAVING sum(if(column_key in ('PRI','UNI','MUL'), 1,0)) =0;
Did this inadvertently omitted or specially?

Ashwini

10 years ago

Hi,

Thanks for you post..
I want to get all sql queries should log during my slave lag.

Am using below command :

/usr/bin/pt-stalk --function=/root/pt-plug.sh --variable=seconds_behind_master --threshold=5 --cycles=7 --notify-by-email=[email protected] --log=/mnt/pt-stalk.log --collect --collect-tcpdump --pid=/root/pt-stalk.pid --daemonize

This variable does not takes values --collect --collect-tcpdump.
So where my sqldump will get log ? whats the default location for this logs ?

Thanks In Advance !!

abdullah

10 years ago

once enabling the log-slow-slave-statements it will log slow queries into slow-query-log file or slave server will create any log file for log-slow-slave-statements?.Thanks

Vinay

9 years ago

Hi All ,

I am facing replication lag even I’m using multi thread replication could you please help to reduce the replication lag . it seems issue with SQL thread because I am getting SQL state Waiting for dependent transaction to commit

If someone help then it would be great help .

Slave_IO_State: Waiting for master to send event
Master_Host: 10.140.2.69
Master_User: repl
Master_Port: 3330
Connect_Retry: 60
Master_Log_File: mysql-bin.000360
Read_Master_Log_Pos: 745241233
Relay_Log_File: relay-log.000166
Relay_Log_Pos: 587394954
Relay_Master_Log_File: mysql-bin.000326
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table: urapport_contact.%
Replicate_Wild_Ignore_Table: mysql.%,test.%,information_schema.%,performance_schema.%
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 587394741
Relay_Log_Space: 37252499844
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 534243
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 693330
Master_UUID: b753362d-abe6-11e6-8cbc-8cdcd4b0b20d
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Waiting for dependent transaction to commit
Master_Retry_Count: 86400

Shikha Mehta

9 years ago

Hi, i have a serious question that is hindering me to perform this automatic failover. I have deployed a master slave replication using gtid mode ON. And now i want test auto failover over my master and want to see how switchover performs but i am not able to do so because my slave is lagging behind of master i think as in show slave status it clearly shows.Due to this my mysqlfailover is failing as my slave status is not ‘OK’ as it is required.Can you please help me to get rid of this . I seriously need help.

Vinod T Veettil

8 years ago

I am facing an issue where I am running periodic mysqldump on the slave which is causing replication lag on slave. Can i avoid this kind of delay ?

Andy H

8 years ago

Reply to Vinod T Veettil

Use a different tool. If you are using mysqldump as a backup, use percona xtrabackup or MySQL Enterprise backup instead.

prabhakerdeodixit

8 years ago

Hi ,
is there any solution to reduce slave lag in RBR format(primary key is not there on tables) like mysql slave_rows_search_algorithms

Matt

8 years ago

I was running into massive lag in my Slave_SQL process for RBR, and it turned out to be because I had innodb_flush_log_at_trx_commit=1 on the slave. Setting this to innodb_flush_log_at_trx_commit=2 solved the problem, and I caught up in less than a minute.