I receive many questions about errors being reported about pt-kill… and often those errors are not real. The main problem I have found is that pt-kill is hard to understand and the documentation doesn’t have many examples (I’ll add them), so, I decided to write a post about the most common usage of pt-kill: to kill the queries that match a regexp (usually a SELECT) and have been running for more than n seconds.
Let’s assume that we are running a tool like sysbench that executes queries like:
|
1 |
select distinct c from sbtest1 where id between 1 and 20 order by c<br>select distinct c from sbtest2 where id between 5 and 25 order by c<br>select distinct c from sbtest3 where id between 3 and 7 order by c<br>select distinct c from sbtest1 where id between 2 and 8 order by c<br>select distinct c from sbtest3 where id between 15 and 30 order by c |
and at the same time, in a different process, we are running:
|
1 |
SELECT SLEEP 10 |
Then, while those queries are running, we start pt-kill with these parameters:
|
1 |
--kill-busy-commands=Query --run-time 30m --interval 1 --busy-time 2s --print --match-info "(select|SELECT)" |
In the output, we will have something like this:
|
1 |
# 2019-06-20T11:40:23 KILL 31 (Query 2 sec) SELECT SLEEP(10)<br># 2019-06-20T11:40:23 KILL 513 (Execute 0 sec) SELECT DISTINCT c FROM sbtest48 WHERE id BETWEEN 4980 AND 5079 ORDER BY c<br># 2019-06-20T11:40:23 KILL 524 (Execute 0 sec) SELECT c FROM sbtest24 WHERE id=4997<br># 2019-06-20T11:40:23 KILL 512 (Execute 0 sec) SELECT c FROM sbtest48 WHERE id BETWEEN 5001 AND 5100 ORDER BY c<br># 2019-06-20T11:40:23 KILL 516 (Execute 0 sec) SELECT SUM(k) FROM sbtest41 WHERE id BETWEEN 4988 AND 5087 |
It is clear that SELECT SLEEP(10) should be killed since it exceeds the --busy-time 2s but why are the other queries being killed too if the execution time is 0 seconds?
Well, it is because they match the specified criteria. How so? Let’s analyze how the program works.
First, it creates “buckets”, grouping the queries according to the --group-by criteria. On our command line, there are no grouping criteria so there will only be one bucket with all the found queries in it. The second step here is to check the matching criteria. It is important to point out that all the specified criteria are combined via the logical OR.
We have two matching conditions:
--busy-time 2s
--match-info "(select|SELECT)"
The SELECT SLEEP(10) will match both, so, it should be killed but all other selects will match --match-info "(select|SELECT)". Since the matching criteria is applied via OR, where at least ONE of the conditions should match, all other queries (the ones from sysbench) will be eligible for kill (or print).
According to the documentation:
‑‑each-busy-time
type: time; group: Class Matches
Match query class if each query has been running for longer than this time. “Longer than” means that if you specify 10, for example, the class will only match if each and every query has been running for greater than 10 seconds.
What happens if we use --each-busy-time instead of --busy-time?
Answer: no queries will be killed.
Why? The reason for not killing either SELECT SLEEP(10) or SELECT c FROM sbtest24 WHERE id=4997 is that --each-busy-time will kill the queries only if all the queries in the bucket exceed the specified period (2 seconds in this example) and since we haven’t specified a group by condition, we have only one bucket with all the queries in it. Let’s make it easier to visualize:
|
1 |
Bucket:<br>default<br><br>Queries:<br>- SELECT SLEEP (10)<br>- SELECT c FROM sbtest24 WHERE id=4997 |
Here, in the same bucket (group) we have queries running for 10 seconds and fast queries that take 0 seconds to run. Going back to the documentation, --each-busy-time matches query class (bucket) if each query has been running for longer than this time and here not all queries have been running for more than 2 seconds; we have one query running for more than 2 seconds but all the others were running for 0 seconds. So, the criteria do not match and no queries are going to be killed.
From the previous example, it is clear that not all queries should be considered together. Using
--group-by will let us separate them into different buckets (classes).
Let’s try pt-kill using these parameters:
--kill-busy-commands=Query --run-time 30m --interval 1 --each-busy-time 2s --print --match-info "(select|SELECT)" --group-by=fingerprint .
Now, pt-kill will have 2 buckets as follows:
|
1 |
Bucket:<br>select sleep(?)<br><br>Queries:<br>- (Query 3 sec) SELECT SLEEP(10) |
|
1 |
Bucket:<br>select c from sbtest? where id=?<br><br>Queries:<br># (Execute 0 sec) SELECT DISTINCT c FROM sbtest9 WHERE id BETWEEN 4985 AND 5084 ORDER BY c<br># (Execute 0 sec) SELECT SUM(k) FROM sbtest32 WHERE id BETWEEN 4986 AND 5085<br># (Execute 0 sec) SELECT c FROM sbtest8 WHERE id=5983<br># (Execute 0 sec) SELECT DISTINCT c FROM sbtest20 WHERE id BETWEEN 5570 AND 5669 ORDER BY c<br># (Execute 0 sec) SELECT c FROM sbtest30 WHERE id BETWEEN 4504 AND 4603 ORDER BY c<br># (Execute 0 sec) SELECT c FROM sbtest8 WHERE id BETWEEN 4988 AND 5087<br># (Execute 0 sec) SELECT c FROM sbtest27 WHERE id=5013 |
The matching criterion (–each-busy-time) will be applied to each group independently. For the first group (SELECT SLEEP), it does match since SELECT has been running for more than 2 seconds and then it will be killed. But, for the second group, no queries will match since the execute time for all of them is 0 and to be considered for killing the execute time for all of them should be higher than 2 seconds. So, the queries in the second group won’t be killed and we have achieved what we wanted in the first place.
Happy query killing!