pt-kill: How it Works

pt-kill MySQLI receive many questions about errors being reported about pt-kill… and often those errors are not real. The main problem I have found is that pt-kill is hard to understand and the documentation doesn’t have many examples (I’ll add them), so, I decided to write a post about the most common usage of pt-kill: to kill the queries that match a regexp (usually a SELECT) and have been running for more than n seconds.

Let’s assume that we are running a tool like sysbench that executes queries like:

and at the same time, in a different process, we are running:

Then, while those queries are running, we start pt-kill with these parameters:

In the output, we will have something like this:

It is clear that SELECT SLEEP(10)  should be killed since it exceeds the --busy-time 2s but why are the other queries being killed too if the execution time is 0 seconds?

Well, it is because they match the specified criteria. How so? Let’s analyze how the program works.

First, it creates “buckets”,  grouping the queries according to the --group-by criteria. On our command line, there are no grouping criteria so there will only be one bucket with all the found queries in it. The second step here is to check the matching criteria. It is important to point out that all the specified criteria are combined via the logical OR.

We have two matching conditions:

--busy-time 2s
--match-info "(select|SELECT)"

The SELECT SLEEP(10)  will match both, so, it should be killed but all other selects will match --match-info "(select|SELECT)". Since the matching criteria is applied via OR, where at least ONE of the conditions should match, all other queries (the ones from sysbench) will be eligible for kill (or print).

‑‑each-busy-time

According to the documentation:

‑‑each-busy-time

type: time; group: Class Matches

Match query class if each query has been running for longer than this time. “Longer than” means that if you specify 10, for example, the class will only match if each and every query has been running for greater than 10 seconds.

What happens if we use --each-busy-time instead of --busy-time?

Answer: no queries will be killed.

Why? The reason for not killing either SELECT SLEEP(10)  or SELECT c FROM sbtest24 WHERE id=4997  is that --each-busy-time will kill the queries only if all the queries in the bucket exceed the specified period (2 seconds in this example) and since we haven’t specified a group by condition, we have only one bucket with all the queries in it. Let’s make it easier to visualize:

Here, in the same bucket (group) we have queries running for 10 seconds and fast queries that take 0 seconds to run. Going back to the documentation, --each-busy-time  matches query class (bucket) if each query has been running for longer than this time and here not all queries have been running for more than 2 seconds; we have one query running for more than 2 seconds but all the others were running for 0 seconds. So, the criteria do not match and no queries are going to be killed.

‑‑group-by to the rescue

From the previous example, it is clear that not all queries should be considered together. Using --group-by will let us separate them into different buckets (classes).
Let’s try pt-kill using these parameters: --kill-busy-commands=Query --run-time 30m --interval 1 --each-busy-time 2s --print --match-info "(select|SELECT)" --group-by=fingerprint .

Now, pt-kill will have 2 buckets as follows:

The matching criterion (–each-busy-time) will be applied to each group independently. For the first group (SELECT SLEEP), it does match since SELECT has been running for more than 2 seconds and then it will be killed. But, for the second group, no queries will match since the execute time for all of them is 0 and to be considered for killing the execute time for all of them should be higher than 2 seconds. So, the queries in the second group won’t be killed and we have achieved what we wanted in the first place.

Happy query killing!

Share this post

Leave a Reply