Ignacio Nin and I (mostly Ignacio) have worked together to create tcprstat[1], a new tool that times TCP requests and prints out statistics on them. The output looks somewhat like vmstat or iostat, but we’ve chosen the statistics carefully so you can compute meaningful things about your TCP traffic.
What is this good for? In a nutshell, it is a lightweight way to measure response times on a server such as a database, memcached, Apache, and so on. You can use this information for historical metrics, capacity planning, troubleshooting, and monitoring, to name just a few uses.
The tcprstat tool itself is a means of gathering raw statistics, which are suitable for storing and manipulating with other programs and scripts. By default, tcprstat works just like vmstat: it runs once, prints out a line, and exits. You’ll probably want to tell it to run forever, and continue to print out more lines. Each line contains a timestamp and information about the response time of the requests within that time period. Here “response time” means, for a given TCP connection, the time elapsed from the last inbound packet until the first outbound packet. For many simple protocols such as HTTP and MySQL, this is the moral equivalent of a query’s response time.
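To make that definition concrete, here is a minimal Python sketch of the timing rule. It is not how tcprstat is implemented (the tool is written in C on top of libpcap); the event tuples, the connection labels, and the function name `response_times` are all made up for illustration, and it assumes packets arrive already sorted by time and tagged with a direction.

```python
def response_times(packets):
    """Yield (connection, elapsed_us) pairs from an iterable of packet events.

    Each event is a hypothetical (timestamp_us, connection, direction) tuple,
    where direction is "in" (client to server) or "out" (server to client).
    """
    last_inbound = {}  # connection -> timestamp of the most recent inbound packet
    for timestamp_us, connection, direction in packets:
        if direction == "in":
            # Overwrite on every inbound packet so only the LAST one is kept.
            last_inbound[connection] = timestamp_us
        elif connection in last_inbound:
            # First outbound packet after a request: emit the elapsed time and
            # forget the request, so later packets of the same response are
            # not counted again.
            yield connection, timestamp_us - last_inbound.pop(connection)


# Illustrative events only; these are not captured from a real server.
packets = [
    (1000, "10.0.0.5:41000", "in"),
    (1040, "10.0.0.5:41000", "in"),   # the request spans two packets
    (1100, "10.0.0.9:52000", "in"),
    (1163, "10.0.0.9:52000", "out"),  # 1163 - 1100 = 63
    (1190, "10.0.0.5:41000", "out"),  # 1190 - 1040 = 150
    (1200, "10.0.0.5:41000", "out"),  # rest of the response, ignored
]

results = list(response_times(packets))
print(results)
```

Note that the second connection's time is measured from its last inbound packet (1040), not its first (1000), which is exactly the "last inbound until first outbound" rule described above.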
The statistics we chose to output by default are the count, median, average, min, max, and standard deviation of the response times, in microseconds. The max, average, and standard deviation are repeated for the 95th and 99th percentiles as well. Other metrics are also available. Here’s a sample:
<pre>
[root@server] # tcprstat -p 3306 -n 0 -t 1
timestamp count max min avg med stddev 95_max 95_avg 95_std 99_max 99_avg 99_std
1276827985 1341 24556 23 149 59 767 310 91 69 1030 107 112
1276827986 1329 12098 28 134 63 461 299 91 65 667 104 93
1276827987 1180 13277 22 202 93 873 439 103 79 1523 131 169
1276827988 1441 15878 27 180 139 672 427 116 79 1045 136 128
1276827989 1432 157198 26 272 138 4165 405 115 80 1092 134 123
1276827990 1835 25198 26 183 124 734 448 115 85 1141 137 141
1276827991 1242 6949 29 129 114 301 233 98 61 686 109 84
1276827992 1480 284181 25 442 127 7432 701 128 114 4157 173 293
1276827993 1448 9339 22 161 88 425 392 104 80 1280 126 140
</pre>
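Because the output is plain whitespace-separated columns, it is easy to feed into scripts. The following Python sketch parses the first two sample lines and combines them into a count-weighted average. It assumes the default column layout shown in the sample and that every field is an integer; the function name `parse_tcprstat` is ours, not part of the tool. Since the printed averages are already rounded to whole microseconds, the combined figure is approximate.

```python
# Header and first two data lines copied from the sample output.
SAMPLE = """\
timestamp count max min avg med stddev 95_max 95_avg 95_std 99_max 99_avg 99_std
1276827985 1341 24556 23 149 59 767 310 91 69 1030 107 112
1276827986 1329 12098 28 134 63 461 299 91 65 667 104 93
"""


def parse_tcprstat(text):
    """Turn tcprstat's default output into a list of dicts keyed by column name."""
    lines = [line.split() for line in text.splitlines() if line.strip()]
    header, data = lines[0], lines[1:]
    # Assumption: every column is an integer, as in the sample.
    return [dict(zip(header, map(int, fields))) for fields in data]


rows = parse_tcprstat(SAMPLE)

total_requests = sum(r["count"] for r in rows)
# count * avg recovers (roughly) the total response time in each interval,
# which lets us weight each interval by how many requests it contained.
total_time_us = sum(r["count"] * r["avg"] for r in rows)
overall_avg_us = total_time_us / total_requests
worst_max_us = max(r["max"] for r in rows)

print(f"{total_requests} requests, average {overall_avg_us:.1f} us, "
      f"worst {worst_max_us} us")
```

Averaging the `avg` column directly would give every interval equal weight regardless of traffic; weighting by `count` avoids that, which is one reason the count is worth storing alongside the other statistics.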
tcprstat uses libpcap to capture traffic. It’s a threaded application that does the minimum possible work and uses efficient data structures. Your feedback on the kernel/userland exchange overhead caused by the packet sniffing would be much appreciated. libpcap allows the user to tune this exchange, so if you have suggestions on how to improve it, that’s great.
We build statically linked binaries with the preferred version of libpcap, which means there are no dependencies. You can just run the tool. In the future, packages in the Percona repositories will provide another means for rapid installation via yum and apt.
tcprstat is beta software. Several C/C++ experts reviewed its code and gave it a thumbs-up, so many eyes have been on the code. We’ve performed tests on servers with high loads and observed minimal resource consumption. I personally have been running it for many weeks on some production servers without stopping it and have seen no problems, so I am pretty sure it has no memory leaks or other problems. Nevertheless, it’s a first prototype release, and we want much more testing. We might also change the functionality; as we build tools around it, we discover new things that might be useful. When we’re happy with it and you’re happy with it, we’ll take the Beta label away and make it GA.
The tcprstat user’s manual and links to downloads are on the Percona wiki. Commercial support and services are provided by Percona. Bug reports, feature requests, etc. should go to the Launchpad project linked from the user’s manual. General discussion is welcome on the Google Group, also linked from the user’s manual.
[1] Historical note: we initially called this tool rtime, but did not publicize it. However, some of you might have heard of “rtime” before. This is the same tool.