Downloads

Blog

Advanced Query Analysis in Percona Monitoring and Management with Direct ClickHouse Access

March 30, 2020

Author

Share this Post:

query analysis clickhouse PMM In my Webinar on Using Percona Monitoring and Management (PMM) for MySQL Troubleshooting, I showed how to use direct queries to ClickHouse for advanced query analysis tasks. In the followup Webinar Q&A, I promised to describe it in more detail and share some queries, so here it goes.

PMM uses ClickHouse to store query performance data which gives us great performance and a very high compression ratio. ClickHouse stores data in column-store format so it handles denormalized data very well. As a result, all query performance data is stored in one simple “metrics” table:

Column Name	Comment
queryid	hash of query fingerprint
service_name	Name of service (IP or hostname of DB server by default)
database	PostgreSQL: database
schema	MySQL: database; PostgreSQL: schema
username	client user name
client_host	client IP or hostname
replication_set	Name of replication set
cluster	Cluster name
service_type	Type of service
service_id	Service identifier
environment	Environment name
az	Availability zone
region	Region name
node_model	Node model
node_id	Node identifier
node_name	Node name
node_type	Node type
machine_id	Machine identifier
container_name	Container name
container_id	Container identifier
labels.key	Custom labels names
labels.value	Custom labels values
agent_id	Identifier of agent that collect and send metrics
agent_type	qan-agent-type-invalid = 0 qan-mysql-perfschema-agent,= 1 qan-mysql-slowlog-agent,= 2 qan-mongodb-profiler-agent,= 3 qan-postgresql-pgstatements-agent,= 4 Agent Type that collect of metrics: slowlog,perf schema,etc.
period_start	Time when collection of bucket started
period_length	Duration of collection bucket
fingerprint	mysql digest_text; query without data
example	One of query example from set found in bucket
example_format	EXAMPLE_FORMAT_INVALID = 0 EXAMPLE = 1 FINGERPRINT = 2 Indicates that collect real query examples is prohibited
is_truncated	Indicates if query examples is too long and was truncated
example_type	EXAMPLE_TYPE_INVALID = 0 RANDOM = 1 SLOWEST = 2 FASTEST = 3 WITH_ERROR = 4 Indicates what query example was picked up
example_metrics	Metrics of query example in JSON format.
num_queries_with_warnings	How many queries was with warnings in bucket
warnings.code	List of warnings
warnings.count	Count of each warnings in bucket
num_queries_with_errors	How many queries was with error in bucket
errors.code	List of Last_errno
errors.count	Count of each Last_errno in bucket
num_queries	Amount queries in this bucket
m_query_time_cnt	The statement execution time in seconds was met.
m_query_time_sum	The statement execution time in seconds.
m_query_time_min	Smallest value of query_time in bucket
m_query_time_max	Biggest value of query_time in bucket
m_query_time_p99	99 percentile of value of query_time in bucket
m_lock_time_cnt
m_lock_time_sum	The time to acquire locks in seconds.
m_lock_time_min
m_lock_time_max
m_lock_time_p99
m_rows_sent_cnt
m_rows_sent_sum	The number of rows sent to the client.
m_rows_sent_min
m_rows_sent_max
m_rows_sent_p99
m_rows_examined_cnt
m_rows_examined_sum	Number of rows scanned – SELECT.
m_rows_examined_min
m_rows_examined_max
m_rows_examined_p99
m_rows_affected_cnt
m_rows_affected_sum	Number of rows changed – UPDATE
m_rows_affected_min
m_rows_affected_max
m_rows_affected_p99
m_rows_read_cnt
m_rows_read_sum	The number of rows read from tables.
m_rows_read_min
m_rows_read_max
m_rows_read_p99
m_merge_passes_cnt
m_merge_passes_sum	The number of merge passes that the sort algorithm has had to do.
m_merge_passes_min
m_merge_passes_max
m_merge_passes_p99
m_innodb_io_r_ops_cnt
m_innodb_io_r_ops_sum	Counts the number of page read operations scheduled.
m_innodb_io_r_ops_min
m_innodb_io_r_ops_max
m_innodb_io_r_ops_p99
m_innodb_io_r_bytes_cnt
m_innodb_io_r_bytes_sum	Similar to innodb_IO_r_ops
m_innodb_io_r_bytes_min
m_innodb_io_r_bytes_max
m_innodb_io_r_bytes_p99
m_innodb_io_r_wait_cnt
m_innodb_io_r_wait_sum	Shows how long (in seconds) it took InnoDB to actually read the data from storage.
m_innodb_io_r_wait_min
m_innodb_io_r_wait_max
m_innodb_io_r_wait_p99
m_innodb_rec_lock_wait_cnt
m_innodb_rec_lock_wait_sum	Shows how long (in seconds) the query waited for row locks.
m_innodb_rec_lock_wait_min
m_innodb_rec_lock_wait_max
m_innodb_rec_lock_wait_p99
m_innodb_queue_wait_cnt
m_innodb_queue_wait_sum	Shows how long (in seconds) the query spent either waiting to enter the InnoDB queue or inside that queue waiting for execution.
m_innodb_queue_wait_min
m_innodb_queue_wait_max
m_innodb_queue_wait_p99
m_innodb_pages_distinct_cnt
m_innodb_pages_distinct_sum	Counts approximately the number of unique pages the query accessed.
m_innodb_pages_distinct_min
m_innodb_pages_distinct_max
m_innodb_pages_distinct_p99
m_query_length_cnt
m_query_length_sum	Shows how long the query is.
m_query_length_min
m_query_length_max
m_query_length_p99
m_bytes_sent_cnt
m_bytes_sent_sum	The number of bytes sent to all clients.
m_bytes_sent_min
m_bytes_sent_max
m_bytes_sent_p99
m_tmp_tables_cnt
m_tmp_tables_sum	Number of temporary tables created on memory for the query.
m_tmp_tables_min
m_tmp_tables_max
m_tmp_tables_p99
m_tmp_disk_tables_cnt
m_tmp_disk_tables_sum	Number of temporary tables created on disk for the query.
m_tmp_disk_tables_min
m_tmp_disk_tables_max
m_tmp_disk_tables_p99
m_tmp_table_sizes_cnt
m_tmp_table_sizes_sum	Total Size in bytes for all temporary tables used in the query.
m_tmp_table_sizes_min
m_tmp_table_sizes_max
m_tmp_table_sizes_p99
m_qc_hit_cnt
m_qc_hit_sum	Query Cache hits.
m_full_scan_cnt
m_full_scan_sum	The query performed a full table scan.
m_full_join_cnt
m_full_join_sum	The query performed a full join (a join without indexes).
m_tmp_table_cnt
m_tmp_table_sum	The query created an implicit internal temporary table.
m_tmp_table_on_disk_cnt
m_tmp_table_on_disk_sum	The querys temporary table was stored on disk.
m_filesort_cnt
m_filesort_sum	The query used a filesort.
m_filesort_on_disk_cnt
m_filesort_on_disk_sum	The filesort was performed on disk.
m_select_full_range_join_cnt
m_select_full_range_join_sum	The number of joins that used a range search on a reference table.
m_select_range_cnt
m_select_range_sum	The number of joins that used ranges on the first table.
m_select_range_check_cnt
m_select_range_check_sum	The number of joins without keys that check for key usage after each row.
m_sort_range_cnt
m_sort_range_sum	The number of sorts that were done using ranges.
m_sort_rows_cnt
m_sort_rows_sum	The number of sorted rows.
m_sort_scan_cnt
m_sort_scan_sum	The number of sorts that were done by scanning the table.
m_no_index_used_cnt
m_no_index_used_sum	The number of queries without index.
m_no_good_index_used_cnt
m_no_good_index_used_sum	The number of queries without good index.
m_docs_returned_cnt
m_docs_returned_sum	The number of returned documents.
m_docs_returned_min
m_docs_returned_max
m_docs_returned_p99
m_response_length_cnt
m_response_length_sum	The response length of the query result in bytes.
m_response_length_min
m_response_length_max
m_response_length_p99
m_docs_scanned_cnt
m_docs_scanned_sum	The number of scanned documents.
m_docs_scanned_min
m_docs_scanned_max
m_docs_scanned_p99
m_shared_blks_hit_cnt
m_shared_blks_hit_sum	Total number of shared blocks cache hits by the statement
m_shared_blks_read_cnt
m_shared_blks_read_sum	Total number of shared blocks read by the statement.
m_shared_blks_dirtied_cnt
m_shared_blks_dirtied_sum	Total number of shared blocks dirtied by the statement.
m_shared_blks_written_cnt
m_shared_blks_written_sum	Total number of shared blocks written by the statement.
m_local_blks_hit_cnt
m_local_blks_hit_sum	Total number of local block cache hits by the statement
m_local_blks_read_cnt
m_local_blks_read_sum	Total number of local blocks read by the statement.
m_local_blks_dirtied_cnt
m_local_blks_dirtied_sum	Total number of local blocks dirtied by the statement.
m_local_blks_written_cnt
m_local_blks_written_sum	Total number of local blocks written by the statement.
m_temp_blks_read_cnt
m_temp_blks_read_sum	Total number of temp blocks read by the statement.
m_temp_blks_written_cnt
m_temp_blks_written_sum	Total number of temp blocks written by the statement.
m_blk_read_time_cnt
m_blk_read_time_sum	Total time the statement spent reading blocks
m_blk_write_time_cnt
m_blk_write_time_sum	Total time the statement spent writing blocks

I provided the whole table structure here as it includes a description for many columns. Note not all columns will contain data for all database engines in all configurations, and some are not yet used at all.

Before we get to queries let me explain some general design considerations for this table.

We do not store performance information for every single query; it is not always available to begin with (for example, if using MySQL Performance Schema). Even if it was available with modern database engines capable of serving 1M+ QPS, it would still be a lot of data to store and process.

Instead, we aggregate statistics by “buckets” which can be seen as sort key in the “metrics” table:

ORDER BY (queryid,service_name,database,schema,username,client_host,period_start)

1	ORDER BY (queryid,service_name,database,schema,username,client_host,period_start)

You can think about Sort Key as similar to Clustered Index in MySQL. Basically, for every period (1 minute by default) we store information for every queried, service_name, database, schema, username, and client_host combination.

Period_Start is stored in the UTC timezone.

QueryID – is a hash which identifies unique query pattern, such as “select c from sbtest1 where id=?”

Service_Name is the name of the database instance

Database – is the database or Catalog. We use it in PostgreSQL terminology, not MySQL one

Schema – this is Schema, which also can be referred to as Database in MySQL

UserName – The Database level User Name, which ran this given query.

Client_Host – HostName or IP of the Client

This data storage format allows us to provide a very detailed workload analysis, for example, you can see if there is a difference in performance profile between different schemas, which is very valuable for many applications that use the “tenant per schema” approach. Or you can see specific workloads that different users generate on your database fleet.

Another thing you may notice is that each metric for each grouping bucket stores several statistical values, such as:

 `m_query_time_cnt` Float32 COMMENT 'The statement execution time in seconds was met.',
 `m_query_time_sum` Float32 COMMENT 'The statement execution time in seconds.',
 `m_query_time_min` Float32 COMMENT 'Smallest value of query_time in bucket',
 `m_query_time_max` Float32 COMMENT 'Biggest value of query_time in bucket',
 `m_query_time_p99` Float32 COMMENT '99 percentile of value of query_time in bucket',

`m_query_time_cnt` Float32 COMMENT 'The statement execution time in seconds was met.',

`m_query_time_sum` Float32 COMMENT 'The statement execution time in seconds.',

`m_query_time_min` Float32 COMMENT 'Smallest value of query_time in bucket',

`m_query_time_max` Float32 COMMENT 'Biggest value of query_time in bucket',

`m_query_time_p99` Float32 COMMENT '99 percentile of value of query_time in bucket',

The _cnt value is the number of times this metric was reported. Every Query should have query_time available but many other measurements may not be available for every engine and any configuration.

The _sum value is the sum for the metric among all _cnt queries. So if you want to compute _avg you should divide _sum by _cnt.

_min, _max and _p99 store the minimum, maximum, and 99 percentile value.

How to Access ClickHouse

To access ClickHouse on PMM Server you should run the “clickhouse-client” command line tool.

If you’re deploying MySQL with Docker you can just run:

docker exec -it pmm2-server clickhouse-client

1	docker exec -it pmm2-server clickhouse-client

Where pmm2-server is the name of the container you’re using for PMM.

Run “use pmm” to select the current schema to PMM.

Query Examples

ClickHouse uses SQL-like language as its query language. I call it SQL-like as it does not implement SQL standard fully, yet it has many additional and very useful extensions. You can find the complete ClickHouse Query Language reference here.

# Number of Queries for the period

select sum(num_queries) from metrics where period_start>'2020-03-18 00:00:00';

# Average Query Execution Time for Last 6 hours 

select avg(m_query_time_sum/m_query_time_cnt)  from metrics where period_start>subtractHours(now(),6);

# What are most frequent query ids ?  Also calculate total number of queries in the same query

select queryid,sum(num_queries) cnt from metrics where period_start>subtractHours(now(),6) group by queryid with totals order by cnt desc limit 10;

# How do actual queries look for those IDs ?

select any(example),sum(num_queries) cnt from metrics where period_start>subtractHours(now(),6) group by queryid order by cnt desc limit 10 G

# queries hitting particular host 

select any(example),sum(num_queries) cnt from metrics where period_start>subtractHours(now(),6) and node_name='mysql1' group by queryid order by cnt desc limit 10 G

# slowest instances of the queries

select any(example),sum(num_queries) cnt, max(m_query_time_max) slowest  from metrics where period_start>subtractHours(now(),6)  group by queryid order by slowest desc limit 10 G


# Query pattern which resulted in the largest temporary table created
select example, m_tmp_table_sizes_max  from metrics where period_start>subtractHours(now(),6) order by m_tmp_table_sizes_max desc limit 1 G 


# Slowest Queries Containing Delete in the text 

select any(example),sum(num_queries) cnt, max(m_query_time_max) slowest  from metrics where period_start>subtractHours(now(),6) and lowerUTF8(example) like '%delete%'  group by queryid order by slowest desc limit 10 G

# Number of Queries for the period

select sum(num_queries) from metrics where period_start>'2020-03-18 00:00:00';

# Average Query Execution Time for Last 6 hours

select avg(m_query_time_sum/m_query_time_cnt) from metrics where period_start>subtractHours(now(),6);

# What are most frequent query ids ? Also calculate total number of queries in the same query

select queryid,sum(num_queries) cnt from metrics where period_start>subtractHours(now(),6) group by queryid with totals order by cnt desc limit 10;

# How do actual queries look for those IDs ?

select any(example),sum(num_queries) cnt from metrics where period_start>subtractHours(now(),6) group by queryid order by cnt desc limit 10 G

# queries hitting particular host

select any(example),sum(num_queries) cnt from metrics where period_start>subtractHours(now(),6) and node_name='mysql1' group by queryid order by cnt desc limit 10 G

# slowest instances of the queries

select any(example),sum(num_queries) cnt, max(m_query_time_max) slowest from metrics where period_start>subtractHours(now(),6) group by queryid order by slowest desc limit 10 G

# Query pattern which resulted in the largest temporary table created

select example, m_tmp_table_sizes_max from metrics where period_start>subtractHours(now(),6) order by m_tmp_table_sizes_max desc limit 1 G

# Slowest Queries Containing Delete in the text

select any(example),sum(num_queries) cnt, max(m_query_time_max) slowest from metrics where period_start>subtractHours(now(),6) and lowerUTF8(example) like '%delete%' group by queryid order by slowest desc limit 10 G

I hope this gets you started!

If you create some other queries which you find particularly helpful, please feel free to leave them in the comments for others to enjoy!

0 0 votes

Article Rating

3 Comments

Oldest

Newest Most Voted

Inline Feedbacks

View all comments

Fernando Ipar

6 years ago

This is great Peter, thanks!.

A quick comment: I’m testing 2.4.0 and I need to add -h 127.0.0.1 to the clickhouse-client docker exec line before I can connect. Just commenting in case other people have the same issue.

Author

Peter Zaitsev

6 years ago

Hi Fernando,

Great to hear from you!
What error are you getting ? For me it works well without it – checked now on 2 different systems:

[root@celenuc2 ~]# docker exec -it pmm-server clickhouse-client
ClickHouse client version 19.7.5.27.
Connecting to localhost:9000 as user default.
Connected to ClickHouse server version 19.7.5 revision 54419.