September 16, 2014

Percona XtraDB Cluster performance monitoring and troubleshooting

First of all we would like to thank all of you who attended the Feb. 19 MySQL webinar, “Performance Monitoring and Troubleshooting of Percona XtraDB Cluster.” We got some really good questions, many of which we didn’t have time to address during the session, so Johan Andersson, Severalnines CTO, and I are answering them here in this blog post.
You can also click the link above to view the recorded session and access our presentation slides.

Q: I have managed to get Cluster Control running in the past and am now going to try a new installation. However my DB systems do not have internet connectivity for the installation phase. I know it’s possible to work around that, but I haven’t been able to find any installation instructions that cover that scenario, and as a result had to go through a lot of trial-and-error during the installation.
A: The easiest approach is probably to install the cluster manually and then use setup-ui.sh to install the UI, just as Peter demonstrated.

Q: For monitoring, do you have Cacti integration in addition to Nagios?
A: Not yet, but we have received questions about it.

Q: Is it possible/difficult to add ClusterControl to an already existing production cluster?
A: Yes. This is the scenario that Peter showed in his demo.

Q: In a deployment with CC agent, does the agent keep OS stats during a connection outage with the controller? And if so, does that mean there is data loss in an agentless deployment if the controller cannot reach an agent for a while?
A: The agent does not keep the OS stats; there is no backlog. However, the agents only collect OS stats, and those stats are not used to determine whether a cluster is running. The controller has the final word on this: it checks whether a node is reachable by ping and SSH, and if it cannot reach the server by either, the server is deemed dead. SSH access carries more weight than ping when deciding whether a server is up or down.

Q: What are the differences between Zabbix and this monitoring tool?
A: Zabbix is a very good general-purpose tool, while ClusterControl is specialized for database clusters. The key difference is that ClusterControl includes management (adding nodes, backups, configuration management) and presents data as it comes from a cluster, not from a collection of nodes and servers. ClusterControl also ships with a pre-defined set of metrics for database clusters, and its advisors can provide advice on configuration best practices, for example.

Q: Where can I find documentation for the ClusterControl REST API? I’m unable to find it on the severalnines KB.
A: The REST API documentation is currently not online, but thanks for mentioning it.

Q: Is there any way to identify dead-locks with PXC/Cluster Control?
A: Deadlocks are not PXC specific. In regular InnoDB, when two transactions conflict (for example, they update the same row), one will wait for the other's locks before executing. If two transactions are waiting on each other's locks, you have a deadlock. In regular InnoDB, deadlock detection is graph-based: InnoDB maintains a "which transaction waits on which transaction" graph, and if this graph has a loop, InnoDB rolls back one of the deadlocked transactions. Since finding loops becomes increasingly expensive as the graph grows, once the graph reaches a certain size InnoDB assumes there is most likely a deadlock and rolls back some transactions.
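To make the same-node case concrete, here is a minimal sketch of the classic InnoDB deadlock (the table name and values are made up purely for illustration):

-- A made-up table, just for illustration
CREATE TABLE t (id INT PRIMARY KEY, val INT) ENGINE=InnoDB;
INSERT INTO t VALUES (1, 0), (2, 0);

-- Session A:
START TRANSACTION;
UPDATE t SET val = 1 WHERE id = 1;  -- A locks row 1

-- Session B:
START TRANSACTION;
UPDATE t SET val = 1 WHERE id = 2;  -- B locks row 2

-- Session A:
UPDATE t SET val = 1 WHERE id = 2;  -- A now waits for B's lock on row 2

-- Session B:
UPDATE t SET val = 1 WHERE id = 1;  -- the wait graph now has a loop;
-- InnoDB rolls one of the transactions back with:
-- ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction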

With PXC, the situation can change. If the transactions waiting on each other are on the same node, the behavior is exactly the same. If they come from different nodes, both transactions will be able to acquire the row lock, because PXC doesn't escalate locks to remote nodes. When the write sets are replicated, one of the transactions (the first one to reach the certification process) will certify successfully. The other will fail certification and be rolled back, because it tries to modify the same records. So ultimately we get roughly the same high-level behavior: one transaction succeeds and the other is rolled back.
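Sketched with the same made-up table, the cross-node case looks like this (node names are placeholders, and which transaction wins depends on which write set certifies first):

-- On node1:
START TRANSACTION;
UPDATE t SET val = 1 WHERE id = 1;  -- local row lock only; not escalated to node2
COMMIT;                             -- write set replicates and certifies first

-- On node2, concurrently:
START TRANSACTION;
UPDATE t SET val = 2 WHERE id = 1;  -- also succeeds locally, no remote lock
COMMIT;
-- this write set fails certification against node1's and is rolled back:
-- ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction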

As we showed in the demonstration with wsrep_local_recv_queue, ClusterControl is able to graph and alert on MySQL status variables. So, for the second case, when the transactions happen on different nodes, graphing wsrep_local_cert_failures and wsrep_local_bf_aborts will help. For the first case, the Innodb_deadlocks status variable will help. Please note that in case of a certification failure you will get a deadlock in the error message, but the situation is somewhat different from the regular InnoDB case.
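If you want to check these counters by hand (the same values ClusterControl can graph), you can query them directly:

-- Cross-node conflicts (PXC/Galera):
SHOW GLOBAL STATUS LIKE 'wsrep_local_cert_failures';
SHOW GLOBAL STATUS LIKE 'wsrep_local_bf_aborts';

-- Same-node deadlocks (a Percona Server/XtraDB counter):
SHOW GLOBAL STATUS LIKE 'Innodb_deadlocks';

-- The replication queue shown in the demonstration:
SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue';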

Q: Can we have a reporting DB, like a MySQL slave?
A: Yes, it is possible to add a reporting slave. However, in the current UI version this slave will not be displayed, although it should be. With Percona XtraDB Cluster 5.6 this is a lot easier. See Frederic's blog post.
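As a rough sketch of what this looks like with PXC 5.6 and GTIDs (the host name and credentials below are placeholders, and the PXC node acting as master needs log_bin, log_slave_updates, gtid_mode=ON and a unique server_id in its my.cnf):

-- On the reporting slave:
CHANGE MASTER TO
  MASTER_HOST = 'pxc-node1.example.com',  -- placeholder host
  MASTER_USER = 'repl',                   -- placeholder replication user
  MASTER_PASSWORD = '...',
  MASTER_AUTO_POSITION = 1;               -- GTID auto-positioning (5.6+)
START SLAVE;
SHOW SLAVE STATUS\G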

Q: Is there a Zabbix integration for ClusterControl?
A: Nope, not yet, but we have received questions about it. It is, however, trivial to read data from the CMON DB and present it in Zabbix.

Q: Thanks, this was excellent. I will try it today on my PXC.
A: Thanks, mission accomplished! :)

About Peter Boros

Peter joined the European consulting team in May 2012. Before joining Percona, among many other things, he worked at Sun Microsystems, where he specialized in performance tuning, and was a DBA at Hungary's largest social networking site. He also taught many Oracle University MySQL courses. He has been using and working with open source software since the early 2000s. Peter's first and foremost professional interest is performance tuning.

He currently lives in Budapest, Hungary with his wife and son.
