Percona XtraDB Cluster – installation and setup webinar follow up Q&A


Thanks to everyone who attended my webinar. I got many questions, and I want to take this opportunity to answer them.

Q: Even with NTP there is a delay of 0.3-0.4 seconds between servers. Does that mean that the 0.25 seen in the logs can be an issue?
A: My demo VMs had been running for a few hours in my local VirtualBox instance before the webinar, and I tried to show the minimal installation required for XtraDB Cluster. Unfortunately, I didn’t include NTP, which caused SSTs to fail because the tar stream appeared to come from the future. A 0.3 – 0.4 second delay seems too much to me. ntpd’s default step threshold is 128 ms: it corrects the clock gradually only while the offset stays below that, and has to step the clock when the offset is larger. So much for background and theory. From a practical point of view, you should have ntpd running, and you should monitor the stratum. If you are using NTP servers, the stratum should be 3 or less; you can monitor it with

For the record, when I set the clock on my VMs, they were off by almost 90 s compared to the reference.
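
As a rough illustration of this kind of check, here is a minimal Python sketch that inspects the peer table printed by `ntpq -pn`. The function name, the thresholds as parameters and the sample output are made up for the example. It checks the stratum column of the selected peer (the line starting with `*`), which is an assumption about what you want to alert on.

```python
def check_ntp_peer(ntpq_output, max_stratum=3, max_offset_ms=128.0):
    """Return (ok, reason) for the peer that ntpd is currently synced to.

    Expects the text printed by `ntpq -pn`. The selected peer is the
    line whose first character is '*'.
    """
    for line in ntpq_output.splitlines():
        if not line.startswith("*"):
            continue
        # Columns: remote refid st t when poll reach delay offset jitter
        fields = line[1:].split()
        stratum = int(fields[2])
        offset_ms = float(fields[8])  # ntpq reports the offset in milliseconds
        if stratum > max_stratum:
            return False, f"stratum {stratum} is above {max_stratum}"
        if abs(offset_ms) > max_offset_ms:
            return False, f"offset {offset_ms} ms exceeds {max_offset_ms} ms"
        return True, "ok"
    return False, "no selected peer"


# Hypothetical sample output, only to show the expected layout.
SAMPLE = """\
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*192.0.2.10      .GPS.            1 u   33   64  377    0.412    0.250   0.101
+192.0.2.11      192.0.2.10       2 u   40   64  377    0.530   -0.310   0.090
"""

print(check_ntp_peer(SAMPLE))
```

In a real setup you would feed it the output of the command on each node, for example from a monitoring plugin.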

Q: Do you have to bootstrap it every time a node is down?
A: No. You have to bootstrap only when all nodes in the cluster are down and you bring up the first node. If a single node is down, it has to perform a state transfer when it comes back online, which can be either an IST (Incremental State Transfer) or an SST (Snapshot State Transfer). If the node’s data is recent enough that the write sets it still needs to apply are in an online node’s gcache, it will only perform an IST. When I mention bootstrapping here and in the webinar, I mean wsrep_cluster_address=gcomm:// and not pc.bootstrap=true.
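
The IST versus SST choice described above can be sketched as a small decision function. This is a simplified model, not Galera’s actual code: the function and parameter names are hypothetical, and it assumes that a seqno of -1 means the node does not know its own position.

```python
def state_transfer_method(joiner_uuid, joiner_seqno,
                          donor_uuid, donor_seqno, gcache_first_seqno):
    """Pick the state transfer a rejoining node would need (simplified).

    joiner_seqno:       last write set the joiner has applied (-1 = unknown)
    donor_seqno:        last write set the donor has applied
    gcache_first_seqno: oldest write set still kept in the donor's gcache
    """
    # Different cluster history or unknown position: a full copy is needed.
    if joiner_uuid != donor_uuid or joiner_seqno < 0:
        return "SST"
    # Nothing was written while the node was away.
    if joiner_seqno >= donor_seqno:
        return "none"
    # IST works only if the first missing write set is still in the gcache.
    if joiner_seqno + 1 >= gcache_first_seqno:
        return "IST"
    return "SST"


print(state_transfer_method("abc", 1000, "abc", 1500, 900))   # gap is cached
print(state_transfer_method("abc", 1000, "abc", 1500, 1200))  # gap was purged
```

The practical consequence is that a larger gcache lets a node stay down longer and still rejoin with an IST.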

Q: Can you quickly show the my.cnf settings again?
A: Sure, see this post or webinar recording.

Q: Fault tolerance: Can you recommend a three node cluster for production use?
A: Yes, that’s the recommended configuration. The number of nodes/data centers should be odd, so you can always decide which part is in majority in case of a network split.
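
To see why an odd number helps, here is a minimal sketch of the majority rule, assuming every node has the same weight. The function names are made up for this example.

```python
def has_quorum(partition_size, cluster_size):
    """A partition keeps quorum only with strictly more than half the nodes."""
    return 2 * partition_size > cluster_size


def surviving_partition(partition_sizes):
    """Return the index of the partition that keeps quorum, or None."""
    total = sum(partition_sizes)
    for index, size in enumerate(partition_sizes):
        if has_quorum(size, total):
            return index
    return None


print(surviving_partition([2, 1]))  # 3 nodes split 2/1: the 2-node side wins
print(surviving_partition([2, 2]))  # 4 nodes split 2/2: nobody has a majority
```

With an even split, neither side can tell whether it is the majority, so neither side can safely keep accepting writes.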

Q: How do latency and bandwidth affect performance? Is it practical to build the cluster over a WAN connection with ~ 5 ms latency?
A: Yes, it can be practical. Latency affects the response time of your transactions, but because replication is parallel, with many parallel threads, you can still achieve high throughput despite the response time limitation. Even with parallel replication, though, the response time limits the number of transactions per second that you can do on a single record. So if you have hot database areas (single records that are modified very frequently), high latency can be an issue; otherwise it is mitigated by parallel replication. In this case the cost of latency will be 5 ms on each commit.
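
A back-of-the-envelope calculation makes the hot record limit concrete. This sketch assumes that each commit costs exactly one network latency and ignores every other cost; the function name is hypothetical.

```python
def max_commits_per_second(latency_ms, parallel_streams=1):
    """Upper bound on commits per second when each commit waits latency_ms.

    parallel_streams is the number of independent transaction streams.
    Updates to one hot record cannot overlap, so they count as one stream.
    """
    return parallel_streams * 1000.0 / latency_ms


# One hot record over a 5 ms link: at most 200 commits per second.
print(max_commits_per_second(5))
# 32 independent streams over the same link: the bound is 32 times higher.
print(max_commits_per_second(5, parallel_streams=32))
```

The bound for a single record does not improve by adding threads, only by reducing latency or by batching more work into each transaction.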

Q: Can I change the replication method between servers?
A: If you would like to switch a given node back and forth between Galera and built-in (asynchronous) MySQL replication, that is not possible. However, a node in PXC can be a master or a slave in asynchronous replication, because Galera and built-in MySQL replication are independent of each other.

Q: I installed Percona as a cluster database working with Zabbix (InnoDB). Do you have a recommended configuration?
A: We don’t have any specific recommendations for Zabbix, but general InnoDB tuning recommendations apply here. In large Zabbix deployments, you may want to partition your large history tables so the database can keep up with the writes, and/or rethink and reconfigure the retention of the raw data.

Q: What are the steps to automatically restart a Percona Cluster after an outage?
A: It depends on the type of failure. First, you can try to just restart the nodes. If you can’t bring the cluster up that way, your last resort is to bootstrap the first node and do an SST/IST with the rest of the nodes.

Q: We are missing the SST_AUTH parameter, right? It is not possible to do SST without it, as per my testing… or am I wrong here?
A: In the demo I used a blank root password for all nodes, so it did work without it.

Q: For automatic cluster starting, should it be ok to have in [mysqld_safe] the following: wsrep_urls=gcomm://IP1:4567,gcomm://IP2:4567,gcomm://IP3:4567,gcomm://
A: This will work for automatic cluster starting, but unfortunately it can lead to inconsistency in case of a network split. If a network split happens and a node decides to shut down, and somehow restarts because of this automatic process, it gets bootstrapped because of this option, and you can end up having 2 different clusters serving data to different groups of clients. So, the last gcomm:// option should be avoided. Although it can be helpful when you are testing certain scenarios, don’t forget to remove it before going to production.
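
If you want to guard against shipping that option by accident, a pre-deployment check could look like the following sketch. The function name is made up, and it only inspects the value of the wsrep_urls string, not a full configuration file.

```python
def has_bootstrap_fallback(wsrep_urls):
    """Return True if the URL list contains an empty gcomm:// entry.

    An empty gcomm:// entry makes the node bootstrap a new cluster
    when none of the other addresses can be reached.
    """
    entries = [entry.strip() for entry in wsrep_urls.split(",")]
    return any(entry == "gcomm://" for entry in entries)


testing = "gcomm://IP1:4567,gcomm://IP2:4567,gcomm://IP3:4567,gcomm://"
production = "gcomm://IP1:4567,gcomm://IP2:4567,gcomm://IP3:4567"

print(has_bootstrap_fallback(testing))     # True: unsafe for production
print(has_bootstrap_fallback(production))  # False
```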

Q: How would you migrate to XtraDB Cluster from a traditional master-slave scenario when the data set is very large? Can you swap in binaries?
A: Check out Jay Janssen’s webinar about the topic.

Q: When one node in a 3 node cluster is synchronizing with the group, are the other two nodes available to serve data during the synchronization?
A: The synchronizing node has to get the data from somewhere, namely from another node (the donor). The donor can stay available during this process if you are using the xtrabackup SST method. Even if you are using a blocking SST method (rsync, for example), you will still have one node available in this situation.

Q: Does Galera async replication fit in with Percona XtraDB Cluster? i.e. can I run a production cluster in synchronous mode, and a disaster recovery location replicated asynchronously?
A: You can use asynchronous replication between 2 PXC clusters, with one node of the first cluster being the master and one node of the second cluster being the slave. However, if you lose either of those two nodes, in most cases you can’t get a consistent binlog position to continue replication.

Q: I hope that one point that is addressed, if not part of the presentation then as a comment near the end, is how to start up a cluster from scratch (i.e. a standalone XtraDB Cluster machine, convert it to a cluster type, and add nodes to it)
A: For details on starting up a cluster from scratch, see my earlier blog post; for migration, see Jay Janssen’s webinar about the topic.

Q: Do you recommend this configuration for servers in different datacenters? Thanks!
A: Yes, as long as there are at least 3 datacenters, and the total number of the datacenters is odd.

Q: How is the arbitrator configured?
A: You should use the garbd binary for that. It takes wsrep_cluster_address and wsrep_cluster_name as arguments. It will join the cluster, but won’t store or serve data.

Q: Is the configuration file my.cnf ?
A: Yes.

Q: If the cluster blocks writes while it commits to another host, does that not limit write capacity to the “weakest link”? Or have I missed something?
A: The cluster doesn’t block for the duration of the write, but only for certifying that write. See virtually synchronous replication of the cluster explained here.

Q: What about MyISAM storage? Is it safe to use it? Does xtrabackup work as an SST method for MyISAM?
A: Yes, but for MyISAM xtrabackup is blocking: MyISAM tables are copied by the wrapper script called innobackupex, not by xtrabackup itself, while it holds a lock taken with FLUSH TABLES WITH READ LOCK. MyISAM support is experimental. Apart from some very special edge cases, I would not recommend using MyISAM.

Q: How does write performance scale with an increasing number of nodes?
A: Since all the nodes have to do all the writes, you can’t scale writes infinitely by adding nodes. I would expect that writing to a few nodes is faster than writing to only 1, because for each transaction only 1 node has to parse the SQL, while the rest just have to apply RBR events.

Q: I saw an rm -rf * in the demo. What folder is that in, and when is it appropriate to do it?
A: I wanted to show you that the node is actually rebuilt during the SST. You can also force an SST this way (although for forcing an SST it’s enough to delete grastate.dat). So, it is appropriate for demo purposes, when you would like to show that the full data set is copied when you SST again.

Q: Thanx Peter, I loved the webinar! Especially the fact that everything is live!
A: Cheers, you are welcome, I am glad you liked it:). Thanks everyone for all these questions, feel free to ask additional ones in comments.


Comments (4)

  • Marcus Bointon

    NTP in VMs can be problematic – I’ve seen qemu/KVM drifting by over a second per minute!

    October 29, 2012 at 6:16 am
  • Henrik Ingo

    Hi Peter.

    Your audience has interesting questions. Thanks for spreading the Galera knowledge!

    Q: How does write performance scale with an increasing number of nodes?

    There is another, simpler answer to this question. In practice your write transactions are not 100% writes but contain a mix of selects, updates and maybe inserts and deletes. For example, in the sysbench benchmark all transactions are read-write, so you couldn’t do read-only scale-out slaves with sysbench. But inside each sysbench transaction, roughly 75% of the statements are selects and only 25% are writes. Because of this, it is quite easy to scale a sysbench workload by 4x simply by writing to more Galera masters.

    What you say is correct too, and both of these factors will contribute to the improvement.

    October 30, 2012 at 5:43 am
  • Sergey

    Hello, Peter.

    The more I learn from you about PXC, the more I want to implement it!

    Q: How does Percona XtraDB Cluster handle temporary tables and heap tables? If a temp/heap table is created on one PXC node, will it be available for read/write from other PXC nodes?

    Thank you!

    November 2, 2012 at 5:19 am
  • Scott Haas

    Peter –
    I really enjoyed watching this presentation.

    Question for you: is it possible to use MariaDB in place of Percona Server?

    November 8, 2012 at 2:23 pm