I recently became interested in a newer project of theirs called ‘Consul’. Consul is a bit hard to describe. It is (in part):
- A highly consistent metadata store (a bit like ZooKeeper)
- A monitoring system (a lightweight Nagios)
- A service discovery system, both DNS- and HTTP-based (think of something like HAProxy, but instead of TCP load balancing, it provides DNS lookups that return only healthy services)
What this has to do with Percona XtraDB Cluster
I’ve had some more complex testing for Percona XtraDB Cluster (PXC) to do on my plate for quite a while, and I started to explore Consul as a tool to help with this. I already have Vagrant setups for PXC, but ensuring all the nodes are healthy, kicking off tests, gathering results, etc. were still difficult.
So, my loose goals for Consul are:
- A single dashboard to ensure my testing environment is healthy
- Ability to adapt to any size environment — 3 node clusters up to 20+
- Coordinate starting and stopping load tests running on any number of test clients
- Have the ability to collect distributed test results
I’ve succeeded on some of these fronts with a Vagrant environment I’ve been working on. This spins up:
- A Consul cluster (default is a single node)
- Test server(s)
- A PXC cluster
Additionally, it integrates the Test servers and PXC nodes with Consul such that:
- The servers set up a Consul agent in client mode, joined to the Consul cluster
- They also set up a local DNS forwarder that sends all DNS requests for the ‘.consul’ domain to the local agent, to be serviced by the Consul cluster
- The servers register services with Consul that run local health checks
- The test server(s) set up a ‘watch’ in Consul that waits for a Consul ‘event’ to start sysbench
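For illustration, a service registration with a local health check is just a small JSON file dropped into the agent's configuration directory. This is a hedged sketch, not the exact definition my Vagrant provisioning uses; the check script path is hypothetical:

```json
{
  "service": {
    "name": "pxc",
    "port": 3306,
    "check": {
      "script": "/usr/local/bin/pxc_health_check.sh",
      "interval": "10s"
    }
  }
}
```

On the DNS side, the forwarder only needs to route the ‘.consul’ domain to the agent’s DNS port (8600 by default); with dnsmasq that is a single line: server=/consul/127.0.0.1#8600.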
Seeing it in action
Once I run my ‘vagrant up’, I get a Consul UI I can connect to on my localhost at port 8501:
I can see all 5 of my nodes. I can check the services and see that test1 is failing one health check because sysbench isn’t running yet:
This is expected, because I haven’t started testing yet. I can see that my PXC cluster is healthy:
Involving Percona Cloud Tools in the system
So far, so good. This Vagrant configuration (if I provide a PERCONA_AGENT_API_KEY in my environment) also registers my test servers with Percona Cloud Tools, so I can see data being reported there for my nodes:
So now I am ready to begin my test. To do so, I simply need to issue a consul event from any of the nodes:
jayj@~/Src/pxc_consul $ vagrant ssh consul1
Last login: Wed Nov 26 14:32:38 2014 from 10.0.2.2
[root@consul1 ~]# consul event -name='sysbench_update_index'
Event ID: 7c8aab42-fd2e-de6c-cb0c-1de31c02ce95
The pre-configured watcher on my test node knows what to do with that event and launches sysbench. Consul shows that sysbench is indeed running:
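The watch itself can be declared right in the agent configuration. A minimal sketch of what such a watch might look like (the handler script path is hypothetical):

```json
{
  "watches": [
    {
      "type": "event",
      "name": "sysbench_update_index",
      "handler": "/usr/local/bin/start_sysbench.sh"
    }
  ]
}
```

Any agent with this watch loaded runs the handler whenever a ‘sysbench_update_index’ event fires, which is what lets a single ‘consul event’ command kick off sysbench on every test server at once.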
And I can indeed see traffic start to come in on Percona Cloud Tools:
I have testing traffic limited for my example, but that’s easily tunable via the Vagrantfile. To show something a little more impressive, here’s a 5-node cluster hitting around 2,500 tps total throughput:
So to summarize thus far:
- I can spin up any size cluster I want and verify it is healthy with Consul’s UI
- I can spin up any number of test servers and kick off sysbench on all of them simultaneously
Another big trick of Consul’s
So far so good, but let me point out a few things that may not be obvious. If you check the Vagrantfile, I use a Consul hostname in a few places. First, on the test servers:
# sysbench setup
'tables' => 1,
'rows' => 1000000,
'threads' => 4 * pxc_nodes,
'tx_rate' => 10,
'mysql_host' => 'pxc.service.consul'
and again in the PXC server configuration:
# PXC setup
"percona_server_version" => pxc_version,
'innodb_buffer_pool_size' => '1G',
'innodb_log_file_size' => '1G',
'innodb_flush_log_at_trx_commit' => '0',
'pxc_bootstrap_node' => (i == 1 ? true : false ),
'wsrep_cluster_address' => 'gcomm://pxc.service.consul',
'wsrep_provider_options' => 'gcache.size=2G; gcs.fc_limit=1024',
Notice ‘pxc.service.consul’. This hostname is provided by Consul and resolves to the IPs of all current servers that have, and are passing, the ‘pxc’ service health check:
[root@test1 ~]# host pxc.service.consul
pxc.service.consul has address 172.28.128.7
pxc.service.consul has address 172.28.128.6
pxc.service.consul has address 172.28.128.5
So I am using this to my advantage in two ways:
- My PXC cluster bootstraps the first node automatically, but all the other nodes use this hostname for their wsrep_cluster_address. This means no specific hostnames or IPs in the my.cnf file, and the hostname is always up to date with which nodes are active in the cluster, which is precisely the list that should be in wsrep_cluster_address at any given moment.
- My test servers connect to this hostname, so they always know where to connect, and they will round-robin (given enough sysbench threads and PXC nodes) across different nodes, since the DNS lookup returns the active nodes in a different order each time.
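Putting that together, the test-client invocation might look roughly like this. This is a hedged sketch assuming sysbench 0.5 with the bundled oltp.lua script (the script path and credentials are hypothetical); the option values mirror the Vagrantfile settings above for a 3-node cluster:

```shell
# Each new MySQL connection re-resolves pxc.service.consul, so the
# threads naturally spread across whichever PXC nodes are healthy.
sysbench --test=/usr/share/doc/sysbench/tests/db/oltp.lua \
  --mysql-host=pxc.service.consul \
  --mysql-user=test --mysql-password=test \
  --oltp-tables-count=1 --oltp-table-size=1000000 \
  --num-threads=12 --tx-rate=10 \
  run
```

If a node fails its health check mid-run, Consul simply stops returning it from the DNS lookup, so subsequent connections avoid it without any client-side configuration change.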
(Some of) The Issues
This is still a work in progress and there are many improvements that could be made:
- I’m relying on PCT to collect my data, but it’d be nice to utilize Consul’s central key/value store to store results of the independent sysbench runs.
- Consul’s leader election could be used to help the cluster determine which node should bootstrap on first startup. I am assuming node1 should bootstrap.
- A variety of bugs in various software still makes this a bit clunky sometimes to manage. Here is a sample:
- Consul events sometimes don’t fire in the current release (though it looks to be fixed soon)
- Joining PXC nodes sometimes get stuck, putting speed bumps into the automated deploy.
- Automated installation of percona-agent (which sends data to Percona Cloud Tools) is straightforward, except when different cluster nodes clobber each other’s credentials.
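On the first point, the key/value idea could be sketched against Consul’s HTTP API, which already exists in 0.4.x. The paths and filenames here are hypothetical; this is just the shape of it:

```shell
# Push a sysbench result file from a test node into the KV store...
curl -X PUT --data-binary @/tmp/sysbench_result.txt \
  http://localhost:8500/v1/kv/results/test1

# ...then fetch it back from any node in the cluster.
curl http://localhost:8500/v1/kv/results/test1?raw
```

Since every agent can read the same keys, a single node could then aggregate results from all test servers after a run completes.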
So, in summary, I am happy with how easily Consul integrates, and I’m already finding it useful even though the product is only at its 0.4.1 release.