September 18, 2014

High Availability with MySQL Fabric: Part I

In our previous post, we introduced the MySQL Fabric utility and said we would dig deeper into it. This post is the first part of our test of MySQL Fabric’s High Availability (HA) functionality.

Today, we’ll review MySQL Fabric’s HA concepts, and then walk you through the setup of a 3-node cluster with one Primary and two Secondaries, doing a few basic tests with it. In a second post, we will spend more time generating failure scenarios and documenting how Fabric handles them. (MySQL Fabric is an extensible framework to manage large farms of MySQL servers, with support for high-availability and sharding.)

Before we begin, we recommend you read this post by Oracle’s Mats Kindahl, which, among other things, addresses the issues we raised in our first post. Mats leads the MySQL Fabric team.

Our lab

All our tests use our Vagrant test environment (https://github.com/martinarrieta/vagrant-fabric).

If you want to play with MySQL Fabric, you can have these VMs running on your desktop by following the instructions in the README file. If you don’t want full VMs, our colleague Jervin Real created a set of wrapper scripts that let you test MySQL Fabric using sandboxes.

Here is a basic representation of our environment.

[Figure: Fabric Lab]

Set up

To set up MySQL Fabric without using our Vagrant environment, you can follow the official documentation, or check the Ansible playbooks in our lab repo. If you follow the manual, the only caveat is that when creating the user, you should either disable binary logging for your session, or use a GRANT statement instead of CREATE USER. You can read here for more info on why this is the case.

A description of all the options in the configuration file can be found here. For HA tests, the one thing to mention is that, in our experience, the failure detector will only trigger an automatic failover if the value for failover_interval in the [failure_tracking] section is greater than 0. Otherwise, failures will be detected and written to the log, but no action will be taken.
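As a rough illustration of that rule, the sketch below writes a scratch file containing only the [failure_tracking] section and reports which behaviour we would expect from it. The file name fabric-sample.cfg and the value 60 are made up for the example; point the awk command at your real fabric.cfg to check your own setup.

```shell
# Scratch copy of the relevant section; 60 is a hypothetical value.
cat > fabric-sample.cfg <<'EOF'
[failure_tracking]
failover_interval = 60
EOF

# Read failover_interval from the [failure_tracking] section only.
interval=$(awk -F'=' '
  /^\[/ { in_section = ($0 == "[failure_tracking]") }
  in_section && $1 ~ /^[ \t]*failover_interval[ \t]*$/ { gsub(/[ \t]/, "", $2); print $2 }
' fabric-sample.cfg)

# Apply the rule described above: only a value greater than 0 enables failover.
if [ "${interval:-0}" -gt 0 ]; then
  echo "automatic failover enabled (failover_interval=$interval)"
else
  echo "failures are only logged (failover_interval=${interval:-0})"
fi
```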

MySQL configuration

In order to manage a mysqld instance with MySQL Fabric, the following options need to be set in the [mysqld] section of its my.cnf file:
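A sketch of such a section, assuming MySQL 5.6 and the usual requirements for GTID-based replication. It is written to a scratch file (the name fabric-mysqld.cnf and the mysql-bin base name are arbitrary) rather than to a live my.cnf, so you can review it against the official documentation before merging it by hand.

```shell
# Write the fragment to a scratch file for review; it does not touch my.cnf.
cat > fabric-mysqld.cnf <<'EOF'
[mysqld]
# GTID-based replication
gtid_mode = ON
enforce_gtid_consistency = ON
# Binary logging, including replicated changes, so a secondary can be promoted
log_bin = mysql-bin
log_slave_updates = ON
EOF
cat fabric-mysqld.cnf
```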

Additionally, as in any replication setup, you must make sure that all servers have a distinct server_id.

When everything is in place, you can set up and start MySQL Fabric with the following commands:
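A minimal sketch of those two steps. The FABRIC variable is our own convention for this example, not part of MySQL Fabric: by default the commands are only printed, and exporting FABRIC=mysqlfabric runs them for real. The sketch leaves out the daemon option because a reader reports in the comments that their version rejected it.

```shell
# Dry run by default: commands are printed, not executed.
# Export FABRIC=mysqlfabric to run them for real.
FABRIC="${FABRIC:-echo mysqlfabric}"

$FABRIC manage setup   # create the schema in the backing store
$FABRIC manage start   # start the Fabric node (stays attached to the terminal)
```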

The setup command creates the database schema used by MySQL Fabric to store information about managed servers, and the start one, well, starts the daemon. The --daemon option makes Fabric start as a daemon, logging to a file instead of to standard output. Depending on the port and file name you configured in fabric.cfg, this may need to be run as root.

While testing, you can make MySQL Fabric reset its state at any time (though it won’t change existing node configurations such as replication) by running:
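One plausible reset sequence, assuming the manage stop and manage teardown subcommands are available in your version and that you want a freshly initialised, running Fabric node afterwards. As before, FABRIC is a dry-run switch we made up for these examples.

```shell
# Dry run by default; export FABRIC=mysqlfabric to execute.
FABRIC="${FABRIC:-echo mysqlfabric}"

$FABRIC manage stop       # stop the running Fabric node, if any
$FABRIC manage teardown   # drop Fabric's schema from the backing store
$FABRIC manage setup      # recreate it, empty
$FABRIC manage start      # start the Fabric node again
```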

If you’re using our Vagrant environment, you can run the reinit_cluster.sh script from your host OS (from the root of the vagrant-fabric repo) to do this for you, and also initialise the datadir of the three instances.

Creating a High Availability Cluster

A High Availability Cluster is a set of servers using the standard Asynchronous MySQL Replication with GTID.

Creating a group

The first step is to create the group by running mysqlfabric with this syntax:

In our example, to create the cluster “mycluster” you can run:
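For reference, here is a sketch based on the help output quoted in the comments below (group create group_id [--description=NONE] [--synchronous]). The FABRIC variable is a dry-run switch of our own, not a Fabric feature.

```shell
# Dry run by default; export FABRIC=mysqlfabric to execute.
FABRIC="${FABRIC:-echo mysqlfabric}"

# Syntax: mysqlfabric group create <group_id>
$FABRIC group create mycluster
```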

Add the servers to the group

The second step is to add the servers to the group. The syntax to add a server to a group is:

The port number is optional and only required if distinct from 3306. It is important to mention that the clients that will use this cluster must be able to resolve this host or IP. This is because clients will connect directly both with MySQL Fabric’s XML-RPC server and with the managed mysqld servers. Let’s add the nodes to our group.
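A sketch of this step, assuming the three lab machines resolve as node1, node2 and node3 (adjust to your own host names) and that the credentials Fabric uses to reach them are already set in the [servers] section of fabric.cfg, as one reader found out in the comments. FABRIC is again our own dry-run switch.

```shell
# Dry run by default; export FABRIC=mysqlfabric to execute.
FABRIC="${FABRIC:-echo mysqlfabric}"

# Syntax: mysqlfabric group add <group_id> <host_name or IP>[:<port>]
for host in node1 node2 node3; do
  $FABRIC group add mycluster "$host"
done
```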

Promote a node as a master

Now that we have all our nodes in the group, we have to promote one of them. You can promote one specific node or you can let MySQL Fabric choose one for you.

The syntax to promote a specific node is:

or to let MySQL Fabric pick one:

Let’s do that:
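Both variants can be sketched as follows. The --slave_id option name is the one confirmed in the comments below; SERVER_UUID is a variable we introduce for the example and must hold the uuid of one of your servers if you want to pick the new master yourself. FABRIC is our dry-run switch.

```shell
# Dry run by default; export FABRIC=mysqlfabric to execute.
FABRIC="${FABRIC:-echo mysqlfabric}"
# Leave empty to let Fabric choose the candidate.
SERVER_UUID="${SERVER_UUID:-}"

if [ -n "$SERVER_UUID" ]; then
  # Promote one specific server
  $FABRIC group promote mycluster --slave_id="$SERVER_UUID"
else
  # Let MySQL Fabric pick one
  $FABRIC group promote mycluster
fi
```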

You can then check the health of the group like this:
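Assuming the health command lives under the group command family, like the other commands in this post, the call would look like this (dry run unless FABRIC=mysqlfabric is exported):

```shell
FABRIC="${FABRIC:-echo mysqlfabric}"
$FABRIC group health mycluster
```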

One current limitation of the ‘health’ command is that it only identifies servers by their uuid. To get a list of the servers in a group, along with a quick status summary and their host names, use lookup_servers instead:

We sent a merge request that returns a JSON string, instead of the “print” of the object, in the “return” field of the XML-RPC response, so that the information can be used to display the results in a friendlier way. The same merge request also adds the server addresses to the output of the health command.
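A sketch of the call for our example group; the output format depends on your Fabric version, so none is shown here. FABRIC is our own dry-run switch.

```shell
FABRIC="${FABRIC:-echo mysqlfabric}"
$FABRIC group lookup_servers mycluster
```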

Failure detection

Now we have the three lab machines set up in a replication topology of one master (the PRIMARY server) and two slaves (the SECONDARY ones). To make MySQL Fabric start monitoring the group for problems, you need to activate it:
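Activation is a single command against the group. This is a sketch for our example group, printed rather than executed unless FABRIC=mysqlfabric is exported.

```shell
FABRIC="${FABRIC:-echo mysqlfabric}"
$FABRIC group activate mycluster
```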

Now MySQL Fabric will monitor the group’s servers, and depending on the configuration (remember the failover_interval we mentioned before) it may trigger an automatic failover. But let’s start testing a simpler case, by stopping mysql on one of the secondary nodes:
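How you stop mysqld depends on your distribution and packages. The sketch below assumes a SysV-style service named mysql and is meant to be run on the secondary itself; the RUN variable is our own dry-run guard, so the command is only printed unless you set RUN to an empty value.

```shell
# RUN defaults to "echo" (print only); run with RUN= to really stop the service.
RUN="${RUN-echo}"

# The service name "mysql" is an assumption; it may be "mysqld" on your system.
$RUN sudo service mysql stop
```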

And checking how MySQL Fabric reports the group’s health after this:

We can see that MySQL Fabric successfully marks the server as faulty. In our next post we’ll show an example of this by using one of the supported connectors to handle failures in a group, but for now, let’s keep on the DBA/sysadmin side of things, and try to bring the server back online:

So the server is back online, but Fabric still considers it faulty. To add the server back into rotation, we need to look at the server commands:

The specific command we need is set_status, and in order to add the server back to the group, we need to change its status twice: first to SPARE and then back to SECONDARY. You can see what happens if we try to set it to SECONDARY directly:

So let’s try it the right way:
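The two-step change can be sketched like this. The all-zero uuid is only a placeholder for the uuid of the server that was marked faulty, and FABRIC is our dry-run switch.

```shell
# Dry run by default; export FABRIC=mysqlfabric to execute.
FABRIC="${FABRIC:-echo mysqlfabric}"
# Placeholder: replace with the uuid of the recovered server.
SERVER_UUID="${SERVER_UUID:-00000000-0000-0000-0000-000000000000}"

$FABRIC server set_status "$SERVER_UUID" SPARE       # step 1: FAULTY -> SPARE
$FABRIC server set_status "$SERVER_UUID" SECONDARY   # step 2: SPARE -> SECONDARY
```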

And check the group’s health again:

In our next post, when we discuss how to use the Fabric aware connectors, we’ll also test other failure scenarios like hard VM shutdown and network errors, but for now, let’s try the same thing but on the PRIMARY node instead:

And let’s check the servers again:

We can see that MySQL Fabric successfully marked node3 as FAULTY, and promoted node2 to PRIMARY to resolve this. Once we start mysqld again on node3, we can add it back as SECONDARY using the same process of setting its status to SPARE first, as we did for node2 above.

Remember that unless failover_interval is greater than 0, MySQL Fabric will detect problems in an active group, but it won’t take any automatic action. We think it’s a good thing that the value for this variable in the documentation is 0, so that automatic failover is not enabled by default (if people follow the manual, of course), as even in mature HA solutions like Pacemaker, automatic failover is something that’s tricky to get right. But even without this, we believe the main benefit of using MySQL Fabric for promotion is that it takes care of reconfiguring replication for you, which should reduce the risk of error in this process, especially once the project becomes GA.

What’s next

In this post we’ve presented a basic replication setup managed by MySQL Fabric and reviewed a couple of failure scenarios, but many questions are left unanswered, among them:

  • What happens to clients connected with a Fabric aware driver when there is a status change in the cluster?
  • What happens when the XML-RPC server goes down?
  • How can we improve its availability?

We’ll try to answer these and other questions in our next post. If you have some questions of your own, please leave them in the comments section and we’ll address them in the next or other posts, depending on the topic.

About Martin Arrieta

Martin joined Percona in January 2012. He has been using Linux and open source technologies since 1999. Martin has worked with Apache, DNS's, mail servers, iptables and MySQL servers.

Comments

  1. MWM says:

    Hi,

    I am facing the following issue while adding mysql instance into the group

    mysqlfabric group create group1 192.168.0.211:3306

    Result:
    success : false
    return : server error : Error Accessing Server (192.168.0.211:3306)

    What could be the reasons.

    Waiting for your positive reply

    Regards
    MWM

  2. @MWM: The group create command does not need a host, so just:

    > mysqlfabric group create group1

    should work.

    You can check the help to confirm this on your version:

    > [vagrant@store ~]$ mysqlfabric help group create
    > group create group_id [--description=NONE] [--synchronous]
    >
    > Create a group.

    That said, if you experience this error with other commands (like mysqlfabric group add group1 192.168.0.211:3306), my guess would be that either the fabric user is missing on that mysqld instance, or that it does not allow connections from the host you’re running the mysqlfabric command.

    You can verify what IP or host name this is if you have the error log enabled on the 0.211:3306 server and you set log_warnings to something greater than 1 (see http://dev.mysql.com/doc/refman/5.6/en/server-system-variables.html#sysvar_log_warnings for a full description).

  3. MWM says:

    Dear Fernando,

    Thank you for replying.

    You are right, I mistakenly wrote “create” instead of “add” in the question.

    Secondly, my machine @192.168.0.211 allows connection with user and password. So, issue was with the fabric.cfg file. I didn’t give any value against ‘password’ parameter of [servers] section.

    After setting the above mentioned parameter, utility allowed to add MySQL instance into the group.

    Regards
    MWM

  4. Drew Schatt says:

    [root@store vagrant]# mysqlfabric manage start --daemon
    mysqlfabric: error: no such option: --daemon

    This is having pulled down the git repo and following the instructions… Perhaps the documentation needs to be updated?

  5. @Drew:

    This is due to this: http://bugs.mysql.com/bug.php?id=72818

    But you’re right, while this is not resolved, we need to update the instructions. Just running mysqlfabric manage start should be enough, though it will stay attached to your stdin/stdout, so you may want to use nohup and redirect the output to a log file instead.

  6. Tim says:

    When I use the following command
    “mysqlfabric group promote --slave_uuid=””,
    I got mysqlfabric: error: no such option: --slave_uuid.

    The command should be
    “mysqlfabric group promote --slave_id=””.
    Am I right?

  7. @Tim:

    You’re right, thanks for pointing that out!
    I’ve corrected the example.

  8. Lakshma says:

    @Martin Arrieta and Fernando Ipar and @ALL fabric enthusiast,

    Nice blog, am playing around HA aspect of Fabric and am stuck. I have backing store and 3 mysql nodes on different machines. i created fabric user on backing store to access fabric database and fabric user on mysql nodes for fabric system to connect to mysql nodes. i created ha group and added 3 hosts to the group. when i promote the group the selection of master based on our choice and also auto pick is doing great. The problem starts with the secondary nodes. They are trying to connect to master using user fabric@masterhost (mysql host that is picked as primary when i promoted the group) and having error 'io_running': False, 'io_error': "error connecting to master 'fabric@masterhost:3306'". Funny part when i do show slave status on secondary nodes i see slave_SQL_running=yes, slave_IO_running says connecting and Last_IO_Error: error connecting to master 'fabric@masterhostname:3306' - retry-time: 60 retries: 4.
    I may be missing the step of setting up user credentials in fabric.cfg for servers to use for setting up replication among the ha group hosts. Any help is appreciated.

    Thanks
    DL

  9. @Lakshma:

    Thanks for taking the time to comment.
    You’re right in that you’re probably missing a step while setting up credentials. I’d start looking at the error log on the (elected) master node, with @@log_warnings > 1 (i.e. run “set @@global.log_warnings=2;” on it, as root).

    With that variable set, you should see failed connection attempts in the error log, which should help you create any needed user. After setting this variable, error log messages should look like this:

    140813 18:55:54 [Warning] Access denied for user 'test'@'localhost' (using password: YES)
