Percona XtraDB Cluster on Ceph

August 4, 2016

Author

Yves Trudeau

MySQL

Percona Software

This post discusses how XtraDB Cluster and Ceph are a good match, and how their combination allows for faster SST and a smaller disk footprint.

My last post was an introduction to Red Hat’s Ceph. As interesting and useful as it was, it wasn’t a practical example. Like most of the readers, I learn about and see the possibilities of technologies by burning my fingers on them. This post dives into a real and novel Ceph use case: handling of the Percona XtraDB Cluster SST operation using Ceph snapshots.

If you are familiar with Percona XtraDB Cluster, you know that a full state snapshot transfer (SST) is required to provision a new cluster node. Similarly, SST can also be triggered when a cluster node happens to have a corrupted dataset. Those SST operations consist essentially of a full copy of the dataset sent over the network. The most common SST methods are XtraBackup and rsync. Both of these methods imply a significant impact and load on the donor while the SST operation is in progress.

For example, the whole dataset will need to be read from the storage and sent over the network, an operation that requires a lot of IO operations and CPU time. Furthermore, with the rsync SST method, the donor is under a read lock for the whole duration of the SST. Consequently, it can take no write operations. Such constraints on SST operations are often the main motivations behind the reluctance to use Percona XtraDB Cluster with large datasets.

So, what could we do to speed up SST? In this post, I will describe a method of performing SST operations when the data is not local to the nodes. You could easily modify the solution I am proposing for any non-local data source technology that supports snapshots/clones, and has an accessible management API. Off the top of my head (other than Ceph) I see AWS EBS and many SAN-based storage solutions as good fits.

The challenges of clone-based SST

If we could use snapshots and clones, what would be the logical steps for an SST? Let’s have a look at the following list:

1. The new node (joiner) starts and unmounts its current MySQL datadir

2. The joiner asks for an SST

3. The donor creates a consistent snapshot of its MySQL datadir with the Galera position

4. The donor sends to the joiner the name of the snapshot to use

5. The joiner creates a clone of the snapshot name provided by the donor

6. The joiner mounts the snapshot clone as the MySQL datadir and adjusts ownership

7. The joiner initializes MySQL on the mounted clone

As we can see, all these steps are fairly simple, but they hide some challenges for an SST method based on cloning. The first challenge is the need to mount the snapshot clone. Mounting a block device requires root privileges, and SST scripts normally run under the MySQL user. The second challenge I encountered wasn’t expected. MySQL opens the datadir and some files in it before the SST happens. Consequently, those files are then kept open in the underlying mount point, a situation that is far from ideal. Fortunately, there are solutions to both of these challenges as we will see below.
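
To make the list above more concrete, here is a rough dry-run sketch of how the donor and joiner steps could map onto rbd commands. It only prints the commands. The function names and the example arguments are my own illustration, not taken from the actual SST script, and the sketch leaves out how the snapshot is made consistent and how the Galera position is recorded.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands for a clone-based SST instead of
# running them. All names passed in below are examples.

# Donor: snapshot the image behind the datadir, then protect the snapshot
# (with the Ceph releases of this era, a format 2 snapshot must be
# protected before it can be cloned).
donor_plan() {
    pool=$1; image=$2; snap=$3
    echo "rbd snap create ${pool}/${image}@${snap}"
    echo "rbd snap protect ${pool}/${image}@${snap}"
}

# Joiner: clone the snapshot, map the clone, mount it, fix ownership.
# $6 is the block device that "rbd map" reports (for example /dev/rbd1).
joiner_plan() {
    pool=$1; image=$2; snap=$3; localpool=$4; clone=$5; device=$6; datadir=$7
    echo "rbd clone ${pool}/${image}@${snap} ${localpool}/${clone}"
    echo "rbd map ${localpool}/${clone}"
    echo "mount -o rw,noatime,nouuid ${device} ${datadir}"
    echo "chown mysql:mysql ${datadir}"
}

donor_plan mysqlpool PXC 1463776423
joiner_plan mysqlpool PXC 1463776423 mysqlpool PXC2-1463776424 /dev/rbd1 /var/lib/mysql
```

Printing the plan rather than executing it keeps the sketch safe to run anywhere, including on a machine without a Ceph client.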

SST script

So, let’s start with the SST script. The script is available in my Github at:

https://github.com/y-trudeau/ceph-related-tools/raw/master/wsrep-sst/wsrep_sst_ceph

You should install the script in the /usr/bin directory, along with the other user scripts. Once installed, I recommend:

chown root:root /usr/bin/wsrep_sst_ceph
chmod 755 /usr/bin/wsrep_sst_ceph

The script has a few parameters that can be defined in the [sst] section of the my.cnf file.

cephlocalpool

The Ceph pool where this node should create the clone. It can be a different pool from the one of the original dataset. For example, it could have a replication factor of 1 (no replication) for a read scaling node. The default value is: mysqlpool

cephmountpoint

What mount point to use. It defaults to the MySQL datadir as provided to the SST script.

cephmountoptions

The options used to mount the filesystem. The default value is: rw,noatime

cephkeyring

The Ceph keyring file to authenticate against the Ceph cluster with cephx. The user under which MySQL is running must be able to read the file. The default value is: /etc/ceph/ceph.client.admin.keyring

cephcleanup

Whether or not the script should clean up the snapshots and clones that are no longer in use. Enable = 1, Disable = 0. The default value is: 0
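
As an illustration of how such parameters and their defaults could be resolved, here is a minimal sketch that reads one key from the [sst] section of a my.cnf-style file. The sst_option helper is hypothetical; the real script may well rely on a different mechanism, such as the my_print_defaults tool.

```shell
# Hypothetical helper: print the value of a key from the [sst] section of
# a my.cnf-style file, or the supplied default when the key is absent.
# Usage: sst_option <file> <key> <default>
sst_option() {
    cnf=$1; key=$2; default=$3
    value=$(awk -F= -v key="$key" '
        # A section header switches the "inside [sst]" flag on or off.
        /^[ \t]*\[/ { in_sst = ($0 ~ /^[ \t]*\[sst\]/); next }
        in_sst {
            k = $1
            gsub(/[ \t]/, "", k)
            if (k == key) {
                v = substr($0, index($0, "=") + 1)
                gsub(/^[ \t]+|[ \t]+$/, "", v)
                print v
                exit
            }
        }' "$cnf")
    echo "${value:-$default}"
}
```

For example, `sst_option /etc/mysql/my.cnf cephlocalpool mysqlpool` would print the configured pool, or mysqlpool when the option is not set (the file path here is only an example).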

Root privileges

To allow the SST script to perform privileged operations, I added an extra SST role: “mount”. The SST script on the joiner will call itself back with sudo and will pass “mount” for the role parameter. To allow the elevation of privileges, the following line must be added to the /etc/sudoers file:

mysql ALL=NOPASSWD: /usr/bin/wsrep_sst_ceph
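
The callback pattern can be sketched roughly as follows. This is an illustration only: the mount_callback_cmd function is hypothetical, and I am assuming the role is passed with a --role flag, as Galera does when it invokes SST scripts.

```shell
# Sketch: build the command line a joiner could use to re-invoke the SST
# script for the privileged "mount" role.
# Usage: mount_callback_cmd <current uid> <path to SST script>
mount_callback_cmd() {
    uid=$1; script=$2
    if [ "$uid" -eq 0 ]; then
        # Already root: no elevation needed.
        echo "$script --role mount"
    else
        # -n makes sudo fail instead of prompting for a password, which
        # matters because nobody is there to answer during an SST.
        echo "sudo -n $script --role mount"
    fi
}

mount_callback_cmd "$(id -u)" /usr/bin/wsrep_sst_ceph
```

After editing /etc/sudoers, running `visudo -c` checks the file for syntax errors before a broken entry can lock sudo out.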

Files opened by MySQL before the SST

Upon startup, MySQL opens files at two places in the code before the SST completes. The first one is in the function mysqld_main, which sets the current working directory to the datadir (an empty directory at that point). After the SST, a block device is mounted on the datadir. The issue is that MySQL still looks for its files in the empty mount point directory underneath the new mount. I wrote a simple patch, presented below, and issued a pull request:

diff --git a/sql/mysqld.cc b/sql/mysqld.cc
index 90760ba..bd9fa38 100644
--- a/sql/mysqld.cc
+++ b/sql/mysqld.cc
@@ -5362,6 +5362,13 @@ a file name for --log-bin-index option", opt_binlog_index_name);
       }
     }
   }
+
+  /* 
+   * Forcing a new setwd in case the SST mounted the datadir
+   */
+  if (my_setwd(mysql_real_data_home,MYF(MY_WME)) && !opt_help)
+    unireg_abort(1);        /* purecov: inspected */
+
   if (opt_bin_log)
   {
     /*

With this patch, I added a new my_setwd call right after the SST completed. The Percona engineering team approved the patch, and it should be added to the upcoming release of Percona XtraDB Cluster.

The Galera library is the other source of opened files before the SST. Here, the fix is just in the configuration. You must define the base_dir Galera provider option outside of the datadir. For example, if you use /var/lib/mysql as datadir and cephmountpoint, then you should use:

wsrep_provider_options="base_dir=/var/lib/galera"

Of course, if you have other provider options, don’t forget to add them there.
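
A quick sanity check for this setting could look like the sketch below. The is_inside helper is my own illustration; it compares paths as plain strings and does not resolve symlinks.

```shell
# Returns 0 when directory $2 is the same as, or located under, directory $1.
is_inside() {
    parent=${1%/}; child=${2%/}
    case "$child/" in
        "$parent"/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example: datadir /var/lib/mysql, Galera base_dir /var/lib/galera.
if is_inside /var/lib/mysql /var/lib/galera; then
    echo "base_dir must be moved out of the datadir"
else
    echo "base_dir OK"
fi
```

Appending a trailing slash before matching keeps a sibling such as /var/lib/mysql2 from being mistaken for a subdirectory of /var/lib/mysql.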

Walkthrough

So, what are the steps required to use Ceph with Percona XtraDB Cluster? (I assume that you have a working Ceph cluster.)

1. Join the Ceph cluster

The first thing you need is a working Ceph cluster with the needed CephX credentials. While the setup of a Ceph cluster is beyond the scope of this post, we will address it in a subsequent post. For now, we’ll focus on the client side.

You need to install the Ceph client packages on each node. On my test servers using Ubuntu 14.04, I did:

wget -q -O- 'https://download.ceph.com/keys/release.asc' | sudo apt-key add -
sudo apt-add-repository 'deb http://download.ceph.com/debian-infernalis/ trusty main'
sudo apt-get update
sudo apt-get install ceph

These commands also installed all the dependencies. Next, I copied the Ceph cluster configuration file /etc/ceph/ceph.conf:

[global]
fsid = 87671417-61e4-442b-8511-12659278700f
mon_initial_members = odroid1, odroid2
mon_host = 10.2.2.100, 10.2.2.20, 10.2.2.21
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
osd_journal = /var/lib/ceph/osd/journal
osd_journal_size = 128
osd_pool_default_size = 2

and the authentication file /etc/ceph/ceph.client.admin.keyring from another node. I made sure these files were readable by all. You can define more refined privileges for a production system with CephX, the security layer of Ceph.

Once everything is in place, you can test if it is working with this command:

root@PXC3:~# ceph -s
    cluster 87671417-61e4-442b-8511-12659278700f
     health HEALTH_OK
     monmap e2: 3 mons at {odroid1=10.2.2.20:6789/0,odroid2=10.2.2.21:6789/0,serveur-famille=10.2.2.100:6789/0}
            election epoch 474, quorum 0,1,2 odroid1,odroid2,serveur-famille
     mdsmap e204: 1/1/1 up {0=odroid3=up:active}
     osdmap e995: 4 osds: 4 up, 4 in
      pgmap v275501: 1352 pgs, 5 pools, 321 GB data, 165 kobjects
            643 GB used, 6318 GB / 7334 GB avail
                1352 active+clean
  client io 16491 B/s rd, 2425 B/s wr, 1 op/s

The output shows the current state of the Ceph cluster.
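
If you want to script this check, for instance as a pre-flight test before starting MySQL, a minimal sketch could look like this. The ceph_is_healthy function is hypothetical; it simply looks for HEALTH_OK in the status text it receives on standard input.

```shell
# Reads Ceph status text on stdin and succeeds only if it reports HEALTH_OK.
ceph_is_healthy() {
    grep -q 'HEALTH_OK'
}

# On a node with a configured Ceph client, this could be used as:
#   ceph -s | ceph_is_healthy && echo "Ceph cluster is healthy"
```

Reading from standard input keeps the function independent of the exact command that produced the status.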

2. Create the Ceph pool

Before we can use Ceph, we need to create a first RBD image, put a filesystem on it and mount it for MySQL on the bootstrap node. We need at least one Ceph pool since the RBD images are stored in a Ceph pool. We create a Ceph pool with the command:

ceph osd pool create mysqlpool 512 512 replicated

Here, we have defined the pool mysqlpool with 512 placement groups. On a larger Ceph cluster, you might need to use more placement groups (again, a topic beyond the scope of this post). The pool we just created is replicated. Each object in the pool will have two copies as defined by the osd_pool_default_size parameter in the ceph.conf file. If needed, you can modify the size of a pool and its replication factor at any moment after the pool is created.

3. Create the first RBD image

Now that we have a pool, we can create a first RBD image:

root@PXC1:~# rbd -p mysqlpool create PXC --size 10240 --image-format 2

and “map” the RBD image to a host block device:

root@PXC1:~# rbd -p mysqlpool map PXC
/dev/rbd1

The map command returns the local RBD block device that corresponds to the RBD image. The rest of the steps are not specific to RBD images. We need to create a filesystem and prepare the mount points:

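# Create the filesystem and mount it as the MySQL datadir
mkfs.xfs /dev/rbd1
mount /dev/rbd1 /var/lib/mysql -o rw,noatime,nouuid
chown mysql:mysql /var/lib/mysql
# Initialize the datadir (bootstrap node only)
mysql_install_db --datadir=/var/lib/mysql --user=mysql
# Galera base_dir, kept outside the datadir
mkdir /var/lib/galera
chown mysql:mysql /var/lib/galera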

You need to mount the RBD device and run the mysql_install_db tool only on the bootstrap node. You need to create the directories /var/lib/mysql and /var/lib/galera on the other nodes and adjust the permissions similarly.

4. Modify the my.cnf files

You will need to set or adjust the specific wsrep_sst_ceph settings in the my.cnf file of all the servers. Here are the relevant lines from the my.cnf file of one of my cluster nodes:

[mysqld]
wsrep_provider=/usr/lib/libgalera_smm.so
wsrep_provider_options="base_dir=/var/lib/galera"
wsrep_cluster_address=gcomm://10.0.5.120,10.0.5.47,10.0.5.48
wsrep_node_address=10.0.5.48
wsrep_sst_method=ceph
wsrep_cluster_name=ceph_cluster

[sst]
cephlocalpool=mysqlpool
cephmountoptions=rw,noatime,nodiratime,nouuid
cephkeyring=/etc/ceph/ceph.client.admin.keyring
cephcleanup=1

At this point, we can bootstrap the cluster on the node where we mounted the initial RBD image:

/etc/init.d/mysql bootstrap-pxc

5. Start the other XtraDB Cluster nodes

The first node does not perform an SST, so nothing exciting so far. With the patched version of MySQL (the above patch), starting MySQL on a second node triggers a Ceph SST operation. In my test environment, the SST takes about five seconds to complete on low-powered VMs. Interestingly, the duration is not directly related to the dataset size. Because of this, a much larger dataset, on a quiet database, should take about the same time. A very busy database may need more time, since an SST requires a “flush tables with read lock” at some point.

So, after their respective Ceph SST, the other two nodes have:

root@PXC2:~# mount | grep mysql
/dev/rbd1 on /var/lib/mysql type xfs (rw,noatime,nodiratime,nouuid)
root@PXC2:~# rbd showmapped
id pool      image           snap device    
1  mysqlpool PXC2-1463776424 -    /dev/rbd1

root@PXC3:~# mount | grep mysql
/dev/rbd1 on /var/lib/mysql type xfs (rw,noatime,nodiratime,nouuid)
root@PXC3:~# rbd showmapped
id pool      image           snap device    
1  mysqlpool PXC3-1464118729 -    /dev/rbd1

The original RBD image now has two snapshots, each one the parent of a clone mounted by one of the other two nodes:

root@PXC3:~# rbd -p mysqlpool ls
PXC
PXC2-1463776424
PXC3-1464118729
root@PXC3:~# rbd -p mysqlpool info PXC2-1463776424
rbd image 'PXC2-1463776424':
        size 10240 MB in 2560 objects
        order 22 (4096 kB objects)
        block_name_prefix: rbd_data.108b4246146651
        format: 2
        features: layering
        flags:
        parent: mysqlpool/PXC@1463776423
        overlap: 10240 MB
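
To follow the relationship between a clone and its parent snapshot in a script, something like this sketch could help. The clone_parent function is my own illustration and assumes the `rbd info` output format shown above.

```shell
# Reads `rbd info` output on stdin and prints the parent snapshot
# (pool/image@snap) when the image is a clone; prints nothing otherwise.
clone_parent() {
    awk '$1 == "parent:" { print $2 }'
}

# On a node with access to the pool, this could be used as:
#   rbd -p mysqlpool info PXC2-1463776424 | clone_parent
# Going the other way, these rbd commands list the snapshots of an image
# and the clones that depend on a given snapshot:
#   rbd snap ls mysqlpool/PXC
#   rbd children mysqlpool/PXC@1463776423
```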

Discussion

Apart from allowing faster SST, what other benefits do we get from using Ceph with Percona XtraDB Cluster?

The first benefit is that the inherent data duplication over the network removes the need for local data replication. Thus, instead of using raid-10 or raid-5 with an array of disks, we could use a simple raid-0 stripe set if the data is already replicated to more than one server.

The second benefit is a bit less obvious: you don’t need as much storage. Why? A Ceph clone only stores the delta from its original snapshot. So, for large, read-intensive datasets, the disk space savings can be very significant. Of course, over time, the clone will drift away from its parent snapshot and will use more and more space. When we determine that a Ceph clone uses too much disk space, we can simply refresh the clone by restarting MySQL and forcing a full SST. The SST script will automatically drop the old clone and snapshot when the cephcleanup option is set, and it will create a new fresh clone. You can easily evaluate how much space is consumed by the clone using the following commands:

root@PXC2:~# rbd -p mysqlpool du PXC2-1463776424
warning: fast-diff map is not enabled for PXC2-1463776424. operation may be slow.
NAME            PROVISIONED USED 
PXC2-1463776424      10240M 164M
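
A refresh decision based on that output could be sketched as follows. Both functions are hypothetical, the 5120 MB limit is an arbitrary example, and the parsing assumes the column layout and the M suffix shown above (other rbd versions may format sizes differently).

```shell
# Reads `rbd du` output on stdin and prints the USED value, in MB, for the
# given image. Prints nothing if the value does not end in "M".
clone_used_mb() {
    image=$1
    awk -v img="$image" '$1 == img && $3 ~ /^[0-9]+M$/ { sub(/M$/, "", $3); print $3 }'
}

# Succeeds when the used space exceeds the limit (both in MB).
needs_refresh() {
    used=$1; limit=$2
    [ "$used" -gt "$limit" ]
}

# On a node with access to the pool, this could be used as:
#   used=$(rbd -p mysqlpool du PXC2-1463776424 2>/dev/null | clone_used_mb PXC2-1463776424)
#   needs_refresh "$used" 5120 && echo "time to force a new SST"
```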

Also, nothing prevents you from using a different configuration of Ceph pools in the same XtraDB Cluster. A Ceph clone can use a different pool than its parent snapshot. That’s the whole purpose of the cephlocalpool parameter. Strictly speaking, you only need one node to use a replicated pool, as the other nodes could run on clones that store their data in a non-replicated pool (saving a lot of storage space). Furthermore, we can define the OSD affinity of the non-replicated pool in a way that it stores data on the host where it is used, reducing the cross-node network latency.
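
As a rough sketch, setting up such a non-replicated pool could involve commands along these lines. The function only prints them. The pool name localpool and the placement group count are examples I made up, and newer Ceph releases guard a replication size of 1 behind extra confirmation settings, so treat this as a starting point rather than a recipe.

```shell
# Dry-run sketch: prints the commands to create a pool and drop its
# replication factor to 1 (a single copy of each object).
local_pool_plan() {
    pool=$1; pgs=$2
    echo "ceph osd pool create ${pool} ${pgs} ${pgs} replicated"
    echo "ceph osd pool set ${pool} size 1"
    echo "ceph osd pool set ${pool} min_size 1"
}

local_pool_plan localpool 128
```

A node would then point at that pool with cephlocalpool=localpool in the [sst] section of its my.cnf file.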

Using Ceph for XtraDB Cluster SST operation demonstrates one of the array of possibilities offered to MySQL by Ceph. We continue to work with the Red Hat team and Red Hat Ceph Storage architects to find new and useful ways of addressing database issues in the Ceph environment. There are many more posts to come, so stay tuned!