Over the last year, the Ceph world drew me in. Partly because of my taste for distributed systems, but also because I think Ceph represents a great opportunity for MySQL specifically and databases in general. The shift from local storage to distributed storage is similar to the shift from bare disks host configuration to LVM-managed disks configuration.
Most of the work I’ve done with Ceph was in collaboration with folks from RedHat (mainly Brent Compton and Kyle Bader). This work resulted in a number of talks presented at the Percona Live conference in April and the RedHat Summit San Francisco at the end of June. I could write a lot about using Ceph with databases, and I hope this post is the first in a long series on Ceph. Before I starting with use cases, setup configurations and performance benchmarks, I think I should quickly review the architecture and principles behind Ceph.Introduction to Ceph
Inktank created Ceph a few years ago as a spin-off of the hosting company DreamHost. RedHat acquired Inktank in 2014 and now offers it as a storage solution. OpenStack uses Ceph as its dominant storage backend. This blog, however, focuses on a more general review and isn’t restricted to a virtual environment.
A simplistic way of describing Ceph is to say it is an object store, just like S3 or Swift. This is a true statement but only up to a certain point. There are minimally two types of nodes with Ceph, monitors and object storage daemons (OSDs). The monitor nodes are responsible for maintaining a map of the cluster or, if you prefer, the Ceph cluster metadata. Without access to the information provided by the monitor nodes, the cluster is useless. Redundancy and quorum at the monitor level are important.
Any non-trivial Ceph setup has at least three monitors. The monitors are fairly lightweight processes and can be co-hosted on OSD nodes (the other node type needed in a minimal setup). The OSD nodes store the data on disk, and a single physical server can host many OSD nodes – though it would make little sense for it to host more than one monitor node. The OSD nodes are listed in the cluster metadata (the “crushmap”) in a hierarchy that can span data centers, racks, servers, etc. It is also possible to organize the OSDs by disk types to store some objects on SSD disks and other objects on rotating disks.
With the information provided by the monitors’ crushmap, any client can access data based on a predetermined hash algorithm. There’s no need for a relaying proxy. This becomes a big scalability factor since these proxies can be performance bottlenecks. Architecture-wise, it is somewhat similar to the NDB API, where – given a cluster map provided by the NDB management node – clients can directly access the data on data nodes.
Ceph stores data in a logical container call a pool. With the pool definition comes a number of placement groups. The placement groups are shards of data across the pool. For example, on a four-node Ceph cluster, if a pool is defined with 256 placement groups (pg), then each OSD will have 64 pgs for that pool. You can view the pgs as a level of indirection to smooth out the data distribution across the nodes. At the pool level, you define the replication factor (“size” in Ceph terminology).
The recommended values are a replication factor of three for spinners and two for SSD/Flash. I often use a size of one for ephemeral test VM images. A replication factor greater than one associates each pg with one or more pgs on the other OSD nodes. As the data is modified, it is replicated synchronously to the other associated pgs so that the data it contains is still available in case an OSD node crashes.
So far, I have just discussed the basics of an object store. But the ability to update objects atomically in place makes Ceph different and better (in my opinion) than other object stores. The underlying object access protocol, rados, updates an arbitrary number of bytes in an object at an arbitrary offset, exactly like if it is a regular file. That update capability allows for much fancier usage of the object store – for things like the support of block devices, rbd devices, and even a network file systems, cephfs.
When using MySQL on Ceph, the rbd disk block device feature is extremely interesting. A Ceph rbd disk is basically the concatenation of a series of objects (4MB objects by default) that are presented as a block device by the Linux kernel rbd module. Functionally it is pretty similar to an iSCSI device as it can be mounted on any host that has access to the storage network and it is dependent upon the performance of the network.The benefits of using Ceph
In a world striving for virtualization and containers, Ceph gives easily moves database resources between hosts.
On a single host, you have access only to the IO capabilities of that host. With Ceph, you basically put in parallel all the IO capabilities of all the hosts. If each host can do 1000 iops, a four-node cluster could reach up to 4000 iops.
Ceph replicates data at the storage level, and provides resiliency to storage node crash. A kind of DRBD on steroids…
Ceph rbd block devices support snapshots, which are quick to make and have no performance impacts. Snapshots are an ideal way of performing MySQL backups.
You can clone and mount Ceph snapshots as block devices. This is a useful feature to provision new database servers for replication, either with asynchronous replication or with Galera replication.
Of course, nothing is free. Ceph use comes with some caveats.
Ceph reaction to a missing OSD
If an OSD goes down, the Ceph cluster starts copying data with fewer copies than specified. Although good for high availability, the copying process significantly impacts performance. This implies that you cannot run a Ceph with a nearly full storage, you must have enough disk space to handle the loss of one node.
The “no out” OSD attribute mitigates this, and prevents Ceph from reacting automatically to a failure (but you are then on your own). When using the “no out” attribute, you must monitor and detect that you are running in degraded mode and take action. This resembles a failed disk in a RAID set. You can choose this behavior as default with the mon_osd_auto_mark_auto_out_in setting.
Every day and every week (deep), Ceph scrubs operations that, although they are throttled, can still impact performance. You can modify the interval and the hours that control the scrub action. Once per day and once per week are likely fine. But you need to set osd_scrub_begin_hour and osd_scrub_end_hour to restrict the scrubbing to off hours. Also, scrubbing throttles itself to not put too much load on the nodes. The osd_scrub_load_threshold variable sets the threshold.
Ceph has many parameters so that tuning Ceph can be complex and confusing. Since distributed systems push hardware, properly tuning Ceph might require things like distributing interrupt load among cores and thread core pinning, handling of Numa zones – especially if you use high-speed NVMe devices.
Hopefully, this post provided a good introduction to Ceph. I’ve discussed the architecture, the benefits and the caveats of Ceph. In future posts, I’ll present use cases with MySQL. These cases include performing Percona XtraDB Cluster SST operations using Ceph snapshots, provisioning async slaves and building HA setups. I also hope to provide guidelines on how to build and configure an efficient Ceph cluster.
Finally, a note for the ones who think cost and complexity put building a Ceph cluster out of reach. The picture below shows my home cluster (which I use quite heavily). The cluster comprises four ARM-based nodes (Odroid-XU4), each with a two TB portable USB-3 hard disk, a 16 GB EMMC flash disk and a gigabit Ethernet port.
I won’t claim record breaking performance (although it’s decent), but cost-wise it is pretty hard to beat (at around $600)!
The Percona Live Europe, Amsterdam call for papers is ending soon and we are looking for MongoDB speakers! This is a great way to build your personal and company brands. If you haven’t submitted a paper yet, here are a list of ideas we would love to see covered at this conference:
If you find any of these ideas interesting, simply let @Percona know and we can help get you listed as the speaker. If nothing on this list strikes your fancy or peaks your interest, please submit a similar talk of your own – we’d love to find out what you have to say!
Want to become one of the Percona Live Europe MongoDB speakers? Read this blog for details.
The Percona Live Europe, Amsterdam call for papers is ending soon, and we are looking for MongoDB speakers! This is a great way to build your personal and company brands. If you haven’t submitted a paper yet, here are a list of ideas we would love to see covered at this conference:
If you any these ideas interesting, simply let @Percona know and we can help get you listed as the speaker. If nothing on this list strikes your fancy or peaks your interest, please submit a similar talk of your own – we’d love to find out what you have to say!
Here are some other ideas that might get your thoughts bubbling:
The Percona team and conference commitee would love to see what other ideas the community has that we haven’t covered. Anything helps: using @Percona and mentioning topics you would like to see, to sharing topics on twitter you like, or even just sharing the link to the call for papers.
The call for papers closes next Monday (7/18), so let’s get some great things in this week and build a truly dynamic conference!
Percona Server for MongoDB 3.0.12-1.8 is an enhanced, open source, fully compatible, highly scalable, zero-maintenance downtime database supporting the MongoDB v3.0 protocol and drivers. Based on MongoDB 3.0.12, it extends MongoDB with MongoRocks and PerconaFT storage engines, as well as features like external authentication and audit logging. Percona Server for MongoDB requires no changes to MongoDB applications or code.
NOTE: The MongoRocks storage engine is still under development. There is currently no officially released version of MongoRocks recommended for production.
This release includes all changes from MongoDB 3.0.12, and the following known issue that will be fixed in a future release:
You can find the release notes in the official documentation.
Please join Sveta Smirnova for a webinar Thursday, July 14 at 10 am PDT (UTC-7) on an Introduction Into Storage Engine Troubleshooting.
The number of MySQL storage engines provide great flexibility for database users, administrators and developers. At the same time, engines add an extra level of complexity when it comes to troubleshooting issues. Before choosing the right troubleshooting tool, you need to answer the following questions (and often others):
This webinar will discuss these questions and how to find the right answers across all storage engines in a general sense.
You will also learn:
. . . and more.
Note: We will hold a separate webinar specifically for InnoDB.Sveta Smirnova, Principal Technical Services Engineer Sveta joined Percona in 2015. Her main professional interests are problem-solving, working with tricky issues, bugs, finding patterns which can solve typical issues quicker, teaching others how to deal with MySQL issues, bugs and gotchas effectively. Before joining Percona Sveta worked as a Support Engineer in the MySQL Bugs Analysis Support Group in MySQL AB-Sun-Oracle. She is the author of the book “MySQL Troubleshooting” and JSON UDF functions for MySQL.
For general inquiries, please send us your question and someone will contact you.