Facebook, with 1.49 billion monthly active users, is one of the world’s top MySQL users. Simon Martin, a production engineer on Facebook’s MySQL Infrastructure team, has been working with MySQL for most of his career, starting from 2 servers built out of spare parts and moving through to one of the largest deployments in the world.
Simon will be sharing some of the challenges Facebook has tackled as it rolled out semi-synchronous replication across our different services at Percona Live Amsterdam on Sept. 22. His talk is aptly titled, “The highs and lows of semi-synchronous replication.” I sat down, virtually, with Simon the other day. Our conversation is below, but first, as a special reward to my readers, save €20 on your Percona Live registration by entering promo code “BlogInterview” at registration. Please feel free to share this offer!
Tom: On a scale from 1-10, how important is MySQL to Facebook? And how does Facebook use MySQL?
Simon: 10. We have a sophisticated in memory caching layer that will serve most requests, but MySQL is the persistent store for our graph. This means all your profile data, all your friends, likes and comments and the same for pages, events, places and the rest are stored permanently in MySQL.
We rely on MySQL in this role for 3 key features. Firstly as the final store it needs to not lose data, and InnoDB is well proven in this space. It needs to be highly available, MySQL and InnoDB are both very stable and we use replication as well to provide redundancy. Finally, even with extensive caching, it needs to be performant, both in latency and throughput, MySQL is both and we can use replication again to spread the read traffic to slaves in remote regions to help here too.
Tom: What are some of the advantages of using Semi-Synchronous Replication at Facebook — and what are the challenges for deployments of that size when using it?
Simon: That’s a big question, I could probably talk for 50 minutes on it! We started looking at Semi-Synchronous as a solution to reduce downtime when a MySQL master, or the host it’s on, crashes. Historically, if you are running a replicated environment and the master crashes, you are faced with a choice. You could promote another slave right away to reduce downtime, but it’s impossible to be sure that any of your slaves got all the transactions off the master. At Facebook we cannot lose people’s data, so we always chose to recover the master and re-connect the slaves before promoting if required. The downside is recovering InnoDB on a busy host can be slow, and if the host is rebooted it will be even slower, giving us many minutes of downtime.
Now that we run Semi-Synchronous replication it means that a master will not commit a transaction until at least one slave has acknowledged receipt of the binary logs for that transaction. With this running when a master crashes we can be sure our most up-to-date slave has all the data, so once it’s applied by the SQL thread we can promote safely without waiting for crash recovery.
There are many challenges in this though. Firstly there is performance, we now need a network round trip for each transaction, so we need the acknowledging slaves to be very close. Slaves in a different data hall, let-alone a different region, will be too slow.
We also need to pay attention to slave availability, previously not having a slave connected to a master for a short time was not a problem, now this will cause writes to stop and connections pile up, so we need to be much more careful about how we manage our replication topology. A target of 99.999% uptime for a service now requires the same SLA on slaves being available and connected locally to acknowledge the commits.
On top of this running at “webscale” adds a layer of requirements of its own. Like the rest of our environment everything needs to be automated, anything that requires a human is not going to scale. So our automation needs to respond to any failure and heal the system without intervention in any circumstance. An edge case that has even a tiny chance of occurring on a given day needs to be handled automatically, to keep our SLA and to stop our engineers constantly having to fix things.
Tom: What are you looking forward to the most at this year’s conference (besides your own talk)?
Simon: I always enjoy the keynotes, they don’t all seem to be announced yet but it’s a great way to get a state of the community update. I’ll certainly stop by “Binlog Servers at Booking.com,” it sounds like they might be doing the same kind of things we are for Semi-Synchronous replication, so it’ll be great to compare ideas. I’ll also be looking at the talks on MySQL 5.7 to get the scoop on what cool new stuff is coming down the pipeline!
The post Facebook’s Simon Martin on semi-synchronous replication appeared first on MySQL Performance Blog.
After some recent work with systemd I’ve realized it’s power and I can come clean that I am a fan. I realize that there are multitudes of posts out there with arguments both for and against systemd but let’s look at some nice ways to have systemd provide us with (but not limited to) pt-kill-as-a-service.
This brief post introduces you to a systemd unit file and how we can leverage them to enable pt-kill at startup, have it start after mysqld and ensure that MySQL is running by using the mysql service as a dependency of pt-kill. By using systemd to handle this we don’t have to complicate matters by ‘monitoring the monitor’ using hacky shell scripts, cron or utilities like monit.
So then, a quick primer on systemd, because lets face it, we’ve all been avoiding it. Systemd is not new but it made recent headlines in the Linux world due to some of the major distros announcing their intentions to migrate upcoming releases to systemd.
What is it? Well due to it’s depth it is best described as a suite of management daemons, libraries and tools that will replace the traditional init scripts. So essentially remember how you start a service, mount a volume or read the system logs…well start forgetting all of that because systemd is disrupting this space. With systemd comes some really neat tricks for administering your machines and I’m really only beginning to see the tip of this iceberg. There is admittedly a lot to learn with systemd but this should serve as pragmatic entrée.Systemd what? When did this happen?Linux distributionDate released as defaultArch Linux000000002012-10-01-0000October 2012CoreOS000000002013-10-01-0000October 2013 (v94.0.0)Debian000000002015-04-01-0000April 2015 (v8 aka jessie)Fedora000000002011-05-01-0000May 2011 (v15)Gentoo LinuxN/AMageia000000002012-05-01-0000May 2012 (v2.0)openSUSE000000002012-09-01-0000September 2012 (v12.2)Red Hat Enterprise Linux000000002014-06-01-0000June 2014 (v7.0)SlackwareN/ASUSE Linux Enterprise Server000000002014-10-01-0000October 2014 (v12)Ubuntu000000002015-04-01-0000April 2015 (v15.04)
Lennart Poettering, the name frequently attached with systemd is seeking to modernize the most fundamental process(es) of the Linux startup system, bringing the paradigms of modern computing; concurrency, parallelism and efficiency. The dependency tree of processes and services is more intuitive and the structure of the underlying startup scripts are unified. I feel that the direction proposed by systemd is an evolutionary one which promotes consistency within the startup scripts enabling conventions that can be easier understood by a broader audience.Systemd and Percona Toolkit
This post aims to show that we can rely on systemd to handle processes such as pt-kill, pt-stalk, and other daemonized scripts that we like to have running perpetually, are fired at startup and can be reinstated after failure.
The scenario is this; I want pt-kill to drop all sleeping connections from a certain application user, lets call them, ‘caffeinebob’, because they never close connections. Due to various reasons we can’t make changes in the application so we’re employing Percona Toolkit favourite, pt-kill, to do this for us. For convenience we want this result to persist across server restarts. In the olden days we might have some cron job that fires a shell script in combination with a sentinal file to ensure it’s running. I’m pretty sure that this kitty could be skinned many ways.The systemd Unit File
After some research and testing, the below unit file will play nicely on a Centos 7 node with systemd at it’s core. In this example I am running Percona Server 5.6 installed using Percona’s yum repo with the mysql.service unit file generated at installation. I suspect that there could be some systemd deviation with other MySQL variants however, this configuration is working for me.[Unit] Description = pt-kill caffeinebob After=syslog.target mysql.service Requires=mysql.service [Service] Type = simple PIDFile = /var/run/ptkill.pid ExecStart = /usr/bin/pt-kill --daemonize --pid=/var/run/ptkill.pid --interval=5 --defaults-file=/root/.my.cnf --log=/var/log/ptkill.log --match-user caffeinebob --busy-time 10 --kill --print Restart=on-abort [Install] WantedBy=multi-user.target
Let’s examine the above and see what we’re working with. Systemd unit files have various biologies. The example above is a simple Service unit file. This means we are enacting a process controlled and supervised by systemd. The significance of the After directive is that this service will not attempt startup until after syslog.target and mysql.service have been called. The Required directive is makes ptkill.service dependant on the mysql.service startup being successful.
The next part, the [Service] grouping, details the actions to be taken by the service. The Type can be one of many but as it’s a simple call to a script I’ve used the simple type. We are describing the command and the handling of it. The ExecStart is evidently the pt-kill command that we would usually run from the shell prompt or from within a shell script. This is a very corse example because we can opt to parameterize the command with the assistance of an Environment file. Note the use of the Restart directive, used so that systemd can handle a reaction should a failure occur that interrupts the process.
Finally under the [Install] grouping we’re telling systemd that this service should startup on a multi user system, and could be thought of as runlevel 2 or 3 (Multiuser mode).
So providing that we’ve got all the relevant paths, users and dependencies in place, once you reboot your host, mysql.service should in order, initiate mysqld and when that dependency is met, systemd will initiate pt-kill with our desired parameters to cull connections that meet the criteria stipulated in our configuration. This means you rely on systemd to manage pt-kill for you and you don’t necessarily need to remember to start this or similar processes when you restart you node.Start up & enable
No feedback but no errors so we can check the status of the service[moore@localhost ~]$ sudo systemctl status ptkill -l ptkill.service - keep pt-kill persistent across restarts Loaded: loaded (/etc/systemd/system/ptkill.service; enabled) Active: active (running) since Wed 2015-08-12 02:39:13 BST; 1h 19min ago Main PID: 2628 (perl) CGroup: /system.slice/ptkill.service └─2628 perl /usr/bin/pt-kill --daemonize --pid=/var/run/ptkill.pid --interval=5 --defaults-file=/root/.my.cnf --log=/var/log/ptkill.log --match-user caffeinebob --busy-time 10 --kill --print
Perfect we can also instruct systemd to disable this and|or stop our service when the application is changed and caffeinebob close() all those open connections.[moore@localhost ~]$ sudo systemctl stop ptkill.service [moore@localhost ~]$ sudo systemctl disable ptkill.service
Now after successful implementation we see that our process is running delightfully;[moore@localhost ~]$ ps -ef | grep pt-kill root 2547 1 0 02:37 ? 00:00:00 perl /usr/bin/pt-kill --daemonize --pid=/var/run/ptkill.pid --interval=5 --defaults-file=/root/.my.cnf --log=/var/log/ptkill.log --match-user caffeinebob --busy-time 10 --kill --printCatch me if I fall
Lets issue a kill signal to the process and observe it’s behaviour using journalctl[moore@localhost ~]$ sudo kill -SEGV 2547
This will write similar entries into the system log;[moore@localhost ~]$ sudo journalctl -xn -f Aug 12 02:39:13 localhost.localdomain sudo: moore : TTY=pts/1 ; PWD=/home/moore ; USER=root ; COMMAND=/bin/kill -SEGV 2547 Aug 12 02:39:13 localhost.localdomain systemd: ptkill.service: main process exited, code=killed, status=11/SEGV Aug 12 02:39:13 localhost.localdomain systemd: Unit ptkill.service entered failed state. Aug 12 02:39:13 localhost.localdomain systemd: ptkill.service holdoff time over, scheduling restart. Aug 12 02:39:13 localhost.localdomain systemd: Stopping keep pt-kill persistent across restarts... -- Subject: Unit ptkill.service has begun shutting down -- Defined-By: systemd -- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel -- -- Unit ptkill.service has begun shutting down. Aug 12 02:39:13 localhost.localdomain systemd: Starting keep pt-kill persistent across restarts... -- Subject: Unit ptkill.service has begun with start-up -- Defined-By: systemd -- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel -- -- Unit ptkill.service has begun starting up. Aug 12 02:39:13 localhost.localdomain systemd: Started keep pt-kill persistent across restarts. -- Subject: Unit ptkill.service has finished start-up -- Defined-By: systemd -- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel -- -- Unit ptkill.service has finished starting up. -- -- The start-up result is done.
Pt-kill flaps after the kill signal but systemd has been instructed to restart on failure so we don’t see caffeinebob saturate our processlist with sleeping connections.
Another bonus with this workflow is use within orchestration. Any standardized unit files can be propagated to your fleet of hosts with tools such as Ansible, Chef, Puppet or Saltstack.Closing note
I’d love to hear from the pragmatists from the systemd world to understand if this approach can be improved or whether there are any flaws in this example unit file that would require addressing. This is very much a new-school of thought for me and feedback is both welcome and encouraged.
Thank you for your time, happy systemd-ing.
Percona is glad to announce the release of Percona Server 5.5.45-37.4 on September 2, 2015. Based on MySQL 5.5.45, including all the bug fixes in it, Percona Server 5.5.45-37.4 is now the current stable release in the 5.5 series.
Booking.com, one of the world’s leading e-commerce companies, helps travels book nearly 1 million rooms per night. Established in 1996, Booking.com B.V. guarantees the best prices for any type of property, from small, family-run bed and breakfasts to executive apartments and five-star luxury suites.
The travel website is also a dedicated contributor to the MySQL and Perl community. Other open source technologies include CentOS Linux, Nginx, python, puppet, Git and more.
A Diamond sponsor of Percona Live Amsterdam Sept. 21-23, you can meet the people who power Booking.com at booth 205. Enter promo code “BlogInterview” at registration to save €20!
In the meantime, meet Jean-François Gagné, a system engineer at Booking.com. He’ll be presenting a couple of talks: “Riding the Binlog: an in Deep Dissection of the Replication Stream” and “Binlog Servers at Booking.com.”
Tom: Hi Jean-François, in your session, “Riding the Binlog: an in Deep Dissection of the Replication Stream“, you talk about how we can think of the binary logs as a transport for a “Stream of Transactions”. What will be the top 3 things attendees will come away with following this 50-minute talk?
Jean-François: Hi Tom, thanks for this opportunity to give a sneak peak of my talk. The most important subject that will be discussed is that the binary logs evolves: by the usage of “log-slave-updates”, the stream can grow, shrink or morph. Said in another way: the binary logs of a slave can be very different from the binary logs of the master, and this should be taken into account when relying on those (including when replicating using intermediate master and when promoting a slave as a new master using GTIDs). We will also explore how the binary logs can be decomposed in sub-streams, or viewed as the multiplexing of many streams. We will also look for de-multiplexing functions and the new possibilities that are opened with that.
Tom: Percona Live, starting with this conference, has a new venue and a broader theme – now encompassing, in addition to MySQL, MongoDB, NoSQL and data in the cloud. Your thoughts? And what do think is missing – what would you change (if anything)?
Jean-François: I think you forget the best of all changes: going from a 2 day conference last year in London to a 3 day conference this year. This will allow better knowledge exchange and I am very happy about that. I think this event will be a success with a good balance of sessions focus on technologies and presentation about a specific use-case of those technologies. If I had one wish: I would like to see more sessions about specific use-cases of NoSQL technologies with and in deep discussion about why they are a better choice than more traditional solutions: maybe more of those sessions will be submitted next year.
Tom: Which other session(s) are you most looking forward to besides your own?
Jean-François: I will definitely attend the Facebook session about Semi-Synchronous Replication: it is very close to my interest, especially as Booking.com is thinking about using loss-less semi-sync replication in the future, and I look forward to hear war stories about this feature. All sessions dissecting internals of a technology (InnoDB, TokuDB, RocksDB, …) will also have my attention. Finally, it is always interesting to hear about how large companies are using databases, so I plan to attend the MySQL@Wikimedia session.
Tom: As a resident of Amsterdam, what are some of the must-do activities/sightseeing for those visiting for Percona Live from out of town?
Jean-François: Seeing the city from a high point is impressive, and you will have the opportunity of enjoying that view from the Booking.com office at the Community Dinner. Also, I recommend finding a bike and discover the city pedaling (there are many renting shops, just ask Google). From the conference venue, you can do a 70 minutes ride crossing three nice parks: the Westerpark, the Rembrandtpark and the Vondelpark – https://goo.gl/P13Mc7 – and you can discover the first of third park in a shorter ride (45 minutes). If you feel a little more adventurous, I recommend a 90 minute ride South following the Amstel: once out of Amsterdam, you will have the water on one side at the level of the road, and the fields (Polder) 3 meters below on the other side (https://goo.gl/OPDv5z). This will allow you to see for yourself why this place is called the “Low Countries”.
The post Booking dot yeah! Booking.com’s Jean-François Gagné on Percona Live Amsterdam appeared first on MySQL Performance Blog.
There can be a lot of confusion and lack of planning in Percona XtraDB Clusters in regards to nodes becoming desynchronized for various reasons. This can happen a few ways:
When I say “desynchronized” I mean a node that is permitted to build up a potentially large wsrep_local_recv_queue while some operation is happening. For example a node taking a backup would set wsrep_desync=ON during the backup and potentially fall behind replication some amount.
Some of these operations may completely block Galera from applying transactions, while others may simply increase load on the server enough that it falls behind and applies at a reduced rate.
In all the cases above, flow control is NOT used while the node cannot apply transactions, but it MAY be used while the node is recovering from the operation. For an example of this, see my last blog about IST.
If a cluster is fairly busy, then the flow control that CAN happen when the above operations catch up MAY be detrimental to performance.Example setup
Let us take my typical 3 node cluster with workload on node1. We are taking a blocking backup of some kind on node3 so we are executing the following steps:
This includes up through step 3 above. My node1 is unaffected by the backup on node3, I can see it averaging 5-6k writesets(transactions) per second which it did before we began:
node2 is also unaffected:
but node3 is not applying and its queue is building up:
Let’s examine briefly what happens when node3 is permitted to start applying, but wsrep_desync stays enabled:
node1’s performance is pretty much the same, node3 is not using flow control yet. However, there is a problem:
It’s hard to notice, but node3 is NOT catching up, instead it is falling further behind! We have potentially created a situation where node3 may never catch up.
The PXC nodes were close enough to the red-line of performance that node3 can only apply just about as fast (and somewhat slower until it heats up a bit) as new transactions are coming into node1.
This represents a serious concern in PXC capacity planning:
Nodes do not only need to be fast enough to handle normal workload, but also to catch up after maintenance operations or failures cause them to fall behind.
Experienced MySQL DBA’s will realize this isn’t all that different than Master/Slave replication.Flow Control as a way to recovery
So here’s the trick: if we turn off wsrep_desync on node3 now, node3 will use flow control if and only if the incoming replication exceeds node3’s apply rate. This gives node3 a good chance of catching up, but the tradeoff is reducing write throughput of the cluster. Let’s see what this looks like in context with all of our steps. wsrep_desync is turned off at the peak of the replication queue size on node3, around 12:20PM:
So at the moment node3 starts utilizing flow control to prevent falling further behind, our write throughput (in this specific environment and workload) is reduced by approximately 1/3rd (YMMV). The cluster will remain in this state until node3 catches up and returns to the ‘Synced’ state. This catchup is still happening as I write this post, almost 4 hours after it started and will likely take another hour or two to complete.
I can see a more realtime representation of this by using myq_status on node1, summarizing every minute:[root@node1 ~]# myq_status -i 1m wsrep mycluster / node1 (idx: 1) / Galera 3.11(ra0189ab) Cluster Node Outbound Inbound FlowC Conflct Gcache Appl time P cnf # stat laten msgs data que msgs data que pause snt lcf bfa ist idx %ef 19:58:47 P 5 3 Sync 0.9ms 3128 2.0M 0 27 213b 0 25.4s 0 0 0 3003k 16k 62% 19:59:47 P 5 3 Sync 1.1ms 3200 2.1M 0 31 248b 0 18.8s 0 0 0 3003k 16k 62% 20:00:47 P 5 3 Sync 0.9ms 3378 2.2M 32 27 217b 0 26.0s 0 0 0 3003k 16k 62% 20:01:47 P 5 3 Sync 0.9ms 3662 2.4M 32 33 266b 0 18.9s 0 0 0 3003k 16k 62% 20:02:47 P 5 3 Sync 0.9ms 3340 2.2M 32 27 215b 0 27.2s 0 0 0 3003k 16k 62% 20:03:47 P 5 3 Sync 0.9ms 3193 2.1M 0 27 215b 0 25.6s 0 0 0 3003k 16k 62% 20:04:47 P 5 3 Sync 0.9ms 3009 1.9M 12 28 224b 0 22.8s 0 0 0 3003k 16k 62% 20:05:47 P 5 3 Sync 0.9ms 3437 2.2M 0 27 218b 0 23.9s 0 0 0 3003k 16k 62% 20:06:47 P 5 3 Sync 0.9ms 3319 2.1M 7 28 220b 0 24.2s 0 0 0 3003k 16k 62% 20:07:47 P 5 3 Sync 1.0ms 3388 2.2M 16 31 251b 0 22.6s 0 0 0 3003k 16k 62% 20:08:47 P 5 3 Sync 1.1ms 3695 2.4M 19 39 312b 0 13.9s 0 0 0 3003k 16k 62% 20:09:47 P 5 3 Sync 0.9ms 3293 2.1M 0 26 211b 0 26.2s 0 0 0 3003k 16k 62%
This reports around 20-25 seconds of flow control every minute, which is consistent with that ~1/3rd of performance reduction we see in the graphs above.
Watching node3 the same way proves it is sending the flow control (FlowC snt):mycluster / node3 (idx: 2) / Galera 3.11(ra0189ab) Cluster Node Outbound Inbound FlowC Conflct Gcache Appl time P cnf # stat laten msgs data que msgs data que pause snt lcf bfa ist idx %ef 17:38:09 P 5 3 Dono 0.8ms 0 0b 0 4434 2.8M 16m 25.2s 31 0 0 18634 16k 80% 17:39:09 P 5 3 Dono 1.3ms 0 0b 1 5040 3.2M 16m 22.1s 29 0 0 37497 16k 80% 17:40:09 P 5 3 Dono 1.4ms 0 0b 0 4506 2.9M 16m 21.0s 31 0 0 16674 16k 80% 17:41:09 P 5 3 Dono 0.9ms 0 0b 0 5274 3.4M 16m 16.4s 27 0 0 22134 16k 80% 17:42:09 P 5 3 Dono 0.9ms 0 0b 0 4826 3.1M 16m 19.8s 26 0 0 16386 16k 80% 17:43:09 P 5 3 Jned 0.9ms 0 0b 0 4957 3.2M 16m 18.7s 28 0 0 83677 16k 80% 17:44:09 P 5 3 Jned 0.9ms 0 0b 0 3693 2.4M 16m 27.2s 30 0 0 131k 16k 80% 17:45:09 P 5 3 Jned 0.9ms 0 0b 0 4151 2.7M 16m 26.3s 34 0 0 185k 16k 80% 17:46:09 P 5 3 Jned 1.5ms 0 0b 0 4420 2.8M 16m 25.0s 30 0 0 245k 16k 80% 17:47:09 P 5 3 Jned 1.3ms 0 0b 1 4806 3.1M 16m 21.0s 27 0 0 310k 16k 80%
There are a lot of flow control messages (around 30) per minute. This is a lot of ON/OFF toggles of flow control where writes are briefly delayed rather than a steady “you can’t write” for 20 seconds straight.
It also interestingly spends a long time in the Donor/Desynced state (even though wsrep_desync was turned OFF hours before) and then moves to the Joined state (this has the same meaning as during an IST).Does it matter?
As always, it depends.
If these are web requests and suddenly the database can only handle ~66% of the traffic, that’s likely a problem, but maybe it just slows down the website somewhat. I want to emphasize that WRITES are what is affected here. Reads on any and all nodes should be normal (though you probably don’t want to read from node3 since it is so far behind).
If this were some queue processing that had reduced throughput, I’d expect it to possibly catch up later
This can only be answered for your application, but the takeaways for me are:
Graphs in this post courtesy of VividCortex.
The post High-load clusters and desynchronized nodes on Percona XtraDB Cluster appeared first on MySQL Performance Blog.
Percona is pleased to announce the availability of Percona Toolkit 2.2.15. Released August 28, 2015. Percona Toolkit is a collection of advanced command-line tools to perform a variety of MySQL server and system tasks that are too difficult or complex for DBAs to perform manually. Percona Toolkit, like all Percona software, is free and open source.
This release is the current GA (Generally Available) stable release in the 2.2 series. It includes multiple bug fixes as well as continued preparation for MySQL 5.7 compatibility. Full details are below. Downloads are available here and from the Percona Software Repositories.