Buy Percona SupportBuy Now
Subscribe to Latest MySQL Performance Blog posts feed
Updated: 2 hours 6 min ago

Docker Images for MySQL Group Replication 5.7.14

August 8, 2016 - 9:06am

In this post, I will point you to Docker images for MySQL Group Replication testing.

There is a new release of MySQL Group Replication plugin for MySQL 5.7.14. It’s a “beta” plugin and it is probably the last (or at lease one of the final pre-release packages) before Group Replication goes GA (during Oracle OpenWorld 2016, in our best guess).

Since it is close to GA, it would be great to get a better understanding of this new technology. Unfortunately, MySQL Group Replication installation process isn’t very user-friendly.

Or, to put it another way, totally un-user-friendly! It consists of a mere “50 easy steps” – by which I think they mean “easy” to mess up.

Matt Lord, in his post http://mysqlhighavailability.com/mysql-group-replication-a-quick-start-guide/, acknowledges: “getting a working MySQL service consisting of 3 Group Replication members is not an easy “point and click” or automated single command style operation.”

I’m not providing a review of MySQL Group Replication 5.7.14 yet – I need to play around with it a lot more. To make this process easier for myself, and hopefully more helpful to you, I’ve prepared Docker images for the testing of MySQL Group Replication.

Docker Images

To start the first node, run:

docker run -d --net=cluster1 --name=node1 perconalab/mysql-group-replication --group_replication_bootstrap_group=ON

To join all following nodes:

docker run -d --net=cluster1 --name=node2 perconalab/mysql-group-replication --group_replication_group_seeds='node1:6606'

Of course, you need to have Docker Network running:

docker network create cluster1

I hope this will make the testing process easier!

Percona XtraDB Cluster on Ceph

August 4, 2016 - 3:31pm

This post discusses how XtraDB Cluster and Ceph are a good match, and how their combination allows for faster SST and a smaller disk footprint.

My last post was an introduction to Red Hat’s Ceph. As interesting and useful as it was, it wasn’t a practical example. Like most of the readers, I learn about and see the possibilities of technologies by burning my fingers on them. This post dives into a real and novel Ceph use case: handling of the Percona XtraDB Cluster SST operation using Ceph snapshots.

If you are familiar with Percona XtraDB Cluster, you know that a full state snapshot transfer (SST) is required to provision a new cluster node. Similarly, SST can also be triggered when a cluster node happens to have a corrupted dataset. Those SST operations consist essentially of a full copy of the dataset sent over the network. The most common SST methods are Xtrabackup and rsync. Both of these methods imply a significant impact and load on the donor while the SST operation is in progress.

For example, the whole dataset will need to be read from the storage and sent over the network, an operation that requires a lot of IO operations and CPU time. Furthermore, with the rsync SST method, the donor is under a read lock for the whole duration of the SST. Consequently, it can take no write operations. Such constraints on SST operations are often the main motivations beyond the reluctance of using Percona XtraDB cluster with large datasets.

So, what could we do to speed up SST? In this post, I will describe a method of performing SST operations when the data is not local to the nodes. You could easily modify the solution I am proposing for any non-local data source technology that supports snapshots/clones, and has an accessible management API. Off the top of my head (other than Ceph) I see AWS EBS and many SAN-based storage solutions as good fits.

The challenges of clone-based SST

If we could use snapshots and clones, what would be the logical steps for an SST? Let’s have a look at the following list:

  1. New node starts (joiner) and unmounts its current MySQL datadir
  2. The joiner and asks for an SST
  3. The donor creates a consistent snapshot of its MySQL datadir with the Galera position
  4. The donor sends to the joiner the name of the snapshot to use
  5. The joiner creates a clone of the snapshot name provided by the donor
  6. The joiner mounts the snapshot clone as the MySQL datadir and adjusts ownership
  7. The joiner initializes MySQL on the mounted clone

As we can see, all these steps are fairly simple, but hide some challenges for an SST method base on cloning. The first challenge is the need to mount the snapshot clone. Mounting a block device requires root privileges – and SST scripts normally run under the MySQL user. The second challenge I encountered wasn’t expected. MySQL opens the datadir and some files in it before the SST happens. Consequently, those files are then kept opened in the underlying mount point, a situation that is far from ideal. Fortunately, there are solutions to both of these challenges as we will see below.

SST script

So, let’s start with the SST script. The script is available in my Github at:

https://github.com/y-trudeau/ceph-related-tools/raw/master/wsrep-sst/wsrep_sst_ceph

You should install the script in the /usr/bin directory, along with the other user scripts. Once installed, I recommend:

chown root.root /usr/bin/wsrep_sst_ceph chmod 755 /usr/bin/wsrep_sst_ceph

The script has a few parameters that can be defined in the [sst] section of the my.cnf file.

cephlocalpool
The Ceph pool where this node should create the clone. It can be a different pool from the one of the original dataset. For example, it could have a replication factor of 1 (no replication) for a read scaling node. The default value is: mysqlpool
cephmountpoint
What mount point to use. It defaults to the MySQL datadir as provided to the SST script.
cephmountoptions
The options used to mount the filesystem. The default value is: rw,noatime
cephkeyring
The Ceph keyring file to authenticate against the Ceph cluster with cephx. The user under which MySQL is running must be able to read the file. The default value is: /etc/ceph/ceph.client.admin.keyring
cephcleanup
Whether or not the script should cleanup the snapshots and clones that are no longer is used. Enable = 1, Disable = 0. The default value is: 0
Root privileges

In order to allow the SST script to perform privileged operations, I added an extra SST role: “mount”. The SST script on the joiner will call itself back with sudo and will pass “mount” for the role parameter. To allow the elevation of privileges, the follow line must be added to the /etc/sudoers file:

mysql ALL=NOPASSWD: /usr/bin/wsrep_sst_ceph

Files opened by MySQL before the SST

Upon startup, MySQL opens files at two places in the code before the SST completes. The first one is in the function mysqld_main , which sets the current working directory to the datadir (an empty directory at that point).  After the SST, a block device is mounted on the datadir. The issue is that MySQL tries to find the files in the empty mount point directory. I wrote a simple patch, presented below, and issued a pull request:

diff --git a/sql/mysqld.cc b/sql/mysqld.cc index 90760ba..bd9fa38 100644 --- a/sql/mysqld.cc +++ b/sql/mysqld.cc @@ -5362,6 +5362,13 @@ a file name for --log-bin-index option", opt_binlog_index_name); } } } + + /* + * Forcing a new setwd in case the SST mounted the datadir + */ + if (my_setwd(mysql_real_data_home,MYF(MY_WME)) && !opt_help) + unireg_abort(1); /* purecov: inspected */ + if (opt_bin_log) { /*

With this patch, I added a new my_setwd call right after the SST completed. The Percona engineering team approved the patch, and it should be added to the upcoming release of Percona XtraDB Cluster.

The Galera library is the other source of opened files before the SST. Here, the fix is just in the configuration. You must define the base_dir Galera provider option outside of the datadir. For example, if you use /var/lib/mysql as datadir and cephmountpoint, then you should use:

wsrep_provider_options="base_dir=/var/lib/galera"

Of course, if you have other provider options, don’t forget to add them there.

Walkthrough

So, what are the steps required to use Ceph with Percona XtraDB Cluster? (I assume that you have a working Ceph cluster.)

1. Join the Ceph cluster

The first thing you need is a working Ceph cluster with the needed CephX credentials. While the setup of a Ceph cluster is beyond the scope of this post, we will address it in a subsequent post. For now, we’ll focus on the client side.

You need to install the Ceph client packages on each node. On my test servers using Ubuntu 14.04, I did:

wget -q -O- 'https://download.ceph.com/keys/release.asc' | sudo apt-key add - sudo apt-add-repository 'deb http://download.ceph.com/debian-infernalis/ trusty main' apt-get update apt-get install ceph

These commands also installed all the dependencies. Next, I copied the Ceph cluster configuration file /etc/ceph/ceph.conf:

[global] fsid = 87671417-61e4-442b-8511-12659278700f mon_initial_members = odroid1, odroid2 mon_host = 10.2.2.100, 10.2.2.20, 10.2.2.21 auth_cluster_required = cephx auth_service_required = cephx auth_client_required = cephx filestore_xattr_use_omap = true osd_journal = /var/lib/ceph/osd/journal osd_journal_size = 128 osd_pool_default_size = 2

and the authentication file /etc/ceph/ceph.client.admin.keyring from another node. I made sure these files were readable by all. You can define more refined privileges for a production system with CephX, the security layer of Ceph.

Once everything is in place, you can test if it is working with this command:

root@PXC3:~# ceph -s cluster 87671417-61e4-442b-8511-12659278700f health HEALTH_OK monmap e2: 3 mons at {odroid1=10.2.2.20:6789/0,odroid2=10.2.2.21:6789/0,serveur-famille=10.2.2.100:6789/0} election epoch 474, quorum 0,1,2 odroid1,odroid2,serveur-famille mdsmap e204: 1/1/1 up {0=odroid3=up:active} osdmap e995: 4 osds: 4 up, 4 in pgmap v275501: 1352 pgs, 5 pools, 321 GB data, 165 kobjects 643 GB used, 6318 GB / 7334 GB avail 1352 active+clean client io 16491 B/s rd, 2425 B/s wr, 1 op/s

Which gives the current state of the Ceph cluster.

2. Create the Ceph pool

Before we can use Ceph, we need to create a first RBD image, put a filesystem on it and mount it for MySQL on the bootstrap node. We need at least one Ceph pool since the RBD images are stored in a Ceph pool.  We create a Ceph pool with the command:

ceph osd pool create mysqlpool 512 512 replicated

Here, we have defined the pool mysqlpool with 512 placement groups. On a larger Ceph cluster, you might need to use more placement groups (again, a topic beyond the scope of this post). The pool we just created is replicated. Each object in the pool will have two copies as defined by the osd_pool_default_size parameter in the ceph.conf file. If needed, you can modify the size of a pool and its replication factor at any moment after the pool is created.

3. Create the first RBD image

Now that we have a pool, we can create a first RBD image:

root@PXC1:~# rbd -p mysqlpool create PXC --size 10240 --image-format 2

and “map” the RBD image to a host block device:

root@PXC1:~# rbd -p mysqlpool map PXC /dev/rbd1

The commands return the local RBD block device that corresponds to the RBD image. The other steps are not specific to RBD images, we need to create a filesystem and prepare the mount points.

The rest of the steps are not specific to RBD images. We need to create a filesystem and prepare the mount points:

mkfs.xfs /dev/rbd1 mount /dev/rbd1 /var/lib/mysql -o rw,noatime,nouuid chown mysql.mysql /var/lib/mysql mysql_install_db --datadir=/var/lib/mysql --user=mysql mkdir /var/lib/galera chown mysql.mysql /var/lib/galera

You need to mount the RBD device and run the mysql_install_db tool only on the bootstrap node. You need to create the directories /var/lib/mysql and /var/lib/galera on the other nodes and adjust the permissions similarly.

4. Modify the my.cnf files

You will need to set or adjust the specific wsrep_sst_ceph settings in the my.cnf file of all the servers. Here are the relevant lines from the my.cnf file of one of my cluster node:

[mysqld] wsrep_provider=/usr/lib/libgalera_smm.so wsrep_provider_options="base_dir=/var/lib/galera" wsrep_cluster_address=gcomm://10.0.5.120,10.0.5.47,10.0.5.48 wsrep_node_address=10.0.5.48 wsrep_sst_method=ceph wsrep_cluster_name=ceph_cluster [sst] cephlocalpool=mysqlpool cephmountoptions=rw,noatime,nodiratime,nouuid cephkeyring=/etc/ceph/ceph.client.admin.keyring cephcleanup=1

At this point, we can bootstrap the cluster on the node where we mounted the initial RBD image:

/etc/init.d/mysql bootstrap-pxc

5. Start the other XtraDB Cluster nodes

The first node does not perform an SST, so nothing exciting so far. With the patched version of MySQL (the above patch), starting MySQL on a second node triggers a Ceph SST operation. In my test environment, the SST take about five seconds to complete on low-powered VMs. Interestingly, the duration is not directly related to the dataset size. Because of this, a much larger dataset, on a quiet database, should take about the exact same time. A very busy database may need more time, since an SST requires a “flush tables with read lock” at some point.

So, after their respective Ceph SST, the other two nodes have:

root@PXC2:~# mount | grep mysql /dev/rbd1 on /var/lib/mysql type xfs (rw,noatime,nodiratime,nouuid) root@PXC2:~# rbd showmapped id pool image snap device 1 mysqlpool PXC2-1463776424 - /dev/rbd1 root@PXC3:~# mount | grep mysql /dev/rbd1 on /var/lib/mysql type xfs (rw,noatime,nodiratime,nouuid) root@PXC3:~# rbd showmapped id pool image snap device 1 mysqlpool PXC3-1464118729 - /dev/rbd1

The original RBD image now has two snapshots that are mapped to the clones mounted by other two nodes:

root@PXC3:~# rbd -p mysqlpool ls PXC PXC2-1463776424 PXC3-1464118729 root@PXC3:~# rbd -p mysqlpool info PXC2-1463776424 rbd image 'PXC2-1463776424': size 10240 MB in 2560 objects order 22 (4096 kB objects) block_name_prefix: rbd_data.108b4246146651 format: 2 features: layering flags: parent: mysqlpool/PXC@1463776423 overlap: 10240 MB

Discussion

Apart from allowing faster SST, what other benefits do we get from using Ceph with Percona XtraDB Cluster?

The first benefit is the inherent data duplication over the network removes the need for local data replication. Thus, instead of using raid-10 or raid-5 with an array of disks, we could use a simple raid-0 stripe set if the data is already replicated to more than one server.

The second benefit is a bit less obvious: you don’t need as much storage. Why? A Ceph clone only stores the delta from its original snapshot. So, for large, read intensive datasets, the disk space savings can be very significant. Of course, over time, the clone will drift away from its parent snapshot and will use more and more space. When we determine that a Ceph clone uses too much disk space, we can simply refresh the clone by restarting MySQL and forcing a full SST. The SST script will automatically drop the old clone and snapshot when the cephcleanup option is set, and it will create a new fresh clone. You can easily evaluate how much space is consumed by the clone using the following commands:

root@PXC2:~# rbd -p mysqlpool du PXC2-1463776424 warning: fast-diff map is not enabled for PXC2-1463776424. operation may be slow. NAME PROVISIONED USED PXC2-1463776424 10240M 164M

Also, nothing prevents you using a different configuration of Ceph pools in the same XtraDB cluster. Therefore a Ceph clone can use a different pool than its parent snapshot. That’s the whole purpose of the cephlocalpool parameter. Strictly speaking, you only need one node to use a replicated pool, as the other nodes could run on clones that are stored data in a non-replicated pool (saving a lot of storage space). Furthermore, we can define the OSD affinity of the non-replicated pool in a way that it stores data on the host where it is used, reducing the cross node network latency.

Using Ceph for XtraDB Cluster SST operation demonstrates one of the array of possibilities offered to MySQL by Ceph. We continue to work with the Red Hat team and Red Hat Ceph Storage architects to find new and useful ways of addressing database issues in the Ceph environment. There are many more posts to come, so stay tuned!

DISCLAIMER: The wsrep_sst_ceph script isn’t officially supported by Percona.

Testing Docker multi-host network performance

August 3, 2016 - 12:26pm

In this post, I’ll review Docker multi-host network performance.

In a past post, I tested Docker network. The MySQL Server team provided their own results, which are in line with my observations.

For this set of tests, I wanted to focus more on Docker networking using multiple hosts. Mostly because when we set up a high availability (HA) environment (using Percona XtraDB Cluster, for example) the expectation is that instances are running on different hosts.

Another reason for this test is that Docker recently announced the 1.12 release, which supports Swarm Mode. Swarm Mode is quite interesting by itself — with this release, Docker targets going deeper on Orchestration deployments in order to compete with Kubernetes and Apache Mesos. I would say Swarm Mode is still rough around the edges (expected for a first release), but I am sure Docker will polish this feature in the next few releases. Swarm Mode also expects that you run services on different physical hosts, and services

Swarm Mode also expects that you run services on different physical hosts, and services are communicated over Docker network. I wanted to see how much of a performance hit we get when we run over Docker network on multiple hosts.

Network performance is especially important for clustering setups like Percona XtraDB Cluster and  MySQL Group Replication (which just put out another Lab release).

For my setup, I used two physical servers connected over a 10GB network. Both servers use 56 cores total of Intel CPUs.

Sysbench setup: data fits into memory, and I will only use primary key lookups. Testing over the network gives the worst case scenario for network round trips, but it also gives a good visibility on performance impacts.

The following are options for Docker network:

  • No Docker containers (marked as “direct” in the following results)
  • Docker container uses “host” network (marked as “host”)
  • Docker container uses “bridge” network, where service port exposed via port forwarding (marked as “bridge”)
  • Docker container uses “overlay” network, both client and server are started in containers connected via overlay network (marked as “overlay” in the results). For “overlay” network it is possible to use third-party plugins, with different implementation of the network, the most known are:

For multi-host networking setup, only “overlay” (and plugins implementations) are feasible. I used “direct”, “host” and “bridge” only for the reference and as a comparison to measure the overhead of overlay implementations.

The results I observed are:

Client Server Throughput, tps Ratio to “direct-direct” Direct Direct 282780 1.0 Direct Host 280622 0.99 Direct Bridge 250104 0.88 Bridge Bridge 235052 0.83 overlay overlay 120503 0.43 Calico overlay Calico overlay 246202 0.87 Weave overlay Weave overlay 11554 0.044

 

Observations
  • “Bridge” network added overhead, about 12%, which is in line with my previous benchmark. I wonder, however, if this is Docker overhead or just the Linux implementation of bridge networks. Docker should be using the setup that I described in Running Percona XtraDB Cluster nodes with Linux Network namespaces on the same host, and I suspect that the Linux network namespaces and bridges add overhead. I need to do more testing to verify.
  • Native “Overlay” Docker network struggled from performance problems. I observed issues with ksoftirq using 100% of one CPU core, and I see similar reports. It seems that network interruptions in Docker “overlay” are not distributed properly across multiple CPUs. This is not the case with the “direct” and “bridge” configuration. I believe this is a problem with the Docker “overlay” network (hopefully, it will eventually be fixed).
  • Weave network showed absolutely terrible results. I see a lot of CPU allocated to “weave” containers, so I think there are serious scalability issues in their implementation.
  • Calico plugin showed the best result for multi-host containers, even better than “bridge-bridge” network setup
Conclusion

If you need to use Docker “overlay” network — which is a requirement if you are looking to deploy a multi-host environment or use Docker Swarm mode — I recommend you consider using the Calico network plugin for Docker. Native Docker “overlay” network can be used for prototype or quick testing cases, but at this moment it shows performance problems on high-end hardware.

 

Take Percona’s One-Click Database Security Downtime Poll

August 2, 2016 - 3:12pm

Take Percona’s database security downtime poll.

As Peter Zaitsev mentioned recently in his blog post on database support, the data breach costs can hit both your business reputation and your bottom line. Costs vary depending on the company size and market, but recent studies estimate direct costs ranging in average from $1.6M to 7.01M. Everyone agrees leaving rising security risks and costs unchecked is a recipe for disaster.

Reducing security-based outages doesn’t have a simple answer, but can be a combination of internal and external monitoring, support contracts, enhanced security systems, and a better understanding of security configuration settings.

Please take a few seconds and answer the following poll. It will help the community get an idea of how security breaches can impact their critical database environments.

If you’ve faced  specific issues, feel free to comment below. We’ll post a follow-up blog with the results!

Note: There is a poll embedded within this post, please visit the site to participate in this post's poll.

You can see the results of our last blog poll on high availability here.

High Availability Poll Results

August 2, 2016 - 3:11pm

This blog reports the results of Percona’s high availability poll.

High availability (HA) is always a hot topic. The reality is that if your data is not available, your customers cannot do business with you. In fact, estimates show the average cost of downtime is about $5K per minute. With an average outage taking 40 minutes to correct, you could be looking at a potential cost of $200K if your MySQL instance goes down. Whether your database is on premise, or in public or private clouds, it is critical that your database deployment does not have a potentially devastating single point of failure.

The results from Percona’s high availability poll responses are in:

Note: There is a poll embedded within this post, please visit the site to participate in this post's poll.

With over 700 participants, MySQL replication was the clear frontrunner when it comes to high availability solutions.

Percona has HA solutions available, come find out more at our website.

If you’re using other solutions or have specific issues, feel free to comment below.

Check out the latest Percona one-click poll on database security here.

Introduction into storage engine troubleshooting: Q & A

August 1, 2016 - 3:43pm

In this blog, I will provide answers to the Q & A for the “Introduction into storage engine troubleshooting” webinar.

First, I want to thank everybody for attending the July 14 webinar. The recording and slides for the webinar are available here. Below is the list of your questions that I wasn’t able to answer during the webinar, with responses:

Q: At which isolation level do pt-online-schema-change and pt-archive  copy data from a table?

A: Both tools do not change the server’s default transaction isolation level. Use either REPEATABLE READ or set it in my.cnf.

Q: Can I create an index to optimize a query which has group by A and order by B, both from different tables and A column is from the first table in the two table join?

A: Do you mean a query like SELECT ... FROM a, b GROUP BY a.A ORDER BY b.B ? Yes, this is possible:

mysql> explain select A, B, count(*) from a join b on(a.A=b.id) WHERE b.B < 4 GROUP BY a.A, b.B ORDER BY b.B ASC; +----+-------------+-------+-------+---------------+------+---------+-----------+------+-----------------------------------------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+-------+-------+---------------+------+---------+-----------+------+-----------------------------------------------------------+ | 1 | SIMPLE | b | range | PRIMARY,B | B | 5 | NULL | 15 | Using where; Using index; Using temporary; Using filesort | | 1 | SIMPLE | a | ref | A | A | 5 | test.b.id | 1 | Using index | +----+-------------+-------+-------+---------------+------+---------+-----------+------+-----------------------------------------------------------+ 2 rows in set (0.00 sec)

Q: Where can I find recommendations on what kind of engine to use for different application types or use cases?

A: Storage engines are always being actively developed, therefore I suggest that you don’t search for generic recommendations. These can be outdated just a few weeks after they are written. Study engines instead. For example, just a few years ago MyISAM was the only engine (among those officially supported) that could work with FULLTEXT indexes and SPATIAL columns. Now InnoDB supports both: FULLTEXT indexes since version 5.6 and GIS features in 5.7. Today I can recommend InnoDB as a general-purpose engine for all installations, and TokuDB for write-heavy workloads when you cannot use high-speed disks.

Alternative storage engines can help to realize specific business needs. For example, CONNECT brings data to your server from many sources, SphinxSE talks to the Sphinx daemon, etc.

Other alternative storage engines increase the speed of certain workloads. Memory, for example, can be a good fit for temporary tables.

Q: Can you please explain how we find the full text of the query when we query the view ‘statements_with_full_table_Scans’?

A: Do you mean view in sys schema? Sys schema views take information from summary_* and digests it in Performance Schema, therefore it does not contain full queries (only digests). Full text of the query can be found in the events_statements_*  tables in the Performance Schema. Note that even the events_statements_history_long  table can be rewritten very quickly, and you may want to save data from it periodically.

Q: Hi is TokuDB for the new document protocol?

A: As Alex Rubin showed in his detailed blog post, the new document protocol just converts NoSQL queries into SQL, and is thus not limited to any storage engine. To use documents and collections, a storage engine must support generated columns (which TokuDB currently does not). So support of X Protocol for TokuDB is limited to relational tables access.

Q: Please comment on “read committed” versus “repeatable read.”
Q: Repeatable read holds the cursor on the result set for the client versus read committed where the cursor is updated after a transaction.

A: READ COMMITTED and REPEATABLE READ are transaction isolation levels, whose details are explained here.
I would not correlate locks set on table rows in different transaction isolation modes with the result set. A transaction with isolation level REPEATABLE READ  instead creates a snapshot of rows that are accessed by the transaction. Let’s consider a table:

mysql> create table ti(id int not null primary key, f1 int) engine=innodb; Query OK, 0 rows affected (0.56 sec) mysql> insert into ti values(1,1), (2,2), (3,3), (4,4), (5,5), (6,6), (7,7), (8,8), (9,9); Query OK, 9 rows affected (0.03 sec) Records: 9 Duplicates: 0 Warnings: 0

Then start the transaction and select a few rows from this table:

mysql1> begin; Query OK, 0 rows affected (0.00 sec) mysql1> select * from ti where id < 5; +----+------+ | id | f1 | +----+------+ | 1 | 1 | | 2 | 2 | | 3 | 3 | | 4 | 4 | +----+------+ 4 rows in set (0.04 sec)

Now let’s update another set of rows in another transaction:

mysql2> update ti set f1 = id*2 where id > 5; Query OK, 4 rows affected (0.06 sec) Rows matched: 4 Changed: 4 Warnings: 0 mysql2> select * from ti; +----+------+ | id | f1 | +----+------+ | 1 | 1 | | 2 | 2 | | 3 | 3 | | 4 | 4 | | 5 | 5 | | 6 | 12 | | 7 | 14 | | 8 | 16 | | 9 | 18 | +----+------+ 9 rows in set (0.00 sec)

You see that the first four rows – which we accessed in the first transaction – were not modified, and last four were modified. If InnoDB only saved the cursor (as someone answered above) we would expect to see the same result if we ran SELECT * ...  query in our old transaction, but it actually shows whole table content before modification:

mysql1> select * from ti; +----+------+ | id | f1 | +----+------+ | 1 | 1 | | 2 | 2 | | 3 | 3 | | 4 | 4 | | 5 | 5 | | 6 | 6 | | 7 | 7 | | 8 | 8 | | 9 | 9 | +----+------+ 9 rows in set (0.00 sec)

So “snapshot”  is a better word than “cursor” for the result set. In the case of READ COMMITTED, the first transaction would see modified rows:

mysql1> drop table ti; Query OK, 0 rows affected (0.11 sec) mysql1> create table ti(id int not null primary key, f1 int) engine=innodb; Query OK, 0 rows affected (0.38 sec) mysql1> insert into ti values(1,1), (2,2), (3,3), (4,4), (5,5), (6,6), (7,7), (8,8), (9,9); Query OK, 9 rows affected (0.04 sec) Records: 9 Duplicates: 0 Warnings: 0 mysql1> set transaction isolation level read committed; Query OK, 0 rows affected (0.00 sec) mysql1> begin; Query OK, 0 rows affected (0.00 sec) mysql1> select * from ti where id < 5; +----+------+ | id | f1 | +----+------+ | 1 | 1 | | 2 | 2 | | 3 | 3 | | 4 | 4 | +----+------+ 4 rows in set (0.00 sec)

Let’s update all rows in the table this time:

mysql2> update ti set f1 = id*2; Query OK, 9 rows affected (0.04 sec) Rows matched: 9 Changed: 9 Warnings: 0

Now the first transaction sees both the modified rows with id >= 5 (not in the initial result set), but also the modified rows with id < 5 (which existed in the initial result set):

mysql1> select * from ti; +----+------+ | id | f1 | +----+------+ | 1 | 2 | | 2 | 4 | | 3 | 6 | | 4 | 8 | | 5 | 10 | | 6 | 12 | | 7 | 14 | | 8 | 16 | | 9 | 18 | +----+------+ 9 rows in set (0.00 sec)

MariaDB 10.2 CHECK and DEFAULT clauses

July 29, 2016 - 12:35pm

In this blog post, we’ll look at the MariaDB 10.2 CHECK and DEFAULT clauses.

MariaDB 10.2 includes some long-awaited features. In this blog, we are going to discuss the improvements to some table definitions: the DEFAULT clause and the CHECK constraints. These clauses describe columns default values and rules for data validation.

Note that MariaDB 10.2 is still in alpha stage. This article describes the current state of these features, which could change before MariaDB 10.2 becomes GA.

The DEFAULT clause

The DEFAULT clause has always been supported in MariaDB/MySQL, but traditionally it only accepted literal values (like “hello world” or “2”). MariaDB 10.2 removes this limitation, so DEFAULT can now accept most SQL expressions. For example:

  • fiscal_year SMALLINT DEFAULT (YEAR(NOW()))
  • valid_until DATE DEFAULT (NOW() + INTERVAL 1 YEAR)
  • owner VARCHAR(100) DEFAULT (USER())

Additionally, MariaDB 10.2 allows you to set a DEFAULT value for the TEXT and BLOB columns. This was not possible in previous versions. While this might look like a small detail, it can be hard to add a column to an existing table that is used by production applications, if it cannot have a default value.

The DEFAULT clause has some very reasonable limitations. For example, it cannot contain a subquery or a stored function. An apparently strange limitation is that we can mention another column in DEFAULT only if it comes first in the CREATE TABLE command.

Note that DEFAULT can make use of non-deterministic functions even if the binary log uses the STATEMENT format. In this case, default non-deterministic values will be logged in the ROW format.

CHECK constraints

CHECK constraints are SQL expressions that are checked when a row is inserted or updated. If this expression result is false (0, empty string, empty date) or NULL, the statement will fail with an error. The error message states which CHECK failed in a way that is quite easy to parse:

ERROR 4022 (23000): CONSTRAINT `consistent_dates` failed for `test`.`author`

Some example of CHECK constraints:

  • CONSTRAINT non_empty_name CHECK (CHAR_LENGTH(name) > 0)
  • CONSTRAINT consistent_dates CHECK (birth_date IS NULL OR death_date IS NULL OR birth_date < death_date)
  • CONSTRAINT past_date CHECK (birth_date < NOW())

A possible trick is checking that a column is different from its default value. This forces users to assign values explicitly.

CHECK constraints cannot be added or altered. It is only possible to drop them. This is an important limitation for production servers.

Another limitation is that CHECK metadata are not accessible via the INFORMATION_SCHEMA. The only way to find out if a table has CHECK clauses is parsing the output of SHOW CREATE TABLE.

The exact behavior of CHECK constraints in a replication environment depends on the master binary log format. If it is STATEMENT, the slaves will apply CHECK constraints to events received from the master. If it is ROW, only the master will need to apply constraints, because failed statements will not be replicated.

Thus, in all cases, we recommend having identical constraints on master and slaves, and only using deterministic constraints.

Performance

While I didn’t run a professional benchmark, I can say that both DEFAULT and CHECK clauses don’t have a noticeable impact on a simple test where we insert one million rows (on my local machine).

However, these clauses evaluate an SQL expression each time a row is inserted or updated. The overhead is at least equal to the SQL expression performance. If high-performing writes are important, you will probably not want to use complex data validation.

To check how fast an expression is, we can use the BENCHMARK() function:

MariaDB [(none)]> SELECT BENCHMARK(10000000, (555 / 100 * 20)); +---------------------------------------+ | BENCHMARK(10000000, (555 / 100 * 20)) | +---------------------------------------+ | 0 | +---------------------------------------+ 1 row in set (1.36 sec) MariaDB [(none)]> SELECT BENCHMARK(100000000, MD5('hello world')); +------------------------------------------+ | BENCHMARK(100000000, MD5('hello world')) | +------------------------------------------+ | 0 | +------------------------------------------+ 1 row in set (14.84 sec)

In this example, we executed the specified expressions ten million times. BENCHMARK() always returns 0, but what we want to check is the execution time. We can see for example that evaluating MD5(‘hello world’) takes less than 0.000002 seconds. In some cases, we may want to retry the same expressions with different parameters (longer strings, higher numbers, etc.) to check if the execution times varies.

Unfortunately, we don’t have a status variable which tells us how many times MariaDB evaluated CHECK clauses. If our workload performs many writes, that variable could help us to find out if CHECK constraints are slowing down inserts. Maybe the MariaDB team can take this as a suggestion for the future.

Percona Monitoring and Management 1.0.2 Beta

July 28, 2016 - 12:39pm

Percona announces the release of Percona Monitoring and Management 1.0.2 Beta on 28 July 2016.

Like prior versions, PMM is distributed through Docker Hub and is free to download. Full instructions for download and installation of the server and client are available in the documentation.

Notable changes to the tool include:

  • Upgraded to Grafana 3.1.0.
  • Upgraded to Prometheus 1.0.1.
  • Set default metrics retention to 30 days.
  • Eliminated port 9001. Now the container uses only one configurable port, 80 by default.
  • Eliminated the need to specify ADDRESS variable when creating Docker container.
  • Completely re-wrote pmm-admin with more functions.
  • Added ability to stop all services using the new pmm-admin.
  • Added support to name instances using the new pmm-admin.
  • Query Analytics Application updates:
    • Redesigned queries profile table
    • Redesigned metrics table
    • Redesigned instance settings page
    • Added sparkline charts
    • Added ability to show more than ten queries
  • Various updates for MongoDB dashboards.

The full release notes are available in the documentation. The documentation also includes details on installation and architecture.

A demonstration of the tool has been set up at pmmdemo.percona.com.

We have also implemented forums for the discussion of PMM.

Help us improve our software quality by reporting any bugs you encounter using our bug tracking system. As always, thanks for your continued support of Percona!

Some screen shots of the updates:

Note the new sparkline that shows the current load in context (so you know if the number is higher/normal/lower than normal), and the option to “Load next 10 queries” at the bottom of the listing.

Our admin tool was completely re-written with new functions:

pmm-admin –help output

 

pmm-admin list command output

 

pmm-admin check-network output, which provides information on the status of the client’s network connection to the server.

Upcoming Webinar August 2 10:00 am PDT: MySQL and Ceph

July 28, 2016 - 11:27am

Join Brent Compton, Kyle Bader and Yves Trudeau on August 2, 2016 at 10 am PDT (UTC-7) for a MySQL and Ceph webinar.

Many operators select OpenStack as their control plane of choice for providing both internal and external IT services. The OpenStack user survey repeatedly shows Ceph as the dominant backend for providing persistent storage volumes through OpenStack Cinder. When building applications and repatriating old workloads, developers are discovering the need to provide OpenStack infrastructure database services. Given MySQL’s ubiquity, and it’s reliance on persistent storage, it is of utmost importance to understand how to achieve the performance demanded by today’s applications. Databases like MySQL can be incredibly IO intensive, and Ceph offers a great opportunity to go beyond the limitations presented by a single scale-up system. Since Ceph provides a mutable object store with atomic operations, could MySQL store InnoDB pages directly in Ceph?

This talk reviews the general architecture of Ceph, and then discusses benchmark results from small to mid-size Ceph clusters. These benchmarks lead to the development of prescriptive guidance around tuning Ceph storage nodes (OSDs), the impact the amount of physical memory, and the presence of SSDs, high-speed networks or RAID controllers.

Click here to register now. Speakers: Brent Compton
Director Storage Solution Architectures, Red Hat
Brent Compton is Director Storage Solution Architectures at Red Hat. He leads the team responsible for building Ceph and Gluster storage reference architectures with Red Hat Storage partners. Before Red Hat, Brent was responsible for emerging non-volatile memory software technologies at Fusion-io. Previous enterprise software leadership roles include VP Product Management at Micromuse (now IBM Tivoli Netcool) and Product Marketing Director within HP’s OpenView software division. Brent also served as Director Middleware Development Platforms at the LDS Church and as CIO at Joint Commission International. Brent has a tight-knit family, and can be found on skis or a mountain bike whenever possible. Kyle Bader
Sr Solution Architect, Red Hat
Kyle Bader, a Red Hat senior architect, provides expertise in the design and operation of petabyte-scale storage systems using Ceph. He joined Red Hat as part of the 2014 Inktank acquisition. As a senior systems engineer at DreamHost, he helped implement, operate, and design Ceph and OpenStack-based systems for DreamCompute and DreamObjects cloud products. Yves Trudeau
Principal Architect
Yves is a Principal Consultant at Percona, specializing in MySQL High-Availability and scaling solutions. Before joining Percona in 2009, he worked as a senior consultant for MySQL AB and Sun Microsystems, assisting customers across North America with NDB Cluster and Heartbeat/DRBD technologies. Yves holds a Ph.D. in Experimental Physics from Université de Sherbrooke. He lives in Québec, Canada with his wife and three daughters.

Percona Live Europe Amsterdam 2016 Tutorial Schedule is Up!

July 27, 2016 - 3:09pm

This blog post lists the Percona Live Europe Amsterdam 2016 tutorial schedule.

We are excited to announce that the tutorial schedule for the Percona Live Europe Amsterdam Open Source Database Conference 2016 is up!

The Percona Live Europe Amsterdam Open Source Database Conference is the premier event for the diverse and active open source community, as well as businesses that develop and use open source software. The conferences have a technical focus with an emphasis on the core topics of MySQL, MongoDB, and other open source databases. Tackling subjects such as analytics, architecture and design, security, operations, scalability and performance, Percona Live provides in-depth discussions for your high-availability, IoT, cloud, big data and other changing business needs.

Tackling subjects such as analytics, architecture and design, security, operations, scalability and performance, Percona Live Europe provides in-depth discussions for your high-availability, IoT, cloud, big data and other changing business needs. This conference is an opportunity to network with peers and technology professionals by bringing together accomplished DBA’s, system architects and developers from around the world to share their knowledge and experience – all to help you learn how to tackle your open source database challenges in a whole new way. These tutorials are a must for any data performance professional!

The Percona Live Europe Open Source Database Conference is October 3-5 at the Mövenpick Hotel Amsterdam City Centre.

Click through to the tutorial link right now, look them over, and pick which sessions you want to attend. Discounted passes available below!

Tutorial List: Early Bird Discounts

Just a reminder to everyone out there: our Early Bird discount rate for the Percona Live Europe Amsterdam Open Source Database Conference is only available ‘til August 8, 2016, 11:30 pm PST! This rate gets you all the excellent and amazing opportunities that Percona Live offers, at a very reasonable price!

Sponsor Percona Live

Become a conference sponsor! We have sponsorship opportunities available for this annual MySQL, MongoDB and open source database event. Sponsors become a part of a dynamic and growing ecosystem and interact with hundreds of DBAs, sysadmins, developers, CTOs, CEOs, business managers, technology evangelists, solutions vendors, and entrepreneurs who attend the event.

Monitoring MongoDB with Nagios

July 27, 2016 - 7:08am

In this blog, we’ll discuss monitoring MongoDB with Nagios.

There is a significant amount of talk around graphing MongoDB metrics using things like Prometheus, Data Dog, New Relic, and Ops Manager from MongoDB Inc. However, I haven’t noticed a lot of talk around “What MongoDB alerts should I be setting up?”

While building out Percona’s remote DBA service for MongoDB, I looked at Prometheus’s AlertManager. After reviewing it, I’m not sure it’s quite ready to be used exclusively. We needed to decide quickly if there are better Nagios checks on the market, or did I need to write my own?

In the end, we settled on a hybrid approach. There are some good frameworks, but we need to create or tweak some of the things needed for an “SEV 1-” or “SEV 2-” type issue (which are most important to me). One of the most common problems for operations, Ops, DevOps, DBA teams and most engineering is alert spam. As such I wanted to be very careful to only alert on the things pointing to immediate dangers or current outages. As a result, we have now added pmp-check-mongo.py to the GitHub for Percona Monitoring Plugins. Since we use Grafana and Prometheus for metrics and graphing, there are no accompanying Catci information templates. In the future, we’ll need to decide how this will change PMP overtime. In the meantime, we wanted to make the tool available now and worry about some of the issues later on.

As part of this push, I want to give you some real world examples of how you might use this tool. There are many options available to you, and Nagios is still a bit green in regards to making those options as user-friendly as our tools are.

Usage: pmp-check-mongo.py [options] Options: -h, --help show this help message and exit -H HOST, --host=HOST The hostname you want to connect to -P PORT, --port=PORT The port mongodb is running on -u USER, --user=USER The username you want to login as -p PASSWD, --password=PASSWD The password you want to use for that user -W WARNING, --warning=WARNING The warning threshold you want to set -C CRITICAL, --critical=CRITICAL The critical threshold you want to set -A ACTION, --action=ACTION The action you want to take. Valid choices are (check_connections, check_election, check_lock_pct, check_repl_lag, check_flushing, check_total_indexes, check_balance, check_queues, check_cannary_test, check_have_primary, check_oplog, check_index_ratio, check_connect) Default: check_connect -s SSL, --ssl=SSL Connect using SSL -r REPLICASET, --replicaset=REPLICASET Connect to replicaset -c COLLECTION, --collection=COLLECTION Specify the collection in check_cannary_test -d DATABASE, --database=DATABASE Specify the database in check_cannary_test -q QUERY, --query=QUERY Specify the query, only used in check_cannary_test --statusfile=STATUS_FILENAME File to current store state data in for delta checks --backup-statusfile=STATUS_FILENAME_BACKUP File to previous store state data in for delta checks --max-stale=MAX_STALE Age of status file to make new checks (seconds)

There seems to be a huge amount going on here, but let’s break it down into a few categories:

  • Connection options
  • Actions
  • Action options
  • Status options

Hopefully, this takes some of the scariness out of the script above.

Connection options
  • Host / Port Number
    • Pretty simple, this is just the host you want to connect to and what TCP port it is listening on.
  • Username and Password
    • Like with Host/Port, this is some of your normal and typical Mongo connection field options. If you do not set both the username and password, the system will assume auth was disabled.
  • SSL
    • This is mostly around the old SSL support in Mongo clients (which was a boolean). This tool needs updating to support the more modern SSL connection options. Use this as a “deprecated” feature that might not work on newer versions.
  • ReplicaSet
    • Very particular option that is only used for a few checks and verifies that the connection uses a replicaset connection. Using this option lets the tool automatically find a primary node for you, and is helpful to some checks specifically around replication and high availability (HA):
      • check_election
      • check_repl_lag
      • check_cannary_test
      • chech_have_primary
      • check_oplog
Actions and what they mean
  • check_connections
    • This parameter refers to memory usage, but beyond that you need to know if your typical connections suddenly double. This indicates something unexpected happened in the application or database and caused everything to reconnect. It often takes up to 10 minutes for those old connections to go away.
  • check_election
    • This uses the status file options we will cover in a minute, but it checks to see if the primary from the last check differs from the current found primary. If so, it alerts. This check should only have a threshold of one before it alarms (as an alert means an HA event occurred).
  • check_lock_pct
    • MMAP only, this engine has a write lock on the whole collection/database depending on the version. This is a crucial metric to determine if MMAP writes are blocking reads, meaning you need to scale the DB layer in some way.
  • check_repl_lag
    • Checks the replication stream to understand how lagged a given node is the primary. To accomplish this, it uses a fake record in the test DB to cause a write. Without this, a read-only system would look lagged artificially as no new oplog entries get created.
  • check_flushing
    • A common issue with MongoDB is very long flush times, causing a system halt. This is a caused by your disk subsystem not keeping up, and then the DB having to wait on flushing to make sure writes get correctly journaled.
  • check_total_indexes
    • The more indexes you have, the more the planner has to work to determine which index is a good fit. This increases the risk that the recovery of a failure will take a long time. This is due to the way a restore builds indexes and how MongoDB can only make one index at a time.
  • check_balance
    • While MongoDB should keep things in balance across a cluster, many things can happen: jumbo chunks, a disabled balancer being, constantly attempting to move the same chunk but failing, and even adding/removing sharding. This alert is for these cases, as an imbalance means some records might get served faster than others. It is purely based on the chunk count that the MongoDB balancer is also based on, which is not necessarily the same as disk usage.
  • check_queues
    • No matter what engine you have selected, a backlog of sustained reads or writes indicates your DB layer is unable to keep up with demand. It is important in these cases to send an alert if the rate is maintained. You might notice this is also in our Prometheus exporter for graphics as both trending and alerting are necessary to watch in a MongoDB system.
  • check_cannary_test
    • This is a typical query for the database and then used to set critical/warning levels based on the latency of the returned query. While not as accurate as full synthetic transactions, queries through the application are good to measure response time expectations and SLAs.
  • check_have_primary
    • If we had an HA event but failed to get back up quickly, it’s important to know if a new primary is causing writes to error on the system. This check simply determines if the replica set has a primary, which means it can handle reads and writes.
  • check_oplog
    • This check is all about how much oplog history you have. This is much like measuring how much history you have in MySQL blogs. The reason this is important is when recovering from a backup and performing a point in time recovery, you can use the current oplog if the oldest timestamp in the oplog is newer than the backup timestamp. As a result, this is normal three times the backup interval you use to guarantee that you have plenty of time to find the newest recovery and then do the recovery.
  • check_index_ratio
    • This is an older metric that modern MongoDB versions will not find useful, but in the past, it was a good way to understand the percentage of queries not handled by an index.
  • check_connect
    • A very basic check to ensure it can connect (and optionally login) to MongoDB and verify the server is working.
Status File options

These options rarely need to be changed but are present in case you want to store the status on an SHM mount point to avoid actual disk writes.

  • statusfile
    • This is where a copy of the current rs.status, serverStatus and other command data is stored
  • backup-statusfile
    • Like status_file, but status_file is moved here when a new check is done. These two objects can then be compared to find the delta between two checkpoints.
  • max-stale
    • This is the amount of age for which an old file is still valid. Deltas older then this aren’t allowed and exist to protect the system from will assumption when a statusfile is hours or days old.

If you have any questions on how to use these parameters, feel free to let us know. In the code, there is also a defaults dictionary for most of these options so that in many cases setting warning and critical level are not needed.



General Inquiries

For general inquiries, please send us your question and someone will contact you.