You are here

Latest MySQL Performance Blog posts

Subscribe to Latest MySQL Performance Blog posts feed
Percona's MySQL & InnoDB performance and scalability blog
Updated: 42 min 29 sec ago

OpenStack: A MySQL DBA Perspective – Sept. 17 webinar

September 11, 2014 - 1:58pm

OpenStack: A MySQL DBA Perspective

I’ll have the pleasure to present, next Wednesday, September 17 at 10 a.m. PDT (1 p.m. EDT) a webinar titled “OpenStack: A MySQL DBA Perspective.” Everyone is invited.

The webinar will be divided into two parts. The first part will cover how MySQL can be used by the OpenStack infrastructure including the expected load, high-availability solutions and geo-DR.

The second part will focus on the use of MySQL within an OpenStack cloud. We’ll look into the various options that are available, the traditional ones and Trove. We’ll also discuss the block device options in regards with MySQL performance and, finally, we’ll review the high-availability implications of running MySQL in an OpenStack cloud.

Register here. I look forward to your questions, and if you have any related to OpenStack that I can help with in advance of the webinar please feel free to post those in the comments section below. I’ll write a followup post after the webinar to recap all related questions and answers. I’ll also provide the slides.

See you next Wednesday!

The post OpenStack: A MySQL DBA Perspective – Sept. 17 webinar appeared first on MySQL Performance Blog.

OpenStack users shed light on Percona XtraDB Cluster deadlock issues

September 11, 2014 - 8:27am

I was fortunate to attend an Ops discussion about databases at the OpenStack Summit Atlanta this past May as one of the panelists. The discussion was about deadlock issues OpenStack operators see with Percona XtraDB Cluster (of course this is applicable to any Galera-based solution). I asked to describe what they are seeing, and as it turned out, nova and neutron uses the SELECT … FOR UPDATE SQL construct quite heavily. This is a topic I thought was worth writing about.

Write set replication in a nutshell (with oversimplification)

Any node is writable, and replication happens in write sets. A write set is practically a row based binary log event or events and “some additional stuff.” The “some additional stuff” is good for 2 things.

  • Two write sets can be compared and told if they are conflicting or not.
  • A write set can be checked against a database if it’s applicable.

Before committing on the originating node, the write set is transferred to all other nodes in the cluster. The originating node checks that the transaction is not conflicting with any of the transactions in the receive queue and checks if it’s applicable to the database. This process is called certification. After the write set is certified the transaction is committed. The remote nodes will do certification asynchronously compared to the local node. Since the certification is deterministic, they will get the same result. Also the write set on the remote nodes can be applied later because of this reason. This kind of replication is called virtually synchronous, which means that the data transfer is synchronous, but the actual apply is not.

We have a nice flowchat about this.

Since the write set is only transferred before commit, InnoDB row level locks, which are held locally, are not held on remote nodes (if these were escalated, each row lock would take a network round trip to acquire). This also means that by default if multiple nodes are used, the ability to read your own writes is not guaranteed. In that case, a certified transaction, which is already committed on the originating node can still sit in the receive queue of the node the application is reading from, waiting to be applied.


The SELECT … FOR UPDATE construct reads the given records in InnoDB, and locks the rows that are read from the index the query used, not only the rows that it returns. Given how write set replication works, the row locks of SELECT … FOR UPDATE are not replicated.

Putting it together

Let’s create a test table.


And some records we can lock.

pxc1> insert into t values(); Query OK, 1 row affected (0.01 sec) pxc1> insert into t values(); Query OK, 1 row affected (0.01 sec) pxc1> insert into t values(); Query OK, 1 row affected (0.01 sec) pxc1> insert into t values(); Query OK, 1 row affected (0.00 sec) pxc1> insert into t values(); Query OK, 1 row affected (0.01 sec)

pxc1> select * from t; +----+---------------------+ | id | ts | +----+---------------------+ | 1 | 2014-06-26 21:37:01 | | 4 | 2014-06-26 21:37:02 | | 7 | 2014-06-26 21:37:02 | | 10 | 2014-06-26 21:37:03 | | 13 | 2014-06-26 21:37:03 | +----+---------------------+ 5 rows in set (0.00 sec)

On the first node, lock the record.

pxc1> start transaction; Query OK, 0 rows affected (0.00 sec) pxc1> select * from t where id=1 for update; +----+---------------------+ | id | ts | +----+---------------------+ | 1 | 2014-06-26 21:37:01 | +----+---------------------+ 1 row in set (0.00 sec)

On the second, update it with an autocommit transaction.

pxc2> update t set ts=now() where id=1; Query OK, 1 row affected (0.01 sec) Rows matched: 1 Changed: 1 Warnings: 0 pxc1> select * from t; ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction

Let’s examine what happened here. The local record lock held by the started transation on pxc1 didn’t play any part in replication or certification (replication happens at commit time, there was no commit there yet). Once the node received the write set from pxc2, that write set had a conflict with a transaction still in-flight locally. In this case, our transaction on pxc1 has to be rolled back. This is a type of conflict as well, but here the conflict is not caught on certification time. This is called a brute force abort. This happens when a transaction done by a slave thread conflict with a transaction that’s in-flight on the node. In this case the first commit wins (which is the already replicated one) and the original transaction is aborted. Jay Janssen discusses multi-node writing conflicts in detail in this post.

The same thing happens when 2 of the nodes are holding record locks via select for update. Whichever node commits first will win, the other transaction will hit the deadlock error and will be rolled back. The behavior is correct.

Here is the same SELECT … FOR UPDATE transaction overlapping on the 2 nodes.

pxc1> start transaction; Query OK, 0 rows affected (0.00 sec) pxc2> start transaction; Query OK, 0 rows affected (0.00 sec)

pxc1> select * from t where id=1 for update; +----+---------------------+ | id | ts | +----+---------------------+ | 1 | 2014-06-26 21:37:48 | +----+---------------------+ 1 row in set (0.00 sec) pxc2> select * from t where id=1 for update; +----+---------------------+ | id | ts | +----+---------------------+ | 1 | 2014-06-26 21:37:48 | +----+---------------------+ 1 row in set (0.00 sec)

pxc1> update t set ts=now() where id=1; Query OK, 1 row affected (0.01 sec) Rows matched: 1 Changed: 1 Warnings: 0 pxc2> update t set ts=now() where id=1; Query OK, 1 row affected (0.00 sec) Rows matched: 1 Changed: 1 Warnings: 0

pxc1> commit; Query OK, 0 rows affected (0.00 sec) pxc2> commit; ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction

Where does this happen in OpenStack?

For example in OpenStack Nova (the compute project in OpenStack), tracking the quota usage uses the SELECT…FOR UPDATE construct.

# User@Host: nova[nova] @ [] Id: 147 # Schema: nova Last_errno: 0 Killed: 0 # Query_time: 0.001712 Lock_time: 0.000000 Rows_sent: 4 Rows_examined: 4 Rows_affected: 0 # Bytes_sent: 1461 Tmp_tables: 0 Tmp_disk_tables: 0 Tmp_table_sizes: 0 # InnoDB_trx_id: C698 # QC_Hit: No Full_scan: Yes Full_join: No Tmp_table: No Tmp_table_on_disk: No # Filesort: No Filesort_on_disk: No Merge_passes: 0 # InnoDB_IO_r_ops: 0 InnoDB_IO_r_bytes: 0 InnoDB_IO_r_wait: 0.000000 # InnoDB_rec_lock_wait: 0.000000 InnoDB_queue_wait: 0.000000 # InnoDB_pages_distinct: 2 SET timestamp=1409074305; SELECT quota_usages.created_at AS quota_usages_created_at, quota_usages.updated_at AS quota_usages_updated_at, quota_usages.deleted_at AS quota_usages_deleted_at, quota_usages.deleted AS quota_usages_deleted, quota_usages.id AS quota_usages_id, quota_usages.project_id AS quota_usages_project_id, quota_usages.user_id AS quota_usages_user_id, quota_usages.resource AS quota_usages_resource, quota_usages.in_use AS quota_usages_in_use, quota_usages.reserved AS quota_usages_reserved, quota_usages.until_refresh AS quota_usages_until_refresh FROM quota_usages WHERE quota_usages.deleted = 0 AND quota_usages.project_id = '12ce401aa7e14446a9f0c996240fd8cb' FOR UPDATE;

So where does it come from?

These constructs are generated by SQLAlchemy using with_lockmode(‘update’). Even in nova’s pydoc, it’s recommended to avoid with_lockmode(‘update’) whenever possible. Galera replication is not mentioned among the reasons to avoid this construct, but knowing how many OpenStack deployments are using Galera for high availability (either Percona XtraDB Cluster, MariaDB Galera Cluster, or Codership’s own mysql-wsrep), it can be a very good reason to avoid it. The solution proposed in the linked pydoc above is also a good one, using an INSERT INTO … ON DUPLICATE KEY UPDATE is a single atomic write, which will be replicated as expected, it will also keep correct track of quota usage.

The simplest way to overcome this issue from the operator’s point of view is to use only one writer node for these types of transactions. This usually involves configuration change at the load-balancer level. See this post for possible load-balancer configurations.

The post OpenStack users shed light on Percona XtraDB Cluster deadlock issues appeared first on MySQL Performance Blog.

Generating test data from the mysql> prompt

September 10, 2014 - 1:00am

There are a lot of tools that generate test data.  Many of them have complex XML scripts or GUI interfaces that let you identify characteristics about the data. For testing query performance and many other applications, however, a simple quick and dirty data generator which can be constructed at the MySQL command line is useful.

First, let’s talk about what kind of data you can easily create with MySQL function calls:

You can generate a decimal number between zero and another number using the MySQL RAND() function like the following query (here between 0 and 10000):

SELECT RAND() * 10000;

Similarly, you can generate a random integer by adding the FLOOR() function:


You can generate a random string of 32 characters using MD5():

SELECT MD5(RAND() * 10000)

You can return a random integer between 500 and 1000 with the following:

SELECT FLOOR( 500 + RAND() * (1000 - 500))

You can return a random string from a list of strings by using a table to hold the list. A subselect can select a random name from the list of names.

create table names(id int auto_increment primary key, name varchar(20)); insert into names (name) values ('Justin','Jerry','James','Josh','Julien'); select (select name from names where id = 1 + rand() * 4);

Now we can generate a “fact” table with many rows using fairly simple SQL statements.

First create a table to generate data into:

CREATE TABLE fact ( dim1 int, dim2 int, name varchar(20), hash varchar(32), measure1 double );

Seed the table with one initial row:

INSERT INTO fact VALUES (1,1,'Justin',md5(''), .1);

Now grow the table by selecting from the table but providing new random values for the inserted rows:

INSERT INTO fact SELECT FLOOR(1+ rand()*9999), FLOOR(1 + rand()*499), (select name from names where id = 1 + rand() * 4), MD5(1+rand()*9999), rand() FROM fact;

As you repeat the INSERT … SELECT, the table will grow exponentially. You may want to add a LIMIT clause to the INSERT … SELECT to reduce the amount of data generated as the table grows.

You will create a table with an even data distribution for each column. You can then add some queries to add skew, either using INSERT … SELECT or UPDATE, for example:

INSERT INTO fact SELECT 1,1,'Justin',md5(''), .1 FROM fact LIMIT 10000;

That will skew the values by creating many rows with the same data as our initial row.

Using these simple tools, you can generate a data set that is great for testing purposes. For example, dim1 might be a customer_id and dim2 a product_id, and you would populate those tables with 10000 and 500 rows, respectively.

The post Generating test data from the mysql> prompt appeared first on MySQL Performance Blog.

Multi-Valued INSERTs, AUTO_INCREMENT & Percona XtraDB Cluster

September 9, 2014 - 7:38am

A common migration path from standalone MySQL/Percona Server to a Percona XtraDB Cluster (PXC) environment involves some measure of time where one node in the new cluster has been configured as a slave of the production master that the cluster is slated to replace. In this way, the new cluster acts as a slave of the production environment – traditional replication takes care of getting the data into the cluster, and then Galera replication handles the intra-cluster traffic. This often works without issue, although there is one case that I’ve encountered recently where special care must be taken to properly configure the stream to ensure that replication does not break. If you use multi-valued inserts with auto-increment columns, then this post is for you.

For purposes of our discussion, assume that we have a basic 3-node PXC cluster that we’ve set up using the PXC Reference Architecture document, and that we’re replicating from an asynchronous master (call it “server A”) into one of the PXC nodes. Without loss of generality, we’ll pick PXC03. Also, for purposes of our discussion, we’ll be working with the following table definition:

serverA> show create table auto_inc_test; CREATE TABLE `auto_inc_test` ( `i` int(11) NOT NULL AUTO_INCREMENT, `stuff` varchar(20) DEFAULT NULL, PRIMARY KEY (`i`) ) ENGINE=InnoDB DEFAULT CHARSET=latin1 1 row in set (0.00 sec) serverA> SELECT * FROM auto_inc_test; Empty set (0.00 sec)

If we insert rows into this table one at a time, we have no issues.

serverA> INSERT INTO auto_inc_test(stuff) VALUES ('first row'); serverA> INSERT INTO auto_inc_test(stuff) VALUES ('second row'); serverA> INSERT INTO auto_inc_test(stuff) VALUES ('third row'); serverA> SELECT * FROM auto_inc_test; +---+------------+ | i | stuff      | +---+------------+ | 1 | first row  | | 2 | second row | | 3 | third row | +---+------------+ PXC03> SELECT * FROM auto_inc_test; +---+------------+ | i | stuff      | +---+------------+ | 1 | first row  | | 2 | second row | | 3 | third row | +---+------------+

But if we start doing multi-valued inserts, we can run into a problem.

serverA> INSERT INTO auto_inc_test(stuff) VALUES('first row'),('second row'),('third row'); serverA> INSERT INTO auto_inc_test(stuff) VALUES('fourth row'),('fifth row'); serverA> SELECT * FROM auto_inc_test; +---+------------+ | i | stuff      | +---+------------+ | 1 | first row  | | 2 | second row | | 3 | third row | | 4 | fourth row | | 5 | fifth row | +---+------------+ PXC03> SELECT * FROM auto_inc_test; +---+------------+ | i | stuff      | +---+------------+ | 1 | first row  | | 2 | second row | | 5 | third row | +---+------------+ PXC03> SHOW SLAVE STATUS; ... output elided ... Last_SQL_Errno: 1062 Last_SQL_Error: Error 'Duplicate entry '5' for key 'PRIMARY'' on query. Default database: 'test'. Query: 'INSERT INTO auto_inc_test (stuff) VALUES ('fourth row'),('fifth row')' ... output elided ...

Uh oh. Replication is broken and our data is now inconsistent. So why does this happen and how can we prevent it?


The astute observer will note that I have not yet said anything about the binary log format on the master. If the binary log format on the master is already set to ROW, then the above error will not occur. RBR will properly replicate multi-valued INSERTs to the PXC cluster without issue, and the data will be consistent. Problem solved. However, there may be reasons that the master is not using or cannot use RBR, such as disk space or IOPS limitations, and thus it’s running in MIXED or STATEMENT mode. In that case, we need to look elsewhere….


When set to ON (the default), this variable has the effect of automatically specifying values for auto_increment_increment and auto_increment_offset based on the cluster size. The idea behind it is to help prevent auto-increment value conflicts when writing to multiple nodes. However, what it also means is that in a multi-node cluster, the auto-increment values generated by any given node will never be consecutive and the “next” auto-increment value on the slave cluster node will always be higher than what the master believes it should be. For example:

serverA> INSERT INTO auto_inc_test (stuff) VALUES ('first row'),('second row'),('third row'); serverA> SHOW CREATE TABLE auto_inc_test; CREATE TABLE `auto_inc_test` ( `i` int(11) NOT NULL AUTO_INCREMENT,   `stuff` varchar(20) DEFAULT NULL,   PRIMARY KEY (`i`) ) ENGINE=InnoDB AUTO_INCREMENT=4 DEFAULT CHARSET=latin1 PXC03> SHOW CREATE TABLE auto_inc_test; CREATE TABLE `auto_inc_test` (   `i` int(11) NOT NULL AUTO_INCREMENT,   `stuff` varchar(20) DEFAULT NULL,   PRIMARY KEY (`i`) ) ENGINE=InnoDB AUTO_INCREMENT=6 DEFAULT CHARSET=latin1

Hmm, that’s not good.

The above scenario can be avoided by setting wsrep_auto_increment_control to OFF on the node that’s acting as a slave while the cluster is still receiving asynchronous replication traffic. This can be configured in /etc/my.cnf or set dynamically from the MySQL command line with:

SET GLOBAL wsrep_auto_increment_control='OFF';

In the testing that I’ve done, this appears to be sufficient to deal with the problem, even if the original master has non-standard values specified for the auto_increment_* variables (e.g., because it’s part of a master-master pair where the odd ID numbers are generated by one server and the even ID numbers from the other one).

In fact, if the cluster is always going to be used in single-writer mode, there’s a compelling argument to be made for setting this variable to OFF on all of the cluster nodes even when going into production: it will prevent the cluster from burning through N auto-increment IDs every time a single row is inserted.

Fixing it when it’s broken

There’s no secret magic here. Percona Toolkit’s pt-table-checksum and pt-table-sync can be used to check and repair the data divergence between the master and the slave cluster node, just as if PXC were not involved, although you may find that it’s just faster to rebuild the cluster, depending upon how many diffs are found. Sometimes this issue happens right away when the master-to-cluster replication is started, and fixing it involves just one or two rows; other times I have seen it not occur for days with lots of rows out of sync once it finally breaks.

The tl;dr version

In sum, if you’re doing traditional replication into a PXC cluster (or any other flavor of MySQL/MariaDB + Galera), you may have issues with multi-valued INSERT statements, and this can be prevented with a configuration change on either side of the replication stream:

  • On the master, set binlog_format=ROW
  • On the PXC slave node, set wsrep_auto_increment_control=OFF

The post Multi-Valued INSERTs, AUTO_INCREMENT & Percona XtraDB Cluster appeared first on MySQL Performance Blog.

How to calculate the correct size of Percona XtraDB Cluster’s gcache

September 8, 2014 - 7:12am

When a write query is sent to Percona XtraDB Cluster all the nodes store the writeset on a file called gcache. By default the name of that file is galera.cache and it is stored in the MySQL datadir. This is a very important file, and as usual with the most important variables in MySQL, the default value is not good for high-loaded servers. Let’s see why it’s important and how can we calculate a correct value for the workload of our cluster.

What’s the gcache?
When a node goes out of the cluster (crash or maintenance) it obviously stops receiving changes. When you try to reconnect the node to the cluster the data will be outdated. The joiner node needs to ask a donor to send the changes happened during the downtime.

The donor will first try to transfer an incremental (IST), that is, the writesets the cluster received while the node was down. The donor checks the last writeset received by the joiner and then checks local gcache file. If all needed writesets are on that cache the donor sends them to the joiner. The joiner applies them and that’s all, it is up to date and ready to join the cluster. Therefore, IST can only be achieved if all changes missed by the node that went away are still in that gcache file of the donor.

On the other hand, if the writesets are not there a full transfer would be needed (SST) using one of the supported methods, XtraBackup, Rsync or mysqldump.

In a summary, the difference between a IST and SST is the time that a node needs to join the cluster. The difference could be from seconds to hours. In case of WAN connections and large datasets maybe days.

That’s why having a correct gcache is important. It work as a circular log, so when it is full it starts to rewrite the writesets at the beginning. With a larger gcache a node can be out of the cluster more time without requiring a SST. My colleague Jay Janssen explains in more detail about how IST works and how to find the right server to use as donor.

Calculating the correct size
When trick is pretty similar to the one used to calculate the correct InnoDB log file size. We need to check how many bytes are written every minute. The variables to check are:

wsrep_replicated_bytes: Total size (in bytes) of writesets sent to other nodes.

wsrep_received_bytes: Total size (in bytes) of writesets received from other nodes.

mysql> show global status like 'wsrep_received_bytes'; show global status like 'wsrep_replicated_bytes'; select sleep(60); show global status like 'wsrep_received_bytes'; show global status like 'wsrep_replicated_bytes'; +----------------------+----------+ | Variable_name | Value | +----------------------+----------+ | wsrep_received_bytes | 83976571 | +----------------------+----------+ +------------------------+-------+ | Variable_name | Value | +------------------------+-------+ | wsrep_replicated_bytes | 0 | +------------------------+-------+ [...] +----------------------+----------+ | Variable_name | Value | +----------------------+----------+ | wsrep_received_bytes | 90576957 | +----------------------+----------+ +------------------------+-------+ | Variable_name | Value | +------------------------+-------+ | wsrep_replicated_bytes | 800 | +------------------------+-------+


Bytes per minute:

(second wsrep_received_bytes – first wsrep_received_bytes) + (second wsrep_replicated_bytes – first wsrep_replicated_bytes)

(90576957 – 83976571) + (800 – 0) = 6601186 bytes or 6 MB per minute.

Bytes per hour:

6MB * 60 minutes = 360 MB per hour of writesets received by the cluster.

If you want to allow one hour of maintenance (or downtime) of a node, you need to increase the gcache to that size. If you want more time, just make it bigger.

The post How to calculate the correct size of Percona XtraDB Cluster’s gcache appeared first on MySQL Performance Blog.

Percona XtraDB Cluster 5.5.39-25.11 is now available

September 5, 2014 - 7:49am

Percona is glad to announce the new release of Percona XtraDB Cluster 5.5 on September 5th 2014. Binaries are available from downloads area or from our software repositories.

Based on Percona Server 5.5.39-36.0 including all the bug fixes in it, Galera Replicator 2.11, and on Codership wsrep API 25.11, Percona XtraDB Cluster 5.5.39-25.11 is now the current 5.5 General Availability release. All of Percona‘s software is open-source and free, and all the details of the release can be found in the 5.5.39-25.11 milestone at Launchpad.

New Features:

  • New session variable wsrep_sync_wait has been implemented to control causality check. The old session variable wsrep_causal_reads is deprecated but is kept for backward compatibility (#1277053).
  • systemd integration with RHEL/CentOS 7 is now available for Percona XtraDB Cluster from our testing repository (#1342223).

Bugs Fixed:

  • Percona XtraDB Cluster has implemented threadpool scheduling fixes. Bug fixed #1333348.
  • When gmcast.listen_addr was configured to a certain address, local connection point for outgoing connections was not bound to listen address. This would happen if OS has multiple interfaces with IP addresses in the same subnet, it may happen that OS would pick wrong IP for local connection point and other nodes would see connections originating from IP address which was not listened to. Bug fixed #1240964.
  • Client connections were closed unconditionally before generating SST request. Fixed by avoiding closing connections when wsrep is initialized before storage engines. Bug fixed #1258658.
  • Issue with re-setting galera provider (in wsrep_provider_options) has been fixed. Bug fixed #1260283.
  • Variable wsrep_provider_options couldn’t be set in runtime if no provider was loaded. Bug fixed #1260290.
  • Node consistency issues with foreign keys have been fixed. This fix introduces two new variables: wsrep_slave_FK_checks and wsrep_slave_UK_checks. These variables are set to TRUE and FALSE respectively by default. They control whether Foreign Key and Unique Key checking is done for applier threads. Bug fixed #1260713.
  • When FLUSH TABLES WITH READ LOCK was used on a node with wsrep_causal_reads set to 1 while there was a DML on other nodes then, subsequent SELECTs/SHOW STATUS didn’t hang earlier providing non-causal output, that has been fixed here. Bug fixed #1271177.
  • Lowest group communication layer (evs) would fail to handle the situation properly when big number of nodes would suddenly start to see each other. Bugs fixed #1271918 and #1249805.
  • Updating a unique key value could cause server hang if slave node has enabled parallel slaves. Bug fixed #1280896.
  • Fixed the events replication inconsistencies. Bug fixed #1312618.
  • Truncating the sorted version of multi-byte character conversion could lead to wsrep certification failures. Bug fixed #1314854.
  • wsrep_slave_threads was counted towards max_connections which could cause ERROR 1040 (HY000): Too many connections error. Bug fixed #1315588.
  • Leaving node was not set nonoperational if processed leave message originated from different view than the current one, which could cause other nodes to crash. Bug fixed #1323412 (#41).
  • garbd couldn’t be started with init script on RHEL 6.5. Bug fixed #1323652.
  • SST would fail when binlogs were in dedicated directory that’s located inside datadir. This bug was a regression introduced by bug fix for #1273368. Bug fixed #1326012.
  • GTID of TOI operations is now also synced to InnoDB tablespace in order to get consistent backups. Bug fixed #1329055.
  • mysql-debug (UNIV_DEBUG) is now distributed with binary tar.gz along with RPM and DEB packages. Bug fixed #1332073.
  • The restart sequence in scripts/mysql.server would fail to capture and return if the start call failed to start the server, so a restart could occur that failed upon start-up, and the script would still return 0 as if it worked without any issues. Bug fixed #1339894.
  • wsrep consistency check is now enabled for REPLACE ... SELECT as well. This was implemented because pt-table-checksum uses REPLACE .. SELECT during checksumming. Bug fixed #1343209.
  • A memory leak in wsrep_mysql_parse function has been fixed. Bug fixed #1345023.
  • SHOW STATUS was generating debug output in the error log. Bug fixed #1347818.
  • percona-xtradb-cluster-garbd-3.x package was installed incorrectly on Debian/Ubuntu. Bugs fixed #1360633 and #1334530.

Release notes for Percona XtraDB Cluster 5.5.39-25.11 are available in our online documentation along with the installation instructions.

Help us improve our software quality by reporting any bugs you encounter using our bug tracking system. As always, thanks for your continued support of Percona!

The post Percona XtraDB Cluster 5.5.39-25.11 is now available appeared first on MySQL Performance Blog.

Using MySQL 5.6 Global Transaction IDs (GTIDs) in production: Q&A

September 5, 2014 - 3:00am

Thank you to all of you who attended my webinar last week about Global Transaction IDs (GTIDs), which were introduced in MySQL 5.6 to make the reconfiguration of replication straightforward. If you missed my webinar, you can still listen to the recording and download the sides (free). We had a lot of questions during the webinar, so let me try to answer them here. Please let me know in the comments if additional clarification is needed.

Q: Does GTID provide any benefit to master-master replication? If yes, how?

A: GTIDs don’t change the nature of MySQL replication: for instance it is still asynchronous and if you write on both masters in a master-master setup, there is still no write conflict detection mechanism. The main benefit of GTIDs is that any change of the replication topology is easy because you no longer need to run complex calculations to find the right binlog positions when connecting a slave to a new master.
So master-master replication can be configured with GTIDs, it does not provide a lot of benefits compared to position-based replication as you will never change the replication topology.
And having a setup where both masters receive writes is still not recommended with GTIDs.

Q: Will GTIDs work well with master:standby master? How quick would the failover be?
A: Yes, GTIDs works well with this kind of setup (which is one master and one slave). GTIDs do not provide failover, you will have to use an external tool. Speed of failover then depends on the tool you will use.

Q: For already set up MASTER-MASTER/MASTER-SLAVE Replication, after getting GTID set up, we need to rebuild replication again using AUTO POS=1, correct?
A: Yes, using MASTER_AUTO_POSITION=1 is necessary to indicate that you will use GTID replication. So you will have to run: STOP SLAVE; CHANGE MASTER TO … MASTER_AUTO_POSITION = 1; START SLAVE;

Q: Application having tables from different Engines(InnoDB and MyISAM), how that will handled in GTID?
A: Transactions using both MyISAM and InnoDB tables are not allowed, please refer to the documentation

Q: In a master-slave replication topology (with GTID enabled), how does slave get data from the master if the master’s binary logs are purged given that AUTO_POSITION=1 is used as part of the change master command?
A: This will break replication with error 1236.

Q: Whats the value of show slave status who determines if there is a lag on the slave?
A: This is Seconds_Behind_Master. It’s not always reliable though. For instance if you have a replication setup like A -> B -> C, Seconds_Behind_Master on C will shop the lag relatively to B, not A.

Q: What is the value of saving the history of previous master’s GTIDs executed in the show slave status -> Executed_Gtid_Set?
A: The new replication protocol makes sure that when the slave connects to its master, it sends the range of GTIDs it has already executed. Then the master sends back all other transactions. That’s why Executed_Gtid_Set contains the history of all executed transactions.

Q: We use DB Master and Slave VIPs on our servers, can the mysqlfailover tool also switch the VIP to the new master? Is it scriptable on the event of a failover?
A: Yes you can use extension points to add you own custom scripts with mysqlfailover. See the documentation for –exec-before and –exec-after.

Q: How does mysqlfailover handle brief network instability between the Master and Slaves?
A: mysqlfailover only triggers failover when it suspects the master is no longer alive. So network instability between the master and its slaves won’t affect it for master crash detection. However it can prevent the tool from reconfiguring replication correctly during failover/switchover if one or several slaves are not reachable.

Q: Does Facebook use MySQL with GTID? if yes, which module or all together for everything?
A: I can’t speak for Facebook, but this talk at the MySQL Conference this year suggests that they’re using GTIDs in production. They have added custom code to make GTIDs easier to use.

Q: is GTID_SUBSET function part of MySQL utilities? or we should set a script to regularly detect it? is GTID_SUBSET beneficial in case of an ACTIVE ACTIVE MASTER MASTER setup?
Q: Can you please confirm how to get gtid set in order to use gtid functions?

A: GTID_SUBSET() is a built-in function in MySQL 5.6, you don’t need to install MySQL Utilities to use it. It can be used to easily know whether Executed_Gtid_Set on a given server is a subset of Executed_Gtid_Set on another server, so it can be beneficial to use it in any replication topology.

Q: What is difference between HOLE and BUGS?
A: Holes are not allowed in MySQL 5.6 implementation of GTIDs. So if you see a hole in a GTID sequence, you’re hitting a bug!

Q: Using MySQL utilities, we can set-up replication also using a python script with just one command. Does it automatically takes dump from master to slave and starts replication? If yes so a 300 GB data directory, will it run as background if executed using shell script? Or it just starts the replicationn from current position and won’t take the dump?
A: I think you are talking about mysqlreplicate. This tool only runs CHANGE MASTER TO for you so it doesn’t take a backup of any kind.

Q: Is it possible to use mysqlfailover script at any node(like slave) any time to know which is its MASTER and other SLAVE options also? If no, is this available by some other means?
A: You should probably use mysqlrplshow instead.

Q: As told during limitations of MySQL Utilities on automatic failover, so how can I achieve AUTOMATIC failover if I want this as primary option?
A: The node running mysqlfailover is not highly available so if it is down you lose the ability of doing automatic failover. There are several options if you want to achieve automatic failover: carefully monitor the monitoring node or use solutions like Percona Replication Manager which relies on Pacemaker or Percona XtraDB Cluster which relies on Galera replication.

Q: mysqlrpadmin failover/switchover: Can we execute this command on slave or any other monitoring node?
A: Yes, as long as mysqlrpladmin is installed on a server and if it can connect to the database servers, the command can be executed from anywhere.

Q: If we set slaves to read_only…is that recommended to eliminate the errant transaction?
A: Yes, it is recommended. However it doesn’t prevent users with the SUPER privilege from accidentally writing on a slave.

Q: Can this errant transaction issue be prevented with (active-passive) master-master replication?
A: Master-master replication will make sure that any transaction written on one server will automatically end up being written on the other slave. So it is like avoiding errant transactions. However writing on both masters is not recommended as you can have write conflicts.

Q: Is there any practical advantage of using GTID with MHA?
A: GTIDs provide no way to perform failover, they only simplify how you can reconfigure replication. So using MHA to leverage GTIDs makes sense.

Q: What is diifference GUID and GTID? When do we have to use GUID and GTID?
A: I’m not sure I correctly understand the question. A GTID is made of a source id and a transaction id. The source id is the master’s server_uuid, which is a GUID that is automatically generated when MySQL is started for the first time.

Q: How to check the slave database tables and record? we have to sync data from master to salve database or automatically will be happen?
A: You can use pt-table-checksum and pt-table-sync from Percona Toolkit.

Thanks again for attending the webinar! You can replay it, download the slide – and also access Percona’s vast library of other MySQL webinar recordings here.

The post Using MySQL 5.6 Global Transaction IDs (GTIDs) in production: Q&A appeared first on MySQL Performance Blog.

MySQL 101: Monitor Disk I/O with pt-diskstats

September 4, 2014 - 12:00am

Here on the Percona Support team we often ask customers to retrieve disk stats to monitor disk IO and to measure block devices iops and latency. There are a number of tools available to monitor IO on Linux. iostat is one of the popular tools and Percona Toolkit, which is free, contains the pt-diskstats tool for this purpose. The pt-diskstats tool is similar to iostat but it’s more interactive and contains extended information. pt-diskstats reports current disk activity and shows the statistics for the last second (which by default is 1 second) and will continue until interrupted. The pt-diskstats tool collects samples of /proc/diskstats.

In this post, I will share some examples about how to monitor and check to see if the IO subsystem is performing properly or if any disks are a limiting factor – all this by using the pt-diskstats tool.

pt-diskstats output consists on number of columns and in order to interpret pt-diskstats output we need to know what each column represents.

  • rd_s tells about number of reads per second while wr_s represents number of writes per second.
  • rd_rt and wr_rt shows average response time in milliseconds for reads & writes respectively, which is similar to iostat tool output await column but pt-diskstats shows individual response time for reads and writes at disk level. Just a note, modern iostat splits read and write latency out, but most distros don’t have the latest iostat in their systat (or equivalent) package.
  • rd_mrg and wr_mrg are other two important columns in pt-diskstats output.  *_mrg is telling us how many of the original operations the IO elevator (disk scheduler) was able to merge to reduce IOPS, so *_mrg is telling us a quite important thing by letting us know that the IO scheduler was able to consolidate many or few operations. If rd_mrg/wr_mrg is high% then the IO workload is sequential on the other hand, If rd_mrg/wr_mrg is a low% then IO workload is all random. Binary logs, redo logs (aka ib_logfile*), undo log and doublewrite buffer all need sequential writes.
  • qtime and stime are last two columns in pt-diskstats output where qtime reflects to time spent in disk scheduler queue i.e. average queue time before sending it to physical device and on the other hand stime is average service time which is time accumulated to process the physical device request. Note, that qtime is not discriminated between reads and writes and you can check if response time is higher for qtime than it signal towards disk scheduler. Also note that service time (stime field and svctm field in in pt-diskstats & iostat output respectively) is not reliable on Linux. If you read the iostat manual you will see it is deprecated.

Along with that, there are many other parameters for pt-diskstats – you can found full documentation here. Below is an example of pt-disktats in action. I used the  –devices-regex option which prints only device information that matches this Perl regex.

$ pt-diskstats --devices-regex=sd --interval 5 #ts device rd_s rd_avkb rd_mb_s rd_mrg rd_cnc rd_rt wr_s wr_avkb wr_mb_s wr_mrg wr_cnc wr_rt busy in_prg io_s qtime stime 1.1 sda 21.6 22.8 0.5 45% 1.2 29.4 275.5 4.0 1.1 0% 40.0 145.1 65% 158 297.1 155.0 2.1 1.1 sdb 15.0 21.0 0.3 33% 0.1 5.2 0.0 0.0 0.0 0% 0.0 0.0 11% 1 15.0 0.5 4.7 1.1 sdc 5.6 10.0 0.1 0% 0.0 5.2 1.9 6.0 0.0 33% 0.0 2.0 3% 0 7.5 0.4 3.6 1.1 sdd 0.0 0.0 0.0 0% 0.0 0.0 0.0 0.0 0.0 0% 0.0 0.0 0% 0 0.0 0.0 0.0 5.0 sda 17.0 14.8 0.2 64% 3.1 66.7 404.9 4.6 1.8 14% 140.9 298.5 100% 111 421.9 277.6 1.9 5.0 sdb 14.0 19.9 0.3 48% 0.1 5.5 0.4 174.0 0.1 98% 0.0 0.0 11% 0 14.4 0.9 2.4 5.0 sdc 3.6 27.1 0.1 61% 0.0 3.5 2.8 5.7 0.0 30% 0.0 2.0 3% 0 6.4 0.7 2.4 5.0 sdd 0.0 0.0 0.0 0% 0.0 0.0 0.0 0.0 0.0 0% 0.0 0.0 0% 0 0.0 0.0 0.0

These are the stats from 7200 RPM SATA disks. As you can see, the write-response time is very high and most of that is made up of IO queue time. This shows the problem exactly. The problem is that the IO subsystem is not able to handle the write workload because the amount of writes that are being performed are way beyond what it can handle. It means the disks cannot service every request concurrently. The workload would actually depend a lot on where the hot data is stored and as we can see in this particular case the workload only hits a single disk out of the 4 disks. A single 7.2K RPM disk can only do about 100 random writes per second which is not a lot considering heavy workload.

It’s not particularly a hardware issue but a hardware capacity issue. The kind of workload that is present and the amount of writes that are performed per second are not something that the IO subsystem is able to handle in an efficient manner. Mostly writes are generated on this server as can be seen by the disk stats.

Let me show you a second example. Here you can see read latency. rd_rt is consistently between 10ms-30ms. It depends on how fast the disks are spinning and the number of disks. To deal with it possible solutions would be to optimize queries to avoid table scans, use memcached where possible, use SSD’s as it can provide good I/O performance with high concurrency. You will find this post useful on SSD’s from our CEO, Peter Zaitsev.

#ts device rd_s rd_avkb rd_mb_s rd_mrg rd_cnc rd_rt wr_s wr_avkb wr_mb_s wr_mrg wr_cnc wr_rt busy in_prg io_s qtime stime 1.0 sdb 33.0 29.1 0.9 0% 1.1 34.7 7.0 10.3 0.1 61% 0.0 0.4 99% 1 40.0 2.2 19.5 1.0 sdb1 0.0 0.0 0.0 0% 0.0 0.0 7.0 10.3 0.1 61% 0.0 0.4 1% 0 7.0 0.0 0.4 1.0 sdb2 33.0 29.1 0.9 0% 1.1 34.7 0.0 0.0 0.0 0% 0.0 0.0 99% 1 33.0 3.5 30.2 1.0 sdb 81.9 28.5 2.3 0% 1.1 14.0 0.0 0.0 0.0 0% 0.0 0.0 99% 1 81.9 2.0 12.0 1.0 sdb1 0.0 0.0 0.0 0% 0.0 0.0 0.0 0.0 0.0 0% 0.0 0.0 0% 0 0.0 0.0 0.0 1.0 sdb2 81.9 28.5 2.3 0% 1.1 14.0 0.0 0.0 0.0 0% 0.0 0.0 99% 1 81.9 2.0 12.0 1.0 sdb 50.0 25.7 1.3 0% 1.3 25.1 13.0 11.7 0.1 66% 0.0 0.7 99% 1 63.0 3.4 11.3 1.0 sdb1 25.0 21.3 0.5 0% 0.6 25.2 13.0 11.7 0.1 66% 0.0 0.7 46% 1 38.0 3.2 7.3 1.0 sdb2 25.0 30.1 0.7 0% 0.6 25.0 0.0 0.0 0.0 0% 0.0 0.0 56% 0 25.0 3.6 22.2

From the below diskstats output it seems that IO is saturated between both reads and writes. This can be noticed with high value for columns rd_s and wr_s. In this particular case, consider having disks in either RAID 5 (better for read only workload) or RAID 10 array is good option along with battery-backed write cache (BBWC) as single disk can really be bad for performance when you are IO bound.

device rd_s rd_avkb rd_mb_s rd_mrg rd_cnc rd_rt wr_s wr_avkb wr_mb_s wr_mrg wr_cnc wr_rt busy in_prg io_s qtime stime sdb1 362.0 27.4 9.7 0% 2.7 7.5 525.2 20.2 10.3 35% 6.4 8.0 100% 0 887.2 7.0 0.9 sdb1 439.9 26.5 11.4 0% 3.4 7.7 545.7 20.8 11.1 34% 9.8 11.9 100% 0 985.6 9.6 0.8 sdb1 576.6 26.5 14.9 0% 4.5 7.8 400.2 19.9 7.8 34% 6.7 10.9 100% 0 976.8 8.6 0.8 sdb1 410.8 24.2 9.7 0% 2.9 7.1 403.1 18.3 7.2 34% 10.8 17.7 100% 0 813.9 12.5 1.0 sdb1 378.4 24.6 9.1 0% 2.7 7.3 506.1 16.5 8.2 33% 5.7 7.6 100% 0 884.4 6.6 0.9 sdb1 572.8 26.1 14.6 0% 4.8 8.4 422.6 17.2 7.1 30% 1.7 2.8 100% 0 995.4 4.7 0.8 sdb1 429.2 23.0 9.6 0% 3.2 7.4 511.9 14.5 7.2 31% 1.2 1.7 100% 0 941.2 3.6 0.9

The following example reflects write heavy activity but write-response time is very good, under 1ms, which shows disks are healthy and capable of handling high number of IOPS.

#ts device rd_s rd_avkb rd_mb_s rd_mrg rd_cnc rd_rt wr_s wr_avkb wr_mb_s wr_mrg wr_cnc wr_rt busy in_prg io_s qtime stime 1.0 dm-0 530.8 16.0 8.3 0% 0.3 0.5 6124.0 5.1 30.7 0% 1.7 0.3 86% 2 6654.8 0.2 0.1 2.0 dm-0 633.1 16.1 10.0 0% 0.3 0.5 6173.0 6.1 36.6 0% 1.7 0.3 88% 1 6806.1 0.2 0.1 3.0 dm-0 731.8 16.0 11.5 0% 0.4 0.5 6064.2 5.8 34.1 0% 1.9 0.3 90% 2 6795.9 0.2 0.1 4.0 dm-0 711.1 16.0 11.1 0% 0.3 0.5 6448.5 5.4 34.3 0% 1.8 0.3 92% 2 7159.6 0.2 0.1 5.0 dm-0 700.1 16.0 10.9 0% 0.4 0.5 5689.4 5.8 32.2 0% 1.9 0.3 88% 0 6389.5 0.2 0.1 6.0 dm-0 774.1 16.0 12.1 0% 0.3 0.4 6409.5 5.5 34.2 0% 1.7 0.3 86% 0 7183.5 0.2 0.1 7.0 dm-0 849.6 16.0 13.3 0% 0.4 0.5 6151.2 5.4 32.3 0% 1.9 0.3 88% 3 7000.8 0.2 0.1 8.0 dm-0 664.2 16.0 10.4 0% 0.3 0.5 6349.2 5.7 35.1 0% 2.0 0.3 90% 2 7013.4 0.2 0.1 9.0 dm-0 951.0 16.0 14.9 0% 0.4 0.4 5807.0 5.3 29.9 0% 1.8 0.3 90% 3 6758.0 0.2 0.1 10.0 dm-0 742.0 16.0 11.6 0% 0.3 0.5 6461.1 5.1 32.2 0% 1.7 0.3 87% 1 7203.2 0.2 0.1

Let me show you a final example. I used –interval and –iterations parameters for pt-diskstats which tells us to wait for a number of seconds before printing the next disk stats and to limit the number of samples respectively. If you notice, you will see in 3rd iteration high latency (rd_rt, wr_rt) mostly for reads. Also, you can notice a high value for queue time (qtime) and service time (stime) where qtime is related to disk IO scheduler settings. For MySQL database servers we usually recommends noop/deadline instead of default cfq.

$ pt-diskstats --interval=20 --iterations=3 #ts device rd_s rd_avkb rd_mb_s rd_mrg rd_cnc rd_rt wr_s wr_avkb wr_mb_s wr_mrg wr_cnc wr_rt busy in_prg io_s qtime stime 10.4 hda 11.7 4.0 0.0 0% 0.0 1.1 40.7 11.7 0.5 26% 0.1 2.1 10% 0 52.5 0.4 1.5 10.4 hda2 0.0 0.0 0.0 0% 0.0 0.0 0.4 7.0 0.0 43% 0.0 0.1 0% 0 0.4 0.0 0.1 10.4 hda3 0.0 0.0 0.0 0% 0.0 0.0 0.4 107.0 0.0 96% 0.0 0.2 0% 0 0.4 0.0 0.2 10.4 hda5 0.0 0.0 0.0 0% 0.0 0.0 0.7 20.0 0.0 80% 0.0 0.3 0% 0 0.7 0.1 0.2 10.4 hda6 0.0 0.0 0.0 0% 0.0 0.0 0.1 4.0 0.0 0% 0.0 4.0 0% 0 0.1 0.0 4.0 10.4 hda9 11.7 4.0 0.0 0% 0.0 1.1 39.2 10.7 0.4 3% 0.1 2.7 9% 0 50.9 0.5 1.8 10.4 drbd1 11.7 4.0 0.0 0% 0.0 1.1 39.1 10.7 0.4 0% 0.1 2.8 9% 0 50.8 0.5 1.7 20.0 hda 14.6 4.0 0.1 0% 0.0 1.4 39.5 12.3 0.5 26% 0.3 6.4 18% 0 54.1 2.6 2.7 20.0 hda2 0.0 0.0 0.0 0% 0.0 0.0 0.4 9.1 0.0 56% 0.0 42.0 3% 0 0.4 0.0 42.0 20.0 hda3 0.0 0.0 0.0 0% 0.0 0.0 1.5 22.3 0.0 82% 0.0 1.5 0% 0 1.5 1.2 0.3 20.0 hda5 0.0 0.0 0.0 0% 0.0 0.0 1.1 18.9 0.0 79% 0.1 21.4 11% 0 1.1 0.1 21.3 20.0 hda6 0.0 0.0 0.0 0% 0.0 0.0 0.8 10.4 0.0 62% 0.0 1.5 0% 0 0.8 1.3 0.2 20.0 hda9 14.6 4.0 0.1 0% 0.0 1.4 35.8 11.7 0.4 3% 0.2 4.9 18% 0 50.4 0.5 3.5 20.0 drbd1 14.6 4.0 0.1 0% 0.0 1.4 36.4 11.6 0.4 0% 0.2 5.1 17% 0 51.0 0.5 3.4 20.0 hda 0.9 4.0 0.0 0% 0.2 251.9 28.8 61.8 1.7 92% 4.5 13.1 31% 2 29.6 12.8 0.9 20.0 hda2 0.0 0.0 0.0 0% 0.0 0.0 0.6 8.3 0.0 52% 0.1 98.2 6% 0 0.6 48.9 49.3 20.0 hda3 0.0 0.0 0.0 0% 0.0 0.0 2.0 23.2 0.0 83% 0.0 1.4 0% 0 2.0 1.2 0.3 20.0 hda5 0.0 0.0 0.0 0% 0.0 0.0 4.9 249.4 1.2 98% 4.0 13.2 9% 0 4.9 12.9 0.3 20.0 hda6 0.0 0.0 0.0 0% 0.0 0.0 0.0 0.0 0.0 0% 0.0 0.0 0% 0 0.0 0.0 0.0 20.0 hda9 0.9 4.0 0.0 0% 0.2 251.9 21.3 24.2 0.5 32% 0.4 12.9 31% 2 22.2 10.2 9.7 20.0 drbd1 0.9 4.0 0.0 0% 0.2 251.9 30.6 17.0 0.5 0% 0.7 24.1 30% 5 31.4 21.0 9.5

You can see the busy column in pt-diskstats output which is the same as the util column in iostat – which points to utilization. Actually, pt-diskstats is quite similar to the iostat tool but pt-diskstats is more interactive and has more information. The busy percentage is only telling us for how long the IO subsystem was busy, but is not indicating capacity. So the only time you care about %busy is when it’s 100% and at the same time latency (await in iostat and rd_rt/wr_rt in diskstats output) increases over -say- 5ms. You can estimate capacity of your IO subsystem and then look at the IOPS being consumed (r/s + w/s columns). Also, the system can process more than one request in parallel (in case of RAID) so %busy can go beyond 100% in pt-diskstats output.

If you need to check disk throughput, block device IOPS run the following to capture metrics from your IO subsystem and see if utilization matches other worrisome symptoms. I would suggest capturing disk stats during peak load. Output can be grouped by sample or by disk using the –group-by option. You can use the sysbench benchmark tool for this purpose to measure database server performance. You will find this link useful for sysbench tool details.

$ pt-diskstats --group-by=all --iterations=7200 > /tmp/pt-diskstats.out;


pt-diskstats is one of the finest tools from Percona Toolkit. By using this tool you can easily spot disk bottlenecks, measure the IO subsystem and identify how much IOPS your drive can handle (i.e. disk capacity).

The post MySQL 101: Monitor Disk I/O with pt-diskstats appeared first on MySQL Performance Blog.

Migrating to Percona XtraDB Cluster 2014 edition: Sept. 10 MySQL webinar

September 3, 2014 - 12:31pm

Join me online next week (September 10 at 10 a.m. PDT) for my live webinar on Migrating to Percona XtraDB Cluster.  This was a popular webinar that I gave a few years ago, so I’m doing it again with updates for Percona XtraDB Cluster 5.6 (PXC) and all the latest in the Galera world.

This webinar will be really good for people interested in getting an overview of what PXC/Galera is, what it would take to adopt it for your application, and some of the differences and challenges it brings compared with a conventional MySQL Master/slave setup.  I’d highly suggest attending if you are considering Galera in your environment and want to get a better understanding of its uses and antipatterns.

Additionally, I’ll cover such questions as:

  • What are the requirements for running Percona XtraDB Cluster?
  • Will I have to reload all my tables?
  • How does configuration for the cluster differ from configuring a stand-alone InnoDB server?
  • How should my application interact with the Cluster?
  • Can I use Percona XtraDB Cluster if I only have two MySQL servers currently?
  • How can I move to the Cluster and keep downtime to a minimum?
  • How can I migrate to Percona XtraDB Cluster gradually?

I hope to see you next Wednesday. And please feel free to ask questions in advance in the comments section below. Next week’s live event, like all of our MySQL webinars, is free. Register here!

The post Migrating to Percona XtraDB Cluster 2014 edition: Sept. 10 MySQL webinar appeared first on MySQL Performance Blog.

Reducer.sh – A powerful MySQL test-case simplification/reducer tool

September 3, 2014 - 12:00am

Let me start by saying a big “thank you” to the staff at Oracle for deciding to open source reducer.sh. It’s a tool I developed whilst I was working for them several years ago. Its sole purpose is to do one thing – but do it good: test-case simplification.

So, let’s say some customer just sent you 120,000 lines of SQL code and affirms that “it definitely causes a crash.” Or maybe you ran RQG (the Random Query Generator) for awhile (with the general query log turned on) and now you have a nice SQL trace which may just lead to that crash the run resulted in. Or you’re a DBA testing the company’s usual queries with Valgrind, and noticed that 2 in 1000 queries give a Valgrind warning in the mysqld error log – you’re just not sure which one. Or maybe you’re a developer, and during testing you saw that a SELECT query output did not look the way it should – the output was “7″ where it should have been “5″ – the only problem – you have 1000 lines of INSERT statements and are not sure which one caused it. In all of these cases reducer can help.

Here are some of it’s benefits/features:

  • It can reduce large amount of SQL fast. 40K lines to just a few can usually be done in around 1 hour.
    • Larger files scale even better – the chunking elimination method automatically adapt to file size.
  • It can reduce crashes/asserts, Valgrind testcases, mysqld error log messages, and mysql CLI output testcases
    • Also working (but with complex setup atm) is multi-threaded SQL test-case simplification (ALPHA)
  • It can reduce sporadic testcases for all of the above (i.e. testcases where the issue does not reproduce every time)
    • It can reduce sporadic testcases using multiple threads which significantly improves reduction time
  • It is aware of how to establish a testcase is sporadic (and will report the same) – or not – and will change it’s behavior
  • It is capable (turned on by default) of reducing actual DML/DDL query code after completing line-based reduction
  • It is capable (turned on by default) of reducing testcases by eliminating columns from tables and INSERT queries
  • By default reducer.sh uses tmpfs (highly recommended) to ensure testcases are “as reproducible as possible” (disk I/O)
  • Additional options for mysqld (necessary to reproduce an issue) can easily be listed/added
  • Regex syntax can be used in search strings (where applicable)

So, without further ado, let’s have a look at how to get it to do your simplification job

To get reducer.sh today, use these commands (yum example used, but this can easily be adapted to apt-get):

$ sudo yum install bzr $ cd ~ $ bzr branch lp:randgen $ cd randgen/util/reducer/ $ ls *

(You may also want to checkout ./status.sh in this directory which is a handy tool for seeing what reducer.sh is upto when it is doing it’s first/original attempt to reproduce a given issue.)

And you can get percona-qa (for parse_general_log.sh [and the prepare_reducer.sh code bit if you need it] as shown in the video):

$ cd ~ $ bzr branch lp:percona-qa $ cd perconq-qa $ ls *


The post Reducer.sh – A powerful MySQL test-case simplification/reducer tool appeared first on MySQL Performance Blog.

Using sysbench 0.5 for performing MySQL benchmarks

September 2, 2014 - 8:03am

Given the recent excitement & interest around OpenStack I wanted to make sure I was ready to conduct appropriate evaluations of system performance.  I generally turn to sysbench since it comes with a variety of different tests (accessed via –test= option interface), including:

  • fileio – File I/O test
  • cpu – CPU performance test
  • memory – Memory functions speed test
  • threads – Threads subsystem performance test
  • mutex – Mutex performance test

As you can see, sysbench lets you stress many of the fundamental components of your hardware and infrastructure, such as your disk subsystem, along with your CPUs and memory. An additional option exists that is designed to perform synthetic stress testing of MySQL, and I was surprised when I didn’t see it in the above list on version 0.5, as it used to show up as “oltp – OLTP test”. What happened to –test=oltp ??

This list is from the latest release of sysbench which is 0.5 — you’re only going to be on this version if you build it yourself or if you use the package provided courtesy of Frederic Descamps (thanks lefred!).  If you’re using the version from EPEL, Ubuntu 14.04, or Debian 7 you’re still using version 0.4.12 (check with sysbench –version).  One thing you’ll notice is that the test type of OLTP doesn’t show up anymore.  What gives?  I was scratching my head until I asked on Percona IRC and found out that in 0.5 the standard OLTP test type was replaced with a different syntax, that instead of passing parameters to sysbench you instead reference scripts written in lua.  The advantage here is that now you have an interface in order to write your own specific load tests (provided you know lua, but it isn’t hard).  For those of you looking to run the pre-canned load tests they still exist but you have to have them as part of the RPM install or otherwise copied to your system.

Fortunately if you use the package provided by lefred you’ll find these lua scripts here (this is using Amazon ami as of August 4th, 2014):

[root@pxc-control ~]# ls -l /usr/share/doc/sysbench/tests/db/ total 44 -rw-r--r-- 1 root root 3585 Sep 7 2012 common.lua -rw-r--r-- 1 root root 340 Sep 7 2012 delete.lua -rw-r--r-- 1 root root 830 Sep 7 2012 insert.lua -rw-r--r-- 1 root root 2925 Sep 7 2012 oltp.lua -rw-r--r-- 1 root root 342 Sep 7 2012 oltp_simple.lua -rw-r--r-- 1 root root 425 Sep 7 2012 parallel_prepare.lua -rw-r--r-- 1 root root 343 Sep 7 2012 select.lua -rw-r--r-- 1 root root 3964 Sep 7 2012 select_random_points.lua -rw-r--r-- 1 root root 4066 Sep 7 2012 select_random_ranges.lua -rw-r--r-- 1 root root 343 Sep 7 2012 update_index.lua -rw-r--r-- 1 root root 552 Sep 7 2012 update_non_index.lua

So the trick (if you want to call it that) is that instead of passing a single word to the –test directive, instead you pass the full path to the lua script.

This is the old way (sysbench 0.4.12 from EPEL repo):

--test=oltp --oltp-test-mode=complex

This is the new way (sysbench 0.5):


Here is an example of a test I’m running through haproxy against a 3-node PXC cluster doing the INSERT-only test type so you can see the full syntax I pass to sysbench:

[root@pxc-control ~]# cat sys_haproxy.sh #!/bin/bash sysbench --test=/usr/share/doc/sysbench/tests/db/insert.lua --mysql-host=pxc-control --mysql-port=9999 --mysql-user=sysbench-haproxy --mysql-password=sysbench-haproxy --mysql-db=sbtest --mysql-table-type=innodb --oltp-test-mode=complex --oltp-read-only=off --oltp-reconnect=on --oltp-table-size=1000000 --max-requests=100000000 --num-threads=3 --report-interval=1 --report-checkpoints=10 --tx-rate=24 $1

And here’s what the insert.lua script looks like:

[root@pxc-control ~]# cat /usr/share/doc/sysbench/tests/db/insert.lua pathtest = string.match(test, "(.*/)") or "" dofile(pathtest .. "common.lua") function thread_init(thread_id) set_vars() end function event(thread_id) local table_name local i local c_val local k_val local pad_val table_name = "sbtest".. sb_rand_uniform(1, oltp_tables_count) if (oltp_auto_inc) then i = 0 else i = sb_rand_uniq(1, oltp_table_size) end k_val = sb_rand(1, oltp_table_size) c_val = sb_rand_str([[ ###########-###########-###########-###########-###########-###########-###########-###########-###########-###########]]) pad_val = sb_rand_str([[ ###########-###########-###########-###########-###########]]) rs = db_query("INSERT INTO " .. table_name .. " (id, k, c, pad) VALUES " .. string.format("(%d, %d, '%s', '%s')",i, k_val, c_val, pad_val)) end

The thing that I like most about sysbench 0.5 (beyond the lua interface, of course!) is that it now comes with a –report-interval option (which I generally set as = 1) so that you get output while the script is running. No more waiting until the end of the test to get feedback! Here’s a sample of sysbench 0.5 in action running the INSERT test through a local haproxy instance and writing to three nodes in a PXC cluster such as OpenStack Trove might do:

[root@pxc-control ~]# ./sys_haproxy.sh run sysbench 0.5: multi-threaded system evaluation benchmark Running the test with following options: Number of threads: 3 Report intermediate results every 1 second(s) Random number generator seed is 0 and will be ignored Threads started! [ 1s] threads: 3, tps: 0.00, reads/s: 0.00, writes/s: 1099.28, response time: 9.86ms (95%) [ 2s] threads: 3, tps: 0.00, reads/s: 0.00, writes/s: 973.02, response time: 10.77ms (95%) [ 3s] threads: 3, tps: 0.00, reads/s: 0.00, writes/s: 1181.01, response time: 6.23ms (95%) [ 4s] threads: 3, tps: 0.00, reads/s: 0.00, writes/s: 1103.00, response time: 6.77ms (95%)

I would also like to call your attention to a blog post by Nilnandan Joshi from Percona’s Support team where he describes a method to build sysbench 0.5 on Debian 7.  Thanks Nil for pointing this out!

I hope that helps others out there who upgrade to sysbench 0.5 and then have questions about where –test=oltp went to. I’d love to hear your own sysbench use cases, and whether anyone else is publishing lua scripts for their own load testing!

The post Using sysbench 0.5 for performing MySQL benchmarks appeared first on MySQL Performance Blog.