Latest MySQL Performance Blog posts

You are here

Subscribe to Latest MySQL Performance Blog posts feed
Percona's MySQL & InnoDB performance and scalability blog
Updated: 1 hour 20 min ago

MariaDB: Selective binary logs events

July 31, 2014 - 2:29pm

In the first post in a series on MariaDB features we find interesting, we begin with selectively skipping replication of binlog events. This feature is available on MariaDB 5.5 and 10.

By default when using MySQL’s standard replication, all events are logged in the binary log and those binary log events are replicated to all slaves (it’s possible to filter out some schema). But with this feature, it’s also possible to bypass some events to be replicated on the slave(s) even if they are written in the binary log. Having those event in the binary logs is always useful for point-in-time recovery.

Indeed, usually when we need to not replicate an event, we set sql_log_bin = 0 and the event is bypassed: neither written into the binlog, neither replicated to slave(s).

So with this new feature, it’s possible to just set a session variable to tag events that will be written into the binary log and bypassed on demand on some slaves.

And it’s really easy to use, on the master you do:

set skip_replication=1;

and on the slave(s) having replicate_events_marked_for_skip='FILTER_ON_MASTER' or 'FILTER_ON_SLAVE' the events skipped on the master won’t be replicated.

The valid values for replicate_events_marked_for_skip are:

  • REPLICATE (default) : skipped events are replicated on the slave
  • FILTER_ON_SLAVE : events so marked will be skipped on the slave and not replicated
  • FILTER_ON_MASTER : the filtering will be done on the master so the slave won’t even receive it and then save network bandwidth

That’s a cool feature but when this can be very useful?

Use case:

For archiving this can be very interesting. Indeed most of the time when people is archiving data, they use something like pt-archiver that deletes the data and copy the removed data on an archive server.

Thanks to this feature, instead of having an archiving server where we copy the deleted data, it’s possible to have a slave where we won’t delete the data. This will be much faster (smarter?) and allows to have an archiving server always up to date. Of course in this case sql_log_bin = 0 would have worked (if we ignore the point-in-time recovery).
But with a Galera Cluster? Yes that’s where this feature is really cool, if we would have used sql_log_bin = 0 on a Galera Cluster node, all other nodes would have ignored the delete and the result would be inconsistency between the nodes.

So if you use an asynchronous slave as an archiving server of a Galera Cluster, this feature is really mandatory.

As illustrated below, you can have a MariaDB Galera Cluster node joining a Percona XtraDB Cluster that will be used to delete historical data using pt-archiver:

pt-archiver is started with --set-vars "skip_replication=1"

The post MariaDB: Selective binary logs events appeared first on MySQL Performance Blog.

Percona Server 5.1.73-14.12 is now available

July 31, 2014 - 6:52am

Percona Server version 5.1.73-14.12

Percona is glad to announce the release of Percona Server 5.1.73-14.12 on July 31st, 2014 (Downloads are available here and from the Percona Software Repositories). Based on MySQL 5.1.73, including all the bug fixes in it, Percona Server 5.1.73-14.12 is now the current stable release in the 5.1 series. All of Percona‘s software is open-source and free, all the details of the release can be found in the 5.1.73-14.12 milestone at Launchpad.

NOTE: Packages for Debian 7.0 (wheezy), Ubuntu 13.10 (saucy) and 14.04 (trusty) are not available for this release due to conflict with newer packages available in those releases.

Bugs Fixed:

  • Percona Server couldn’t be built with Bison 3.0. Bug fixed #1262439, upstream #71250.
  • Ignoring Query Cache Comments feature could cause server crash. Bug fixed #705688.
  • Database administrator password could be seen in plain text when debconf-get-selections was executed. Bug fixed #1018291.
  • If XtraDB variable innodb_dict_size was set, the server could attempt to remove a used index from the in-memory InnoDB data dictionary, resulting in a server crash. Bugs fixed #1250018 and #758788.
  • Ported a fix from MySQL 5.5 for upstream bug #71315 that could cause a server crash f a malformed GROUP_CONCAT function call was followed by another GROUP_CONCAT call. Bug fixed #1266980.
  • MTR tests from binary tarball didn’t work out of the box. Bug fixed #1158036.
  • InnoDB did not handle the cases of asynchronous and synchronous I/O requests completing partially or being interrupted. Bugs fixed #1262500 (upstream #54430).
  • Percona Server version was reported incorrectly in Debian/Ubuntu packages. Bug fixed #1319670.
  • Percona Server source files were referencing Maatkit instead of Percona Toolkit. Bug fixed #1174779.
  • The XtraDB version number in univ.i was incorrect. Bug fixed #1277383.

Other bug fixes: #1272732, #1167486, and #1314568.

Release notes for Percona Server 5.1.73-14.12 are available in our online documentation. Bugs can be reported on the launchpad bug tracker.

The post Percona Server 5.1.73-14.12 is now available appeared first on MySQL Performance Blog.

Examining the TokuDB MySQL storage engine file structure

July 30, 2014 - 8:07am

As we know different storage engines in MySQL have different file structures. Every table in MySQL 5.6 must have a .frm file in the database directory matching the table name. But where the rest of the data resides depends on the storage engine.

For MyISAM we have .MYI and .MYD files in the database directory (unless special settings are in place); for InnoDB we might have data stored in the single table space (typically ibdata1 in the database directory) or as file per table (or better said file per partition) producing a single file with .ibd extension for each table/partition. TokuDB as of this version (7.1.7) has its own innovative approach to storing the table contents.

I have created the table in the database test having the following structure:

CREATE TABLE `mytable` ( `id` int(10) unsigned NOT NULL, `c` varchar(15) NOT NULL, `d` int(11) NOT NULL, PRIMARY KEY (`id`), KEY `c` (`c`), KEY `d` (`d`,`c`), KEY `d_2` (`d`) ) ENGINE=TokuDB DEFAULT CHARSET=latin1

No files appear in the “test” database directory besides mytable.frm, however few files are created in the database directory:

-rwxrwx--x 1 mysql mysql 40960 Jul 29 21:01 _test_mytable_key_c_22f19b0_3_19_B_0.tokudb -rwxrwx--x 1 mysql mysql 16896 Jul 29 21:02 _test_mytable_key_d_2_22f223b_3_19_B_0.tokudb -rwxrwx--x 1 mysql mysql 16896 Jul 29 21:01 _test_mytable_key_d_22f1c9a_3_19_B_0.tokudb -rwxrwx--x 1 mysql mysql 32768 Jul 29 21:01 _test_mytable_main_22f1818_2_19.tokudb -rwxrwx--x 1 mysql mysql 65536 Jul 29 21:02 _test_mytable_status_22f1818_1_19.tokudb

As you can see the table is presented by a series of files – the “status” file, the “main” table which contains clustered fractal tree index (primary key) plus each index is stored in its own file. Note how files are named – to include the database name, file name and the key name (the name you give to the key, not the columns involved). This is followed by something like “22f1818_1_19″ which I assume is kind of internal TokuDB object identifier.

Note also (at least in my system) files are created with executable bit set. I see no reason for this and this is probably just a minor bug.

Another minor bug (or intended design limitation?) seems to be TokuDB might loose the actual table name in its file name when you alter the table. For example as I altered the table to drop one of the keys and add another one named “superkey” I see the “mytable” name is replaced with “sql_390c_247″ which looks very much like the temporary table which was used to rebuild the table:

-rwxrwx--x 1 mysql mysql 32768 Jul 29 21:15 _test_sql_390c_247_key_c_22f6f7d_3_19.tokudb -rwxrwx--x 1 mysql mysql 32768 Jul 29 21:15 _test_sql_390c_247_key_d_22f6f7d_5_19.tokudb -rwxrwx--x 1 mysql mysql 32768 Jul 29 21:15 _test_sql_390c_247_key_superkey_22f6f7d_4_19.tokudb -rwxrwx--x 1 mysql mysql 32768 Jul 29 21:14 _test_sql_390c_247_main_22f6f7d_2_19.tokudb -rwxrwx--x 1 mysql mysql 32768 Jul 29 21:14 _test_sql_390c_247_status_22f6f7d_1_19.tokudb

I like the approach of storing different indexes in the different files as this makes it much easier to drop the index as well as potentially allows the placement of indexes onto different storage if it is desired for some reason. However putting all tables in the database root is a bad idea – having substantial amount of tables, especially with few indexes, each producing huge amounts of files, makes it inconvenient to work with the database directory (which often contains other files – log files, binary logs etc.) plus it might push file systems to their limits or performance limits dealing with huge amounts of files in the single directory.

Many also like having files in the data directory as it allows, in basic configurations, to use simple Unix tools such as du to see how much space given database physically takes.

Same as InnoDB and MyISAM, TokuDB will create a separate set of files for each partition, placing it in the same directory, having same table partitioned by HASH on primary key with 4 partitions I observe:

-rwxrwx--x 1 mysql mysql 32768 Jul 29 21:22 _test_sql_390c_25b_P_p0_key_c_22f9f35_3_19.tokudb -rwxrwx--x 1 mysql mysql 32768 Jul 29 21:22 _test_sql_390c_25b_P_p0_key_d_22f9f35_5_19.tokudb -rwxrwx--x 1 mysql mysql 32768 Jul 29 21:22 _test_sql_390c_25b_P_p0_key_superkey_22f9f35_4_19.tokudb -rwxrwx--x 1 mysql mysql 32768 Jul 29 21:22 _test_sql_390c_25b_P_p0_main_22f9f35_2_19.tokudb -rwxrwx--x 1 mysql mysql 32768 Jul 29 21:22 _test_sql_390c_25b_P_p0_status_22f9f35_1_19.tokudb -rwxrwx--x 1 mysql mysql 32768 Jul 29 21:22 _test_sql_390c_25b_P_p1_key_c_22f9f8e_3_19.tokudb -rwxrwx--x 1 mysql mysql 32768 Jul 29 21:22 _test_sql_390c_25b_P_p1_key_d_22f9f8e_5_19.tokudb -rwxrwx--x 1 mysql mysql 32768 Jul 29 21:22 _test_sql_390c_25b_P_p1_key_superkey_22f9f8e_4_19.tokudb -rwxrwx--x 1 mysql mysql 32768 Jul 29 21:22 _test_sql_390c_25b_P_p1_main_22f9f8e_2_19.tokudb -rwxrwx--x 1 mysql mysql 32768 Jul 29 21:22 _test_sql_390c_25b_P_p1_status_22f9f8e_1_19.tokudb -rwxrwx--x 1 mysql mysql 32768 Jul 29 21:22 _test_sql_390c_25b_P_p2_key_c_22f9fe1_3_19.tokudb -rwxrwx--x 1 mysql mysql 32768 Jul 29 21:22 _test_sql_390c_25b_P_p2_key_d_22f9fe1_5_19.tokudb -rwxrwx--x 1 mysql mysql 32768 Jul 29 21:22 _test_sql_390c_25b_P_p2_key_superkey_22f9fe1_4_19.tokudb -rwxrwx--x 1 mysql mysql 32768 Jul 29 21:22 _test_sql_390c_25b_P_p2_main_22f9fe1_2_19.tokudb -rwxrwx--x 1 mysql mysql 32768 Jul 29 21:22 _test_sql_390c_25b_P_p2_status_22f9fe1_1_19.tokudb -rwxrwx--x 1 mysql mysql 32768 Jul 29 21:22 _test_sql_390c_25b_P_p3_key_c_22f9ffb_3_19.tokudb -rwxrwx--x 1 mysql mysql 32768 Jul 29 21:22 _test_sql_390c_25b_P_p3_key_d_22f9ffb_5_19.tokudb -rwxrwx--x 1 mysql mysql 32768 Jul 29 21:22 _test_sql_390c_25b_P_p3_key_superkey_22f9ffb_4_19.tokudb -rwxrwx--x 1 mysql mysql 32768 Jul 29 21:22 _test_sql_390c_25b_P_p3_main_22f9ffb_2_19.tokudb -rwxrwx--x 1 mysql mysql 32768 Jul 29 21:22 _test_sql_390c_25b_P_p3_status_22f9ffb_1_19.tokudb

As you can see “P_p1″ to “P_p2″ suffixes added to each of the files.

What other files exist beyond those which come from TokuDB tables?

There are few system files

-rwxrwx--x 1 mysql mysql 32768 Jul 29 21:16 tokudb.directory -rwxrwx--x 1 mysql mysql 16384 Jul 17 19:09 tokudb.environment -rwxrwx--x 1 mysql mysql 1048576 Jul 29 21:22 tokudb.rollback

Which as their name say contain “directory” – metadata about tables and indexes in the system, rollback – contains data which is needed for transaction rollback and environment contains some kind of information about environment. I did not dig into this – I just see this file is not changed after I have started the instance unlike other 2 which are changed when table structure is changed (tokudb.directory) or when database is modified (tokudb.rollback)

Next you will see

-rw------- 1 mysql mysql 0 Jul 17 19:09 __tokudb_lock_dont_delete_me_data -rw------- 1 mysql mysql 0 Jul 17 19:09 __tokudb_lock_dont_delete_me_environment -rw------- 1 mysql mysql 0 Jul 17 19:09 __tokudb_lock_dont_delete_me_logs -rw------- 1 mysql mysql 0 Jul 17 19:09 __tokudb_lock_dont_delete_me_recovery -rw------- 1 mysql mysql 0 Jul 17 19:09 __tokudb_lock_dont_delete_me_temp

I see TokuDB really tries to lock the database instance preventing concurrent access to the directory with multiple files. I think InnoDB does it more clean way placing the lock on system tablespace which does not require extra files… though the TokuDB team might have some specific reason to do it this way. Might be Oracle holds the software patent on preventing concurrent database operation by locking the file?

Finally there is the transaction log file:

-rwx------ 1 mysql mysql 37499084 Jul 29 21:34 log000000002593.tokulog25

TokuDB transaction log files are not pre-allocated like InnoDB’s but they look more similar to MySQL binary logs – they have sequentially increasing file numbers which will increment as new files are created, file itself will grow as new data is written to the log. As I understand log rotation happens during checkpoint and you would typically see only one log file.

There is a lot more for me to learn when it comes to TokuDB file layout and purpose of individual files, yet I hope this provides you with good basic overview of the TokuDB MySQL storage engine.

The post Examining the TokuDB MySQL storage engine file structure appeared first on MySQL Performance Blog.

Prevent MySQL downtime: Set max_user_connections

July 29, 2014 - 5:05am

One of the common causes of downtime with MySQL is running out of connections. Have you ever seen this error? “ERROR 1040 (00000): Too many connections.” If you’re working with MySQL long enough you surely have. This is quite a nasty error as it might cause complete downtime… transient errors with successful transactions mixed with failing ones as well as only some processes stopping to run properly causing various kinds of effects if not monitored properly.

There are number of causes for running out of connections, the most common ones involving when the Web/App server is creating unexpectedly large numbers of connections due to a miss-configuration or some script/application leaking connections or creating too many connections in error.

The solution I see some people employ is just to increase max_connections to some very high number so MySQL “never” runs out of connections. This however can cause resource utilization problems – if a large number of connections become truly active it may use a lot of memory and cause the MySQL server to swap or be killed by OOM killer process, or cause very poor performance due to high contention.

There is a better solution: use different user accounts for different scripts and applications and implement resource limiting for them. Specifically set max_user_connections:

mysql> GRANT USAGE ON *.* TO 'batchjob1'@'localhost' -> WITH MAX_USER_CONNECTIONS 10;

This approach (available since MySQL 5.0) has multiple benefits:

Security – different user accounts with only required permissions make your system safer from development errors and more secure from intruders
Preventing Running out of Connections – if there is a bug or miss-configuration the application/script will run out of connections of course but it will be the only part of the system affected and all other applications will be able to use the database normally.
Overload Protection – Additional numbers of connections limits how much queries you can run concurrently. Too much concurrency is often the cause of downtime and limiting it can reduce the impact of unexpected heavy queries running concurrently by the application.

In addition to configuring max_user_connections for given accounts you can set it globally in my.cnf as “max_user_connections=20.” This is too coarse though in my opinion – you’re most likely going to need a different number for different applications/scripts. Where max_user_connections is most helpful is in multi-tenant environments with many equivalent users sharing the system.

The post Prevent MySQL downtime: Set max_user_connections appeared first on MySQL Performance Blog.

What I learned while migrating a customer MySQL installation to Amazon RDS

July 28, 2014 - 4:00am

Hi, I recently had the experience of assisting with a migration of a customer MySQL installation to Amazon RDS (Relational Database Service). Amazon RDS is a great platform for hosting your MySQL installation and offers the following list of pros and cons:

  • You can scale your CPU, IOPS, and storage space separately by using Amazon RDS. Otherwise you need to take downtime and upgrade physical components of a rack-mounted server.
  • Backups, software version patching, failure detection, and (some) recovery is automated with Amazon RDS.
  • You lose shell access to your DB instance
  • You lose SUPER privilege for regular users. Many SUPER-type statements and commands are provided for as a Stored Procedure.
  • It is easy to set up multiple read replicas (slaves in READ_ONLY=1 mode).
  • You can set up a secondary sychronous instance for failover in the event your primary instance fails.

While this article is written to be Amazon RDS-specific it also has implications for any sort of migration.

  1. The only way to interface with RDS is through mysql client, which means loading data must be done using SQL. This means you need to use mysqldump or mydumper, which can be a large endeavour should your dataset be north of 500GB — this is a lot of single threaded activity!  Think about not only how long dumping and loading will take, but also factor in how much time it will take for replication to catch up on the hours/days/weeks your dumping and loading procedure took.  You might need to allocate more disk space and Provisioned IOPS to your RDS node in order to improve disk throughput, along with a change to innodb_flush_log_at_trx_commit, and sync_binlog.
  2. RDS is set to UTC (system_time_zone=UTC) and this cannot be changed as in Parameter Groups you will see that default_time_zone is set as Modifiable=false. This can bite you if you are planning to use RDS as a slave for a short while before failing the application over to Amazon RDS.  If you have configured binlog_format=STATEMENT on your master and you have TIMESTAMP columns, this will lead to differences in RDS data set for absolute values ’2014-07-24 10:15:00′ vs NOW(). It is also a concern for the Developer who may not be explicitly declaring their MySQL connections to set an appropriate time zone. Often the best piece of advice can be to leave all database data in UTC no matter where the server is physically located, and deal with localization at the presentation layer.
  3. Amazon RDS by default has max_allowed_packet=1MB. This is pretty low as most other configs are 16MB so if you’re using extended-insert (by default, you are), the size of each insert statement will be close to 16MB and thus can lead to errors related to “packet too big” on Amazon RDS side, thus failing out an import.
  4. Amazon RDS does not support the SUPER privilege for regular users. For example, this becomes quite a challenge as many tools (Percona Toolkit) are authored to assume you have SUPER-level access on all nodes — simple tasks become vastly more complicated as you need to think of clever workarounds (I’m looking at you pt-table-sync!).
  5. Triggers and views thus cannot be applied using the default mysqldump syntax which includes SQL DEFINER entries — these lines are there so that a user with SUPER can “grant” another user ability to execute the trigger/view. Your load will fail if you forget this.
  6. Consider running your load with –force to the mysql client, and log to disk stderr/stdout so you can review errors later. It is painful to spend 4 days loading a 500GB database only to have it fail partially through because you forgot about SQL DEFINER issue..
  7. Consider splitting the mysqldump into two phases: –no-data so you dump schema only, and then –data-only so you get just the rows. This way you can isolate faults and solve them along the way.
  8. Skipping replication events is SLOW. You don’t have ability to do sql_slave_skip_counter (since this requires SUPER), instead you need to use an Amazon RDS function of mysql.rds_skip_repl_error. Sadly this Stored Procedure takes no argument and thus it only skips one event at a time. It takes about 2-3 seconds for each execution, so if you have a lot of events to skip, that’s a problem. Having to skip ANYTHING is indication that something went wrong in the process, so if you find yourself in the unenviable position of skipping events, know that pt-table-checksum should be able to give you an idea how widespread is the data divergence issue.
  9. pt-table-sync doesn’t work against Amazon RDS as it is written to expect SUPER because it wants to do binlog_format=STATEMENT in session, but that’s not allowed. Kenny Gryp hacked me a version to just skip this check, and Kenny also reported it for inclusion in a future release of Percona Toolkit, but in the meantime you need to work around the lack of SUPER privilege.
  10. pt-table-sync is SLOW against RDS. As pt-table-sync doesn’t log a lot of detail about where time is spent, I haven’t completely isolated the source of the latency, but I suspect this is more about network round trip than anything else.
  11. innodb_log_file_size is hardcoded to 128MB in Amazon RDS, you can’t change this.  innodb_log_files_in_group is not even showing up in Parameter Groups view but SHOW GLOBAL VARIABLES reports as 2. So you’re cookin’ on 256MB, if your writes are heavy this may become a bottleneck with little workaround available in MySQL.
  12. CHANGE MASTER isn’t available in RDS. You define RDS as a slave by calling a stored procedure where you pass the appropriate options such as CALL mysql.rds_set_external_master.

For those of you wondering about the SUPER-privilege, I was fortunate that Bill Karwin from Percona’s Support team took the time to review my post and suggested I dig into this deeper, turns out that Amazon didn’t hack MySQL to remove the SUPER privilege, but instead run the Stored Procedures with security_type of DEFINER:

mysql> select db,name,type,language,security_type,definer from proc where name = 'rds_external_master' G *************************** 1. row *************************** db: mysql name: rds_external_master type: PROCEDURE language: SQL security_type: DEFINER definer: rdsadmin@localhost 1 row in set (0.08 sec)

mysql> show grants for 'rdsadmin'@'localhost'; +------------------------------------------------------------------------------------------------------+ | Grants for rdsadmin@localhost | +------------------------------------------------------------------------------------------------------+ | GRANT ALL PRIVILEGES ON *.* TO 'rdsadmin'@'localhost' IDENTIFIED BY PASSWORD 'XXX' WITH GRANT OPTION | +------------------------------------------------------------------------------------------------------+ 1 row in set (0.07 sec)

So for those of you working with Amazon RDS, I hope that this list saves you some time and helps our your migration!  If you get stuck you can always contact Percona Consulting for assistance.

The post What I learned while migrating a customer MySQL installation to Amazon RDS appeared first on MySQL Performance Blog.

Monitoring MySQL flow control in Percona XtraDB Cluster 5.6

July 25, 2014 - 7:41am

Monitoring flow control in a Galera cluster is very important. If you do not, you will not understand why writes may sometimes be stalled. Percona XtraDB Cluster 5.6 provides 2 status variables for such monitoring: wsrep_flow_control_paused and wsrep_flow_control_paused_ns. Which one should you use?

What is flow control?

Flow control does not exist with regular MySQL replication, but only with Galera replication. It is simply the mechanism nodes are using when they are not able to keep up with the write load: to keep replication synchronous, the node that is starting to lag instructs the other nodes that writes should be paused for some time so it does not get too far behind.

If you are not familiar with this notion, you should read this blogpost.

Triggering flow control and graphing it

For this test, we’ll use a 3-node Percona XtraDB Cluster 5.6 cluster. On node 3, we will adjust gcs.fc_limit so that flow control is triggered very quickly and then we will lock the node:

pxc3> set global wsrep_provider_options="gcs.fc_limit=1"; pxc3> flush tables with read lock;

Now we will use sysbench to insert rows on node 1:

$ sysbench --test=oltp --oltp-table-size=50000 --mysql-user=root --mysql-socket=/tmp/pxc1.sock prepare

Because of flow control, writes will be stalled and sysbench will hang. So after some time, we will release the lock on node 3:

pxc3> unlock tables;

During the whole process, wsrep_flow_control_paused and wsrep_flow_control_paused_ns are recorded every second with mysqladmin ext -i1. We can then build a graph of the evolution of both variables:

While we can clearly see when flow control was triggered on both graphs, it is much easier to know when flow control was stopped with wsrep_flow_control_paused_ns. It would be even more obvious if we have had several timeframes when flow control is in effect.


Monitoring a server is obviously necessary if you want to be able to catch issues. But you need to look at the right metrics. So don’t be scared if you are seeing that wsrep_flow_control_paused is not 0: it simply means that flow control has been triggered at some point since the server started up. If you want to know what is happening right now, prefer wsrep_flow_control_paused_ns.

The post Monitoring MySQL flow control in Percona XtraDB Cluster 5.6 appeared first on MySQL Performance Blog.

Putting MySQL Fabric to Use: July 30 webinar

July 24, 2014 - 3:13pm

Martin and I have recently been blogging together about MySQL Fabric (in case you’ve missed this, you can find the first post of the series here), and on July 30th, we’re going to be presenting a webinar on this topic titled “Putting MySQL Fabric to Use.”

The focus of the webinar is to help you get started quickly on this technology, so we’ll include very few slides (mostly just a diagram or two) and then jump straight into shared screen mode, with lots of live console and source code examples.

In order to make the best use of time, we won’t show you how to install and configure MySQL Fabric. However, we can point you to a few resources to help you get ready and even follow our examples as we go:

  • The official manual is an obvious starting point
  • Our second post in the series includes configuration instructions
  • This git repo contains the test environment we’ll use to run our demos. Specifically, we’ll use the sharding branch, so if you intend to follow our examples as we go, we recommend checking that one out.

If you’re interested, you can register for this webinar here, and if there’s something specific you’d like to see (we had a request for PHP examples in the comments to our last post) feel free to post that as a comment. We can’t promise we’ll be able to cover all requests during the webinar, but we’ll incorporate examples to the repo as time allows.

Hope to see you then!

The post Putting MySQL Fabric to Use: July 30 webinar appeared first on MySQL Performance Blog.

DBaaS, OpenStack and Trove 101: Introduction to the basics

July 24, 2014 - 12:00am

We’ll be publishing a series of posts on OpenStack and Trove over the next few weeks, diving into their usage and purpose. For readers who are already familiar with these technologies, there should be no doubt as to why we are incredibly excited about them, but for those who aren’t, consider this a small introduction to the basics and concepts.

What is Database as a Service (DBaaS)?
In a nutshell, DBaaS – as it is frequently referred to – is a loose moniker to the concept of providing a managed cloud-based database environment accessible by users, applications or developers. Its aim is to provide a full-fledged database environment, while minimizing the administrative turmoil and pains of managing the surrounding infrastructure.

Real life example: Imagine you are working on a new application that has to be accessible from multiple regions. Building and maintaining a large multiregion setup can be very expensive. Furthermore, it introduces additional complexity and strain on your system engineers once timezones start to come into play. The challenge of having to manage machines in multiple datacenters won’t simplify your release cycle, nor increase your engineers’ happiness.

Let’s take a look at some of the questions DBaaS could answer in a situation like this:

- How do I need to size my machines, and where should I locate them?
Small environments require less computing power and can be a good starting point, although this also means they may not be as well-prepared for future growth. Buying larger-scale and more expensive hardware and hosting can be very expensive and can be a big stumbling block for a brand new development project. Hosting machines in multiple DC’s could also introduce administrative difficulties, like having different SLA’s and potential issues setting up WAN or VPN communications. DBaaS introduces an abstraction layer, so these consideration aren’t yours, but those of the company offering it, while you get to reap all the rewards.

- Who will manage my environment from an operational standpoint?
Staffing considerations and taking on the required knowledge to properly maintain a production database are often either temporarily sweeped under the rug or, when the situation turns out badly, a cause for the untimely demise of quite a few young projects. Rather than think about how long ago you should have applied that security patch, wouldn’t it be nice to just focus on managing the data itself, and be otherwise confident that the layers beyond it are managed responsibly?

- Have a sudden need to scale out?
Once you’re up and running, enjoying the success of a growing use base, your environment will need to scale accordingly. Rather than think long and hard on the many options available, as well as the logistics attached to those changes, your DBaaS provider could handle this transparently.

Popular public options: Here are a few names of public services you may have come across already that fall under the DBaaS moniker:

- Amazon RDS
- Rackspace cloud databases
- Microsoft SQLAzure
- Heroku
- Clustrix DBaaS

What differentiates these services from a standard remote database is the abstraction layer that fully automates their backend, while still offering an environment that is familiar to what your development team is used to (be it MySQL, MongoDB, Microsoft SQLServer, or otherwise). A big tradeoff to using these services is that you are effectively trusting an external company with all of your data, which might make your legal team a bit nervous.

Private cloud options?
What if you could offer your team the best of both worlds? Or even provide a similar type of service to your own customers? Over the years, a lot of platforms have been popping up to allow effective management and automation of virtual environments such as these, allowing you to effectively “roll your own” DBaaS. To get there, there are two important layers to consider:

  • Infrastructure Management, also referred to as Infrastructure-as-a-Service (IaaS), focusing on the logistics of spinning up virtual machines and keeping their required software packages running.
  • Database Management, previously referred to DBaaS, transparently coordinating multiple database instances to work together and present themselves as a single, coherent data repository.

Examples of IaaS products:
- OpenStack
- OpenQRM

Ecample of DBaaS:
- Trove

Main Advantages of DBaaS
For reference, the main reasons why you might want to consider using an existing DBaaS are as follows:

- Reduced Database management costs

DBaaS removes the amount of maintenance you need to perform on isolated DB instances. You offload the system administration of hardware, OS and database to either a dedicated service provider, or in the case where you are rolling your own, allow your database team to more efficiently manage and scale the platform (public vs private DBaaS).

- Simplifies certain security aspects

If you are opting to use a DBaaS platform, the responsibility of worrying about this or that patch being applied falls to your service provider, and you can generally assume that they’ll keep your platform secure from the software perspective.

- Centralized management

One system to rule them all. A guarantee of no nasty surprises concerning that one ancient server that should have been replaced years ago, but you never got around to it. As a user of DBaaS, all you need to worry about is how you interface with the database itself.

- Easy provisioning

Scaling of the environment happens transparently, with minimal additional management.

- Choice of backends

Typically, DBaas providers offer you the choice of a multitude of database flavors, so you can mix and match according to your needs.

Main Disadvantages
- Reduced visibility of the backend

Releasing control of the backend requires a good amount of trust in your DBaaS provider. There is limited or no visibility into how backups are run and maintained, which configuration modifications are applied, or even when and which updates will be implemented. Just as you offload your responsibilities, you in turn need to rely on an SLA contract.

- Potentially harder to recover from catastrophic failures

Similarly to the above, unless your service providers have maintained thorough backups on your behalf, the lack of direct access to the host machines means that it could be much harder to recover from database failure.

- Reduced performance for specific applications

There’s a good chance that you are working on a shared environment. This means the amount of workload-specific performance tuning options is limited.

- Privacy and Security concerns

Although it is much easier to maintain and patch your environment. Having a centralized system also means you’re more prone to potential attacks targeting your dataset. Whichever provider you go with, make sure you are intimately aware of the measures they take to protect you from that, and what is expected from your side to help keep it safe.

Conclusion: While DBaaS is an interesting concept that introduces a completely new way of approaching an application’s database infrastructure, and can bring enterprises easily scalable, and financially flexible platforms, it should not be considered a silver bullet. Some big tradeoffs need to be considered carefully from the business perspective, and any move there should be accompanied with careful planning and investigation of options.

Embracing the immense flexibility these platforms offer, though, opens up a lot of interesting perspectives too. More and more companies are looking at ways to roll their own “as-a-Service”, provisioning completely automated hosted platforms for customers on-demand, and abstracting their management layers to allow them to be serviced by smaller, highly focused technical teams.

Stay tuned: Over the next few weeks we’ll be publishing a series of posts focusing on the combination of two technologies that allow for this type of flexibility: OpenStack and Trove.

The post DBaaS, OpenStack and Trove 101: Introduction to the basics appeared first on MySQL Performance Blog.

Why TokuDB hates Transparent HugePages

July 23, 2014 - 3:00am

If you try to install the TokuDB storage engine on a modern Linux distribution it might fail with following error message:

2014-07-17 19:02:55 13865 [ERROR] TokuDB will not run with transparent huge pages enabled.
2014-07-17 19:02:55 13865 [ERROR] Please disable them to continue.
2014-07-17 19:02:55 13865 [ERROR] (echo never > /sys/kernel/mm/transparent_hugepage/enabled)

You might be curious why TokuDB refuses to start with Transparent HugePages. Are they not a good thing… allowing smaller kernel page tables and less TLB misses when accessing data in the buffer pool? I was curious, so I asked Tim Callaghan this very question.

This problem originates with TokuDB using jemalloc memory allocator, which uses a particular trick to deal with memory fragmentation. The classical problem with memory allocators is fragmentation – if you allocated a say 2MB chunk from the operating system (typically using mmap),  as the process runs it is likely some of that 2MB memory block will become free but not all of it, hence it can’t be given back to operating system completely. jemalloc uses a clever trick being able to give back portions of memory allocated in such a way through madvise(…, MADV_DONTNEED) call.

Now what happens when you use transparent huge pages? In this case the operating system (and CPU, really) works with pages of a much larger size which only can be unmapped from the address space in its entirety – which does not work when smaller objects are freed which produce smaller free “holes.”

As a result, without being able to free memory efficiently the amount of allocated memory may grow unbound until the process starts to swap out – and in the end being killed by “out of memory” killer at least under some workloads. This is not a behavior you want to see from the database server. As such requiring to disable huge pages is a better choice.

Having said that this is pretty crude requirement/solution – disabling huge pages on complete operating system image to make one application work while others might be negatively impacted. I hope with a future jemalloc version/kernel releases there will be solution where jemalloc simply prevents huge pages usage for its allocations.

Using jmalloc and its approach to remove pages from resident space also makes TokuDB a lot different than typical MySQL instances running Innodb from the process space. With Innodb VSZ and RSS are often close. In fact we often monitor VSZ to ensure it is not excessively large to avoid danger of process starting to swap actively or be killed with OOM killer. TokuDB however often can look like this

[root@smt1 mysql]# ps aux | grep mysqld
mysql 14604 21.8 50.6 12922416 4083016 pts/0 Sl Jul17 1453:27 /usr/sbin/mysqld –basedir=/usr –datadir=/var/lib/mysql –plugin-dir=/usr/lib64/mysql/plugin –user=mysql –log-error=/var/lib/mysql/smt1.pz.percona.com.err –pid-file=/var/lib/mysql/smt1.pz.percona.com.pid
root 28937 0.0 0.0 103244 852 pts/2 S+ 10:38 0:00 grep mysqld

In this case TokuDB is run with defaults on 8GB system – it takes approximately 50% of memory in terms of RSS size, however the VSZ of the process is over 12GB – this is a lot more than memory available.

This is completely fine for TokuDB. If I would not have Transparent HugePages disabled, though, my RSS would be a lot closer to VSZ causing intense swapping or even process killed by OOM killer.

In addition to explaining this to me, Tim Callaghan was also kind enough to share some links on this issue from other companies such as Oracle, NuoDB , Splunk, SAP, SAP(2), which provide more background information on this topic.

The post Why TokuDB hates Transparent HugePages appeared first on MySQL Performance Blog.

Reference architecture for a write-intensive MySQL deployment

July 22, 2014 - 7:31am

We designed Percona Cloud Tools (both hardware and software setup) to handle a very high-intensive MySQL write workload. For example, we already observe inserts of 1bln+ datapoints per day. So I wanted to share what kind of hardware we use to achieve this result.

Let me describe what we use, and later I will explain why.


  • Chassis: Supermicro SC825TQ-R740LPB 2U Rackmount Chassis
  • Motherboard: Supermicro X9DRI-F dual socket
  • CPU: Dual Intel Xeon Ivy Bridge E5-2643v2 (6x 3.5Ghz cores, 12x HT cores, 25M L3)
  • Memory: 256GB (16x 16GB 256-bit quad-channel) ECC registered DDR3-1600
  • Raid: LSI MegaRAID 9260-4i 4-port 6G/s hardware RAID controller, 512M buffer
  • MainStorage: PCIe SSD HGST FlashMAX II 4.8TB
  • Secondary Storage (OS, logs): RAID 1 over 2x 3TB hard drives


When selecting hardware for your application, you need to look at many aspects – typically you’re looking for a solution for which you already have experience in working with and has also proved to be the most efficient option. For us it has been as follows:

Cloud vs Bare Metal
We have experience having hardware hosted at the data center as well as cash for upfront investments in hardware so we decided to go for physical self-hosted hardware instead of the cloud. Going this route also gave us maximum flexibility in choosing a hardware setup that was the most optimal for our application rather than selecting one of the stock options.

Scale Up vs Scale Out
We have designed a system from scratch to be able to utilize multiple servers through sharding – so our main concern is choosing the most optimal configuration for the server and provisioning servers as needed. In addition to raw performance we also need to consider power usage and overhead of managing many servers which typically makes having slightly more high-end hardware worth it.

Resource Usage
Every application uses resources in different ways so an optimal configuration will be different depending on your application. Yet all applications use the same resources you need to consider. Typically you want to plan for all of your resources to be substantially used – providing some margin for spikes and maintenance.


  • Our application processes a lot of data and uses the TokuDB storage engine which uses a lot of CPU for compression, so we needed powerful CPUs.
  • Many MySQL functions are not parallel, think executing single query or Alter table so we’re going for CPU with faster cores rather than larger amount of cores. The resulting configuration with 2 sockets giving 12 cores and 24 threads is good enough for our workloads.
  • Lower end CPUs such as Xeon E3 have very attractive price/performance but only support 32GB of memory which was not enough for our application.


  • For database boxes memory is mainly used as a cache, so depending on your application you may be better off investing in memory or storage for optimal performance. Check out this blog post for more details.
  • Accessing data in memory is much faster than even on the fastest flash storage so it is still important.
    For our workload having recent data in memory is very important so we get as much “cheap” memory as we can populating all 16 slots with 16GB dimms which have attractive cost per GB at this point.

There are multiple uses for the storage so there are many variables to consider

  • Bandwidth
    • We need to be able access data on the storage device quickly and with stable response time. HGST FlashMax II has been able to meet these very demanding needs.
  • Endurance
    • When using flash storage you need to worry about endurance – how much beating with writes flash storage can handle before it wears out. Some low cost MLC SSDs would wear out in the time frame of weeks if being written with maximum speed. HGST FlashMax II has endurance rating of 10 Petabytes written (for a random workload) – 30 Petabytes written (for a sequential workload)
    • We also use TokuDB storage engine which significantly reduces amount of writes compared to Innodb.
  • Durability
    • Does the storage provide true durability with data guaranteed to be persisted when write is acknowledged at the operating system level when power goes down or is loss possible?
      We do not want to risk database corruption in case of power failure so we were looking for storage solution which guarantees durability.
      HGST FlashMax II guarantees durability which has been confirmed by our stress tests.
  • Size
    • To scale application storage demands you need to scale both number of IO operations storage can handle and storage size. For flash storage it is often the size which becomes limiting factor.
      HGST FlashMax II 4.8 TB capacity is best available on the market which allows us to go “All Flash” and achieve very quick data access to all our data set.
  • Secondary Storage
    • Not every application need requires flash storage properties.
    • We have secondary storage with conventional drives for operating system and logs.
      Sequential read/write pattern works well with low cost conventional drives and also allow us to increase flash life time, having it handling less writes.
    • We’re using RAID with BBU for secondary storage to be able to have fully durable binary logs without paying high performance penalty.

There are arguments that SATA SSD provides just a good enough performance for MySQL and there is no need for PCIe. While these arguments are valid in one dimension, there are several more to consider.

First, like I said PCIe SSD still provides a best absolute response time and it is an important factor for an end user experience in SaaS systems like Percona Cloud Tools.
Second, consider maintenance operations like backup, ALTER TABLES or slave setups. While these operations are boring and do not get as much attention as a response time or throughput in benchmarks, it is still operations that DBAs performs basically daily, and it is very important to finish a backup or ALTER TABLE in a predictable time, especially on 3-4TB datasize range. And this is where PCIe SSD performs much better than SATA SSDs. For SATA SSD, especially bigger size, write endurance is another point of concern.

Why TokuDB engine?
The TokuDB engine is the best when it comes to insert operations to a huge dataset, and few more factors makes it a no-brainer:

  • TokuDB compression is a huge win. I estimate into this storage ( FlashMAX II 4.8TB) we will fit about 20-30TB of raw data.
  • TokuDB is SSD friendly, as it performs much less data writes per INSERT operation than InnoDB, which greatly extends SSD (which is, well, expensive to say the least) lifetime.

The post Reference architecture for a write-intensive MySQL deployment appeared first on MySQL Performance Blog.

Percona XtraDB Cluster 5.6.19-25.6 is now available

July 21, 2014 - 12:24pm

Percona is glad to announce the new release of Percona XtraDB Cluster 5.6 on July 21st 2014. Binaries are available from downloads area or from our software repositories. We’re also happy to announce that Ubuntu 14.04 LTS users can now download, install, and upgrade Percona XtraDB Cluster 5.6 from Percona’s software repositories.

Based on Percona Server 5.6.19-67.0 including all the bug fixes in it, Galera Replicator 3.6, and on Codership wsrep API 25.6, Percona XtraDB Cluster 5.6.19-25.6 is now the current General Availability release. All of Percona‘s software is open-source and free, and all the details of the release can be found in the 5.6.19-25.6 milestone at Launchpad.

New Features:

  • Percona XtraDB Cluster now supports storing the Primary Component state to disk by setting the pc.recovery variable to true. The Primary Component can then recover automatically when all nodes that were part of the last saved state reestablish communications with each other. This feature can be used for automatic recovery from full cluster crashes, such as in the case of a data center power outage and graceful full cluster restarts without the need for explicitly bootstrapping a new Primary Component.
  • When joining the cluster, the state message exchange provides us with gcache seqno limits. That information is now used to choose a donor through IST first, and, if this is not possible, only then SST is attempted. The wsrep_sst_donor setting is honored, though, and it is also segment aware.
  • An asynchronous replication slave thread was stopped when the node tried to apply the next replication event while the node was in non-primary state. But it would then remain stopped after the node successfully re-joined the cluster. A new variable, wsrep_restart_slave, has been implemented which controls if the MySQL slave should be restarted automatically when the node re-joins the cluster.
  • Handling install message and install state message processing has been improved to make group forming a more stable process in cases when many nodes are joining the cluster.
  • A new wsrep_evs_repl_latency status variable has been implemented which provides the group communication replication latency information.
  • Node consistency issues with foreign key grammar have been fixed. This fix introduces two new variables: wsrep_slave_FK_checks and wsrep_slave_UK_checks. These variables are set to TRUE and FALSE respectively by default. They control whether Foreign Key and Unique Key checking is done for applier threads.

Bugs Fixed:

  • Fixed the race condition in Foreign Key processing that could cause assertion. Bug fixed #1342959.
  • The restart sequence in scripts/mysql.server would fail to capture and return if the start call failed to start the server. As a result, a restart could occur that failed upon start-up, and the script would still return 0 as if it worked without any issues. Bug fixed #1339894.
  • Updating a unique key value could cause the server to hang if a slave node had enabled parallel slaves. Bug fixed #1280896.
  • Percona XtraDB Cluster has implemented threadpool scheduling fixes. Bug fixed #1333348.
  • garbd was returning an incorrect return code, ie. when garbd was already started, return code was 0. Bug fixed #1308103.
  • rsync SST would silently fail on joiner when the rsync server port was already taken. Bug fixed #1099783.
  • When gmcast.listen_addr was configured to a certain address, the local connection point for outgoing connections was not bound to the listen address. This would happen if the OS has multiple interfaces with IP addresses in the same subnet. The OS would pick the wrong IP for a local connection point and other nodes would see connections originating from an IP address which was not listened to. Bug fixed #1240964.
  • An issue with re-setting galera provider (in wsrep_provider_options) has been fixed. Bug fixed #1260283.
  • Variable wsrep_provider_options couldn’t be set in runtime if no provider was loaded. Bug fixed #1260290.
  • Percona XtraDB Cluster couldn’t be built with Bison 3.0. Bug fixed #1262439.
  • MySQL wasn’t handling exceeding the max writeset size wsrep error correctly. Bug fixed #1270920.
  • Fixed the issue which caused a node to hang/fail when SELECTs/SHOW STATUS was run after FLUSH TABLES WITH READ LOCK was used on a node with wsrep_causal_reads set to 1 while there was a DML on other nodes. Bug fixed #1271177.
  • Lowest group communication layer (evs) would fail to handle the situation properly when a large number of nodes would suddenly start recognizing each other. Bugs fixed #1271918 and #1249805.
  • Percona XtraBackup SST would fail if the progress option was used with a large number of files. Bug fixed #1294431.

NOTE: When performing an upgrade from an older 5.6 version on Debian/Ubuntu systems, in order to upgrade the Galera package correctly, you’ll need to pin the Percona repository and run: apt-get install percona-xtradb-cluster-56. This is required because older Galera deb packages have an incorrect version number. The correct wsrep_provider_version after upgrade should be 3.6(r3a949e6).

This release contains 50 fixed bugs. The complete list of fixed bugs can be found in our release notes.

Release notes for Percona XtraDB Cluster 5.6.19-25.6 are available in our online documentation along with the installation and upgrade instructions.

Help us improve our software quality by reporting any bugs you encounter using our bug tracking system. As always, thanks for your continued support of Percona!

Percona XtraDB Cluster Errata can be found in our documentation.

The post Percona XtraDB Cluster 5.6.19-25.6 is now available appeared first on MySQL Performance Blog.

Percona Monitoring Plugins 1.1.4 release

July 21, 2014 - 8:42am

Percona is glad to announce the release of Percona Monitoring Plugins 1.1.4.


* Added login-path support to Nagios plugins with MySQL client 5.6 (bug 1338549)
* Added a new threshold option for delayed slaves to pmp-check-mysql-replication-delay (bug 1318280)
* Added delayed slave support to pmp-check-mysql-replication-running (bug 1332082)
* Updated Nagios plugins and Cacti script to leverage lock-free SHOW SLAVE STATUS in Percona Server (bug 1297442)
* Fixed pmp-check-mysql-replication-delay integer-float issue with MariaDB and MySQL 5.6 (bugs 1245934, 1295795)
* ss_get_rds_stats.py was not installed with 755 permissions from the package (bug 1316943)
* Cacti MySQL template item “handler_savepoint_rollback” was GAUGE type instead of DERIVE (bug 1334173)
* Fixed Zabbix running-slave check issue on some Debian systems (bug 1310723)

A new tarball is available from downloads area or RPM and DEB packages from our software repositories. The plugins are fully supported for customers with a Percona Support contract and free installation services are provided as part of some contracts. In addition as part of Percona’s Remote DBA installation and setup of these tools are included with our services. You can find links to the documentation, forums and more at the project homepage.

About Percona Monitoring Plugins
Percona Monitoring Plugins are high-quality components to add enterprise-grade MySQL monitoring and graphing capabilities to your existing in-house, on-premises monitoring solutions. The components are designed to integrate seamlessly with widely deployed solutions such as Nagios, Cacti and Zabbix and are delivered in the form of templates, plugins, and scripts which make it easy to monitor MySQL performance.

The post Percona Monitoring Plugins 1.1.4 release appeared first on MySQL Performance Blog.

A schema change inconsistency with Galera Cluster for MySQL

July 21, 2014 - 12:00am

I recently worked on a case where one node of a Galera cluster had its schema desynchronized with the other nodes. And that was although Total Order Isolation method was in effect to perform the schema changes. Let’s see what happened.


For those of you who are not familiar with how Galera can perform schema changes, here is a short recap:

  • Two methods are available depending on the value of the wsrep_OSU_method setting. Both have benefits and drawbacks, it is not the main topic of this post.
  • With TOI (Total Order Isolation), a DDL statement is performed at the same point in the replication flow on all nodes, giving strong guarantees that the schema is always identical on all nodes.
  • With RSU (Rolling Schema Upgrade), a DDL statement is not replicated to the other nodes. Until the DDL statement has been executed on all nodes, the schema is not consistent everywhere (so you should be careful not to break replication).

You can look at the official document here.

If you read carefully the section on TOI, you will see that “[...] TOI transactions will never fail certification and are guaranteed to be executed.” But also that “The system replicates the TOI query before execution and there is no way to know whether it succeeds or fails. Thus, error checking on TOI queries is switched off.”

Confusing? Not really. It simply means that with TOI, a DDL statement will always pass certification. But if for some reason, the DDL statement fails on one of the nodes, it will not be rolled back on the other nodes. This opens the door for schema inconsistencies between nodes.

A test case

Let’s create a table on a 3-node Percona XtraDB Cluster 5.6 cluster and insert a few rows:

pxc1> create table t (id int not null auto_increment primary key, c varchar(10)); pxc1> insert into t (c) values ('aaaa'),('bbbb');

Then on node 3, let’s introduce a schema change on t that can make other schema changes fail:

pxc3> set global wsrep_OSU_method=RSU; pxc3> alter table t add d int; pxc3> set global wsrep_OSU_method=TOI;

As the schema change was done on node 3 with RSU, it is not replicated to the other nodes.

Now let’s try another schema change on node 1:

pxc1> alter table t add d varchar(10); Query OK, 0 rows affected (0,14 sec) Records: 0 Duplicates: 0 Warnings: 0

Apparently everything went well and indeed on node 1 and 2, we have the correct schema:

pxc2>show create table tG *************************** 1. row *************************** Table: t Create Table: CREATE TABLE t ( id int(11) NOT NULL AUTO_INCREMENT, c varchar(10) DEFAULT NULL, d varchar(10) DEFAULT NULL, PRIMARY KEY (id) ) ENGINE=InnoDB AUTO_INCREMENT=8 DEFAULT CHARSET=latin1

But on node 3, the statement failed so the schema has not been changed:

pxc3> show create table tG *************************** 1. row *************************** Table: t Create Table: CREATE TABLE t ( id int(11) NOT NULL AUTO_INCREMENT, c varchar(10) DEFAULT NULL, d int(11) DEFAULT NULL, PRIMARY KEY (id) ) ENGINE=InnoDB AUTO_INCREMENT=8 DEFAULT CHARSET=latin1

The error is visible in the error log of node 3:

2014-07-18 10:37:14 9649 [ERROR] Slave SQL: Error 'Duplicate column name 'd'' on query. Default database: 'repl_test'. Query: 'alter table t add d varchar(10)', Error_code: 1060 2014-07-18 10:37:14 9649 [Warning] WSREP: RBR event 1 Query apply warning: 1, 200 2014-07-18 10:37:14 9649 [Warning] WSREP: Ignoring error for TO isolated action: source: 577ffd51-0e52-11e4-a30e-4bde3a7ad3f2 version: 3 local: 0 state: APPLYING flags: 65 conn_id: 3 trx_id: -1 seqnos (l: 17, g: 200, s: 199, d: 199, ts: 7722177758966)

But of course it is easy to miss. And then a simple INSERT can trigger a shutdown on node3:

pxc2> insert into t (c,d) values ('cccc','dddd'); Query OK, 1 row affected (0,00 sec)

will trigger this on node 3:

2014-07-18 10:42:27 9649 [ERROR] Slave SQL: Column 2 of table 'repl_test.t' cannot be converted from type 'varchar(10)' to type 'int(11)', Error_code: 1677 2014-07-18 10:42:27 9649 [Warning] WSREP: RBR event 3 Write_rows apply warning: 3, 201 2014-07-18 10:42:27 9649 [Warning] WSREP: Failed to apply app buffer: seqno: 201, status: 1 at galera/src/trx_handle.cpp:apply():340 [...] 2014-07-18 10:42:27 9649 [Note] WSREP: Received NON-PRIMARY. 2014-07-18 10:42:27 9649 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 201) 2014-07-18 10:42:27 9649 [Note] WSREP: Received self-leave message. 2014-07-18 10:42:27 9649 [Note] WSREP: Flow-control interval: [0, 0] 2014-07-18 10:42:27 9649 [Note] WSREP: Received SELF-LEAVE. Closing connection. 2014-07-18 10:42:27 9649 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 201) 2014-07-18 10:42:27 9649 [Note] WSREP: RECV thread exiting 0: Success 2014-07-18 10:42:27 9649 [Note] WSREP: recv_thread() joined. 2014-07-18 10:42:27 9649 [Note] WSREP: Closing replication queue. 2014-07-18 10:42:27 9649 [Note] WSREP: Closing slave action queue. 2014-07-18 10:42:27 9649 [Note] WSREP: bin/mysqld: Terminated.


As on regular MySQL, schema changes are challenging with Galera. Some subtleties can create a lot of troubles if you are not aware of them. So before running DDL statement, make sure you fully understand how TOI and RSU methods work.

The post A schema change inconsistency with Galera Cluster for MySQL appeared first on MySQL Performance Blog.

Systemtap solves phantom MySQLd SIGTERM / SIGKILL issue

July 18, 2014 - 7:38am

The Percona Managed Services team recently faced a somewhat peculiar client issue. We’d receive pages about their MySQL service being unreachable. However, studying the logs showed nothing out of the ordinary…. for the most part it appeared to be a normal shutdown and there was nothing in anyone’s command history nor a cron task to speak of that was suspicious.

This is one of those obscure and peculiar (read: unique) issues that triggered an old memory; I’d seen this behavior before and I had just the tool to catch the culprit in the act.

Systemtap made diagnostics of this issue possible and I can’t state enough how much of a powerful and often under-utilized tool set systemtap really is.

cat > signals.stp << EOF
probe signal.send {
if (sig_name == “SIGKILL” || sig_name == “SIGTERM”)
printf(“[%s] %s was sent to %s (pid:%d) by %s uid:%dn”,
ctime(gettimeofday_s()), sig_name, pid_name, sig_pid, execname(), uid())

sudo stap ./signals.stp > signals.log 2>signals.err

grep mysqld signals.log
[Wed Jun 11 19:03:23 2014] SIGKILL was sent to mysqld (pid:8707) by cfagent uid:0
[Fri Jun 13 21:37:27 2014] SIGKILL was sent to mysqld (pid:6583) by cfagent uid:0
[Sun Jun 15 05:05:34 2014] SIGKILL was sent to mysqld (pid:19818) by cfagent uid:0
[Wed Jul 9 07:03:47 2014] SIGKILL was sent to mysqld (pid:4802) by cfagent uid:0

Addendum: It had been so long since I had used this tooling that I could not remember the original source from which I derived the module above; some cursory searching to rectify this issue for this blog post found this original source by Eugene Teo of Red Hat made available under GPLv2.

From this we were able to show that cfagent was killing the mysqld process presumably via a misconfigured job; this information was returned to the client and this has continued to be run in production for two months now at the client’s request with no issues to speak of.

This is by no means the limit to what systemtap can be used to achieve; you can hook into functions though whilst you may need to install the debug packages to find what functions are available run for example:

sudo stap -L 'process("/usr/sbin/mysqld").function("*")' > /tmp/mysql_stapfunc
head /tmp/mysql_stapfunc

This is also true of the kernel using sudo stap -L 'kernel.function("*")' > /tmp/kernel_stapfunc however you must be booted into a debug kernel for this to function.

Systemtap is more than a worthy tool to have at your disposal with plenty of examples available.

Finally I invite you to join me July 23 at 10 a.m. Pacific time for my webinar, “What Every DBA Needs to Know About MySQL Security.” This detailed technical webinar provides insight into best security practices for either setting up a new MySQL environment or upgrading the security of an existing one. I hope to see you there!

The post Systemtap solves phantom MySQLd SIGTERM / SIGKILL issue appeared first on MySQL Performance Blog.

Q&A: Even More Deadly Mistakes of MySQL Development

July 17, 2014 - 9:57am

On Wednesday I gave a presentation on “How to Avoid Even More Common (but Deadly) MySQL Development Mistakes” for Percona MySQL Webinars.  If you missed it, you can still register to view the recording and my slides.

Thanks to everyone who attended, and especially to folks who asked the great questions.  I answered as many as we had time for  during the session, but here are all the questions with my complete answers:

Q: Disk bandwidth also not infinite

Indeed, you’re right!

We discussed in the webinar the impact on network bandwidth from using column wildcards in queries like SELECT *, but it’s also possible that using SELECT * can impact disk operations. Varchar, Blob, or Text columns can be stored on extra pages in the database, and if you include those columns in your query needlessly, it can cause the storage engine to do a lot of seeks and page reads unnecessarily.

For more details on string storage in InnoDB, see Peter Zaitsev’s blog on Blob Storage in Innodb.

Q: How many tables can be joined in a single query? What is the optimal number of joins?

MySQL has a limit of 63 table references in a given query. This limits how many JOIN operations you can do, and also limits the number of UNIONs. Actually you can go over this limit if your JOIN or UNION don’t reference any tables, that is, create a derived table of one row of expressions.

If you do join a lot of tables (or even self-join the same table many times), you’re likely to hit a practical scaling limit long before you reach 63 table references. The practical limit in your case depends on many factors, including the length of the tables, the data types, the type of join expressions in your queries, and your physical server’s capabilities. It’s not a fixed limit I can cite for you.

If you think you need dozens of table references in a single query, you should probably step back and reconsider your database design or your query design.

I often see this type of question (“what is the limit on the number of joins?”) when people try to use key/value tables, also called Entity-Attribute-Value, and they’re trying to pivot attributes from rows into columns, as if the table were stored in a conventional way with one column per attribute. This is a broken design for many reasons, and the scalability of many-way joins is just one problem with it.

Q: How many indexes can be created in a single table? Any limitation? What is the optimal number of indexes?

All MySQL storage engines support at least 16 indexes per table.

As far as the optimal number of indexes, I don’t pay attention to the number of indexes (as long as it remains lower than the max of 16). I try to make sure I have the right indexes for my queries. If you put an arbitrary cap of for example 8 or 10 indexes on a given table, then you might be running queries that lack a needed index, and the unnecessary extra cost of running that query is probably greater than the cost of maintaining the one extra index it needs.

That said, there are cases where you have such variation in query types that there’s no way to have optimal indexes to cover every possible case. Given that you can have multi-column indexes, and multi-column indexes with columns in different orders, there are n-factorial possible indexes on a table with n columns.

Q: There is a table with 3 columns: id(int), user_id(int), day(date). There is a high chance same user_id will ‘exist’ for every day. I read data by “where user_id = some_id” (very high throuhput) and delete all entries once a day by cron using “where sent_date = ’2014-01-01′ “. Have approx 6M rows per day deletion is pretty painfull. Will partitioning by column ‘day’ help me deleting those bulks faster? If yes – how much faster? How much will it slow down SELECTs? – not all entries are deleted, but only entries for some specific old day, e.g. ‘ WHERE day = ’1 week ago’

Range partitioning by date would give you the opportunity to ALTER TABLE…DROP PARTITION, so you could remove all data for a given date very quickly, much faster than deleting millions of rows. The performance of DROP PARTITION is like that of DROP TABLE, because each partition is physically stored like a separate table.

Searching for “where user_id = ?” would not be able to take advantage of partition pruning, but it would still be able to use an index on user_id. And if you drop old partitions, the benefit of searching a smaller table could be a good tradeoff.

Q: Regarding 20% selectivity as a threshold for the optimizer preferring a table-scan to an index lookup – is that a tunable?

No, it’s not tunable, it’s a fixed behavior of the query optimizer. If you search for a value and the optimizer estimates that > 20% of rows contain the value you search for, it will bypass the index and just do a table-scan.

For the same reason that the index of a book doesn’t contain very common words, because the list of pages that word appears on would be too long, and flipping back and forth from the back of the book to each listed page would actually be more work than just reading the book.

Also keep in mind my figure of 20% is approximate. Your results may vary. This is not a magic threshold in the source code, it’s just a tendency I have observed.

Q: Regarding generating synthetic test data, it sounds like a pretty easy perl script to write.

Yes, it might be easy to do that for one given table. But every table is different, and you might have hundreds of tables in dozens of applications to generate test data for. You might also want to vary the distribution of data values from one test to another.

Writing a test-data generator for one particular case is easy, so you might reasonably do it as a one-off task. Writing a general-purpose test-data generator that you can use for many cases is more work.

Q: Would love to have the set of URLs cited in the presentation without having to go back and mine them out of the presentation.

Open source message queues:

MySQL Performance Blog articles:

Open source test-data generator:

Load-testing tools for web applications:

Load-testing tools to replay query logs:

Further reading for implementing business rules:

Q: How to best use mysql query cache?

Any cache is best used if you read from it many times for each time you write to it. So we’d like to estimate the average ratio of query cache reads to writes, to estimate how much leverage it’s giving us.


Check the values for QCache_hits (which are cases when a query result was read from the query cache) over QCache_inserts (which are cases when the desired query result was not in the cache, and had to be run and then the result stored in the cache). I like to see a ratio of 1000% or more (i.e. 10:1 hits to inserts).

If you have a poor ratio, for example less than 1:1 or less than 100%, then consider disabling the query cache, because it may be costing more to maintain it than the performance benefit it’s giving you.

Keep in mind that this is only a guideline, because the calculation I described is only an average. It could be that the queries served by the query cache are very expensive, so using the cached result is a great benefit even if it accounts for a small number of hits. The only way to be certain is to load-test your application under your load, and compare overall performance results with the query cache enabled or disabled, and at different sizes.

Q: How to detect when too much indexes start to affect performance?

Some people are reluctant to create indexes because they have been warned that indexes require synchronous updates when you INSERT, UPDATE, or DELETE rows. Some people also make the generalization that indexes harm writes but benefit reads. Bot of these are not true.

Your DML operations aren’t really updating indexes in real time. InnoDB includes a feature called change buffering, which defers index updates. The change buffer is gradually merged into the index over time. That way, InnoDB can handle a big spike in traffic without it hurting throughput as much. You can monitor how much content in the change buffer remains to be merged:
mysql> SHOW GLOBAL STATUS LIKE 'Innodb_ibuf_size';

It’s also not accurate that indexes hurt writes. UPDATE and DELETE statements usually have a WHERE clause, to apply the changes to particular rows. These conditions use indexes to reduce the examined rows, just like in SELECT statements. But in UPDATE and DELETE statements, it’s even more important to use indexes, because otherwise the statement has to lock a lot of rows to ensure it locks the rows you’re changing.

So I generally say, don’t avoid indexes based only on the number of indexes you have, just make sure your indexes are being employed by the queries you run, and drop indexes that aren’t used. Here are a couple of past blog posts that show how to do this:

Thanks again for attending my webinar!  Here are some more tips:

  • Check out upcoming Percona Training classes in North America and Europe.
  • Join Percona and the MySQL community at our Percona Live.
  • Watch more webinars from Percona in the future!

The post Q&A: Even More Deadly Mistakes of MySQL Development appeared first on MySQL Performance Blog.

High Availability with mysqlnd_ms on Percona XtraDB Cluster

July 16, 2014 - 7:11am

This is the second part of my series on High Availability with mysqlnd_ms. In my first post, “Simple MySQL Master HA with mysqlnd_ms,” I showed a simple HA solution using asynchronous MySQL replication. This time we will see how to leverage an all-primary cluster where you can write to all nodes. In this post I used Percona XtraDB Cluster, but you should also be able to do the same with MySQL NDB Cluster or Tungsten Replicator.

To start with, here is the mysqlnd_ms configuration I used:mysqlnd_ms_mm.ini.  All of these files are available from my Github repository. Below, I have three Percona XtraDB Cluster nodes, all defined as masters and no slaves. I’ve configured a roundrobin filter where all connections will happen on the first node, in this case192.168.56.44 . In case the first node fails, the second node will be used and so forth until no more nodes are available. Another interesting configuration option here is the loop_before_master strategy whereby if connection or a statement to the current server fails, it will be retried silently on the remaining nodes before returning an error to the user, more on this below.

{ "primary": { "master": { "master_1": { "host": "", "port": "3306" }, "master_2": { "host": "", "port": "3306" }, "master_3": { "host": "", "port": "3306" } }, "slave": { }, "filters": { "roundrobin": [ ] }, "failover": { "strategy": "loop_before_master", "remember_failed": true } } }

Similar to my previous post, I also used a custom INI file for PHP to use, this time aptly namedmaster-master.ini :

mysqlnd_ms.enable = 1 mysqlnd_ms.disable_rw_split = 1 mysqlnd_ms.multi_master = 1 mysqlnd_ms.force_config_usage = 1 mysqlnd_ms.config_file = /home/revin/git/demo-me/phpugph201407/mysqlnd_ms_mm.ini

A new addition to this configuration ismysqlnd_ms.multi_master , when enabled it would allow you to use all nodes or just one and treat the others as passive. The PHP script I used this time is calledmaster-master.php , it is largely similar tomaster-slave-ng.phpwith a few differences:

  1. There is no need for /tmp/PRIMARY_HAS_FAILED  sentinel as all nodes were writable.
  2. There is no need for /*ms=master*/  SQL hint when validating a connection from connect_mysql function since all nodes acts as master.

So here is a quick test, first with roundrobin filter, after 4 INSERTs, I shutdown  which sends my connection to the next server in the configuration, . When I started back  again, the script resumed connections there. Pretty cool right?

[revin@forge phpugph201407]$ php -c master-master.ini master-master.php Last value 3564 from host via TCP/IP and thread id 19 Last value 3565 from host via TCP/IP and thread id 20 Last value 3566 from host via TCP/IP and thread id 21 Last value 3567 from host via TCP/IP and thread id 22 Warning: connect_mysql(): MySQL server has gone away in /home/revin/git/demo-me/phpugph201407/master-master.php on line 63 Warning: connect_mysql(): Error while reading greeting packet. PID=23464 in /home/revin/git/demo-me/phpugph201407/master-master.php on line 63 ERRROR: via TCP/IP [2006] MySQL server has gone away on line 30 Last value 0 from host and thread id 0 Last value 3568 from host via TCP/IP and thread id 1552 Last value 3569 from host via TCP/IP and thread id 1553 [...] Last value 3584 from host via TCP/IP and thread id 1568 Last value 3585 from host via TCP/IP and thread id 18

Here’s another test using the random filter which allows you to write to all nodes, on my mysqlnd_ms_mm.ini above, I just changed roundrobin  torandom . As you can see, all three nodes were being used, of course in random, at the same time you will also see when I shutdown  around where the connect_mysql  errors and then the server was used again near the bottom after a started it back up. Still pretty cool right?

[revin@forge phpugph201407]$ php -c master-master.ini master-master.php Last value 3590 from host via TCP/IP and thread id 2060 Last value 3591 from host via TCP/IP and thread id 1569 Last value 3592 from host via TCP/IP and thread id 1570 Warning: connect_mysql(): MySQL server has gone away in /home/revin/git/demo-me/phpugph201407/master-master.php on line 63 Warning: connect_mysql(): Error while reading greeting packet. PID=23919 in /home/revin/git/demo-me/phpugph201407/master-master.php on line 63 ERRROR: via TCP/IP [2006] MySQL server has gone away on line 30 Last value 0 from host and thread id 0 Last value 3593 from host via TCP/IP and thread id 2061 Last value 3594 from host via TCP/IP and thread id 2062 Last value 3595 from host via TCP/IP and thread id 2063 Last value 3596 from host via TCP/IP and thread id 2064 Last value 3597 from host via TCP/IP and thread id 1576 Last value 3598 from host via TCP/IP and thread id 1577 Last value 3599 from host via TCP/IP and thread id 1578 Last value 3600 from host via TCP/IP and thread id 1579 Last value 3601 from host via TCP/IP and thread id 2065 Last value 3602 from host via TCP/IP and thread id 1581 Last value 3603 from host via TCP/IP and thread id 1582 Last value 3604 from host via TCP/IP and thread id 2066 Last value 3605 from host via TCP/IP and thread id 19 Last value 3606 from host via TCP/IP and thread id 1583 Last value 3607 from host via TCP/IP and thread id 21

So here are some issues I’ve observed during these tests:

  1. remember_failed  during failover does not work as advertised. Supposedly, a failed node should not be used again for every connection request but in my test, this is not the case. See more from this bug. This means that if you have 2 out of 3 failed nodes in this scenario the overhead would be too big when testing both connections. Perhaps some sort of in memory shared TTL can be used to overcome this? I’m not sure.
  2. If you look closely around line 7 on my last output above the error displayed is kind of misleading. In particular it saysERRROR: via TCP/IP , whereby it was not  that failed, it was192.168.56.43 . This is because under the hood, immediately after failure the next node will be cycled to, this is especially true since we have loop_before_master configured. I sure do have a bug on the script that should capture the host_info  properly, but this is something to always keep in mind so you don’t keep scratching your head.

So we’ve seen these two forms of possibilities and they definitely have use cases and advantages. On the other hand because of the issues we have found so far (I’ve reported 4 bugs on the PHP bugs database during the course of these tests including one crashing), I recommend to make sure you test seriously before putting this on production.

The post High Availability with mysqlnd_ms on Percona XtraDB Cluster appeared first on MySQL Performance Blog.

TokuDB tips: MySQL backups

July 15, 2014 - 3:00am

In my recent post, “TokuDB gotchas: slow INFORMATION_SCHEMA TABLES,” I saw a couple questions and tweets asking if we use TokuDB in production. Actually I mentioned it in that post and we also blogged about it in a couple of other recent posts:

So, yes, we are using Percona Server + TokuDB as a main storage engine in Percona Cloud Tools to store timeseries data.

And, yes, Percona Server + TokuDB is available GA Percona Server 5.6.19-67.0 with TokuDB (GA).

Just having good performance is not enough to make it into production; there are also operational questions and one such question is about backups. I want to explain how we do backups for Percona Server + TokuDB in Percona Cloud Tools.

I should say up front, that we DO NOT have support for TokuDB in Percona XtraBackup. TokuDB internals are significantly different from InnoDB/XtraDB, so it will be a major project to add this to Percona XtraBackup and we do not have any plans at the moment to work on this.

It does not mean that TokuDB users do not have options for backups. There is Tokutek Hot back-up, included in the Tokutek Enterpise Subscription. And there is a method we use in Percona Cloud Tools: LVM Backups. We use mylvmbackup scripts for this task and it works fairly well for us.

There is however some gotchas to be aware. If you understand an LVM backups mechanic, this is basically a managed crash recovery process when you restore from a backup.

Now we need to go in a little detail for TokuDB. To support transactions that involve both TokuDB and InnoDB engines, TokuDB uses a two-phase commit mechanism in MySQL. When involved, the two-phase commit requires binary logs presented for a proper recovery procedures.

But now we need to take a look at how we setup a binary log in Percona Cloud Tools. We used SSD for the main data storage (LVM partition is here) and we use a Hardware RAID1 over two hard-drives for binary logs. We choose this setup as we care about SSD lifetime. In write-intensive workloads, binary logs will produce a lot of write operations and in our calculation we will just burn these SSDs, so we have to store them on something less expensive.

So the problem there is that when we take an LVM snapshot over main storage, we do not have a consistent view of binary logs (although it is possible to modify backup scripts to copy the current binary log under FLUSH TABLES WITH READ LOCK operation, this is probably what we will do next). But binary logs are needed for recovery, without them we face these kind of errors during restoring from backup:

2014-DD-MM 02:15:16 16414 [Note] Found 1 prepared transaction(s) in TokuDB 2014-DD-MM 02:15:16 16414 [ERROR] Found 1 prepared transactions! It means that mysqld was not shut down properly last time and critical recovery information (last binlog or tc.log file) was manually deleted after a crash. You have to start mysqld with --tc-heuristic-recover switch to commit or rollback pending transactions. 2014-DD-MM 02:15:16 16414 [ERROR] Aborting

The error message actually hints a way out. Unfortunately it seems that we are the first ones to have ever tried this option, as tc-heuristic-recover is totally broken in current MySQL and not supposed to work… and it would be noticed if someone really tried it before us (which gives me an impression that Oracle/MySQL never properly tested it, but that is a different story).

We will fix this in Percona Server soon.

So the way to handle a recovery from LVM backup without binary logs is to start mysqld with –tc-heuristic-recover switch (unfortunately I did not figure out yet, should it be COMMIT or ROLLBACK value, hehe).

The proper way to use LVM backup is to have a corresponding binary log file, like I said it will require a modification to mylvmbackup script.

I should say this is not the only way we do backups in Percona Cloud Tools. In this project we use Percona Backup Service provided by the Percona Managed Services team, and our team also uses mydumper to perform a logical backup of data.
While it works acceptably to backup hundreds of gigabytes worth of data (it is just a sequential scan, which should be easy for TokuDB), the full recovery is painful and takes unacceptably long. So mydumper backup (recovery) will be used if we ever need to perform a fine-grained recovery (i.e only small amount of specific tables).

So I hope this tip is useful if you are looking for info about how to do backups for TokuDB.

The post TokuDB tips: MySQL backups appeared first on MySQL Performance Blog.

Simple MySQL Master HA with mysqlnd_ms

July 14, 2014 - 7:51am

I had the pleasure of presenting to the PHP Users Group Philippines a few days ago about mysqlnd_ms. The mysqlnd plugin, MySQL Master Slave, is a transparent layer on top of mysqlnd extension. This allows you to do read-write splitting and slave reads load balancing without needing to change anything from your application. But do you know you can also achieve a form of high availability with this plugin? I shared 2 forms on my presentation, using async MySQL replication either in master-slave configuration or master-master configuration, while the second form is having an all primary cluster where you can write to all nodes.

This first part is to demonstrate how you can achieve a simple HA solution using the first form. First, all the sample code here can be found on my GitHub repository. So, to use the mysqlnd_ms plugin, it uses an additional external configuration file in JSON format. This configuration file, will define your master and slave nodes, failover properties and any filters (connection selection method) you want to dictate how the algorithm will provide you the connection.

Let’s start with the mysqlnd_ms configuration I used,mysqlnd_ms_ms.ini :

{ "primary": { "master": { "master_1": { "host": "", "port": "33001" } }, "slave": { } }, "standby": { "master": { "master_1": { "host": "", "port": "33002" } }, "slave": { } } }

Here, I have two applications defined, one called “primary” and another called “standby”, I have not defined any slaves for simplicity. The two MySQL instances running on port 33001 and 33002 are in master-master configuration.

mysqlnd_ms.enable = 1 mysqlnd_ms.disable_rw_split = 1 mysqlnd_ms.force_config_usage = 1 mysqlnd_ms.config_file = /home/revin/git/demo-me/phpugph201407/mysqlnd_ms_ms.ini

This is the custom INI file I used for the tests,master-slave.ini . The first line simply enables the plugin for use. The second line, mysqlnd_ms.disable_rw_split instructs the plugin that I should only send all queries to the master because I only have masters for this test.

As for the PHP script, the full copy can be found here, as it is a bit lengthy I will just explain the logic on what it does.

  1. To start the test, it bootstraps the test table via DROP and then CREATE queries.
  2. It then enters a for loop where it will execute an INSERT followed by a SELECT to validate the newly inserted row and additional information like the current active server id and the connection id.
  3. For every iteration of the loop, a new mysqli object is created to simulate non-persistent connections to the database server.
  4. To create the new connection, a call to the function connect_mysql  is made which returns a mysqli object when successful. An important thing to remember here is that mysqlnd_ms uses lazy connections by default, this means that when the mysqli object is created, it is not really connected yet to the server. One has to issue a statement like 'SELECT 1'  to start the connection manually or callmysqli::real_connect . Not even mysqli::ping  does not work without the former, I’ve opened this bug.
  5. After the mysqli object is returned, the INSERT statement will trigger mysqlnd_ms to actually establish the connection and then execute the statement. This is where the good part is, if the connection cannot be made, the query_write_mysql function will know and will re-request the connection from connect_mysql, this time within the connect_mysql function, connection to the primary will be retried at least 10 times if the type of error from the previous failure is something related to a connection like error numbers 2002  and2003 . If the connection cannot be established after 10 retries, the application creates a sentinel file as /tmp/PRIMARY_HAS_FAILED  and will retry the connection to the secondary (slave or passive-master).

Here is an example run, my primary has a server id or 101 while my standby is 102:

[revin@forge phpugph201407]$ php -c master-slave.ini master-slave-ng.php Last value 0001 from server id 101 thread id 7 Last value 0003 from server id 101 thread id 8 37: [2002] Connection refused Connection to host 'primary' failed: [0] Connection refused, retrying (1 of 10) in 3 seconds Connection to host 'primary' failed: [0] Connection refused, retrying (2 of 10) in 3 seconds Connection to host 'primary' failed: [0] Connection refused, retrying (3 of 10) in 3 seconds Connection to host 'primary' failed: [0] Connection refused, retrying (4 of 10) in 3 seconds Connection to host 'primary' failed: [0] Connection refused, retrying (5 of 10) in 3 seconds Connection to host 'primary' failed: [0] Connection refused, retrying (6 of 10) in 3 seconds Connection to host 'primary' failed: [0] Connection refused, retrying (7 of 10) in 3 seconds Connection to host 'primary' failed: [0] Connection refused, retrying (8 of 10) in 3 seconds Connection to host 'primary' failed: [0] Connection refused, retrying (9 of 10) in 3 seconds Connection to host 'primary' failed: [0] Connection refused, retrying (10 of 10) in 3 seconds The primary host 'primary' has failed after 30 seconds, failing over to standby! 52: [2002] Connection refused Last value 0004 from server id 102 thread id 635 Last value 0006 from server id 102 thread id 636 Last value 0008 from server id 102 thread id 637 [...]

This is not the perfect setup and there are a number of limitations, however it tells us that if you have a simple HA requirement like if you’re not running a very critical application but still do not want to be waken up at night but rather deal with issues in the morning, this might just fit. So here are some more notes:

    • If you have a master-slave configuration, you just basically shot your primary (master) in the foot during the failover. You may need to rebuild its data in the morning.
    • If instead you have master-master, you might just be able to bring the primary master back online, get it caught up in replication and then delete /tmp/PRIMARY_HAS_FAILED  file to switch your application back to it.
    • The use of /tmp/PRIMARY_HAS_FAILED  sentinel file is rudimentary, its not the only way. You should consider sending notifications to yourself when failover happens because this method requires human intervention to put back the primary master back in rotation.

The same effect can be achieved with a little more coding, but you can already take advantage of the plugin with less.

I’ve also tested the plugin on the second form where you can write to multiple masters using Percona XtraDB Cluster. I’ve found a few interesting issues there so stay tuned.

The post Simple MySQL Master HA with mysqlnd_ms appeared first on MySQL Performance Blog.

Managing shards of MySQL databases with MySQL Fabric

July 11, 2014 - 2:00am

This is the fourth post in our MySQL Fabric series. In case you’re joining us now, we started with an introductory post, and then discussed High Availability (HA) using MySQL Fabric here (Part 1) and here (Part 2). Today we will talk about how MySQL Fabric can help you scale out MySQL databases with sharding.


At the time of writing, MySQL Fabric includes support for range- and hash-based sharding. As with HA, the functionality is split between client, through a MySQL Fabric-aware connector; and server, through the mysqlfabric utility and the XML-RPC server we’ve talked about before.

In this post, we’ll go through the process of setting up a sharded table for use with MySQL Fabric, and then go through some usage examples, again using the Python connector.

In our next post, we’ll talk about shard management operations, and go into more detail about how we can combine the Sharding and HA features of MySQL Fabric.

The architecture

For our examples, we’ll be using a sharding branch from our vagrant-fabric repository. If you have been following previous posts and already have a local copy of the repo, you can get this one just by running the following command:

git checkout sharding

from the root of your copy. Bear in mind that the node names are the same in the Vagrantfile, so while in theory  just running vagrant provision should be enough, you may have to run vagrant destroy and vagrant up again, if you hit unexpected behavior.

The only difference between this branch and the original one is that you’ll have two mysqld instances per node: one on port 3306 and one on port 13306. This will let us achieve high availability for our shard groups. But don’t worry about that for now, it’s something we’ll discuss more in depth in our next post.

In today’s examples, we’ll be using the three group architecture described by this diagram:

The blue boxes represent shard-groups and the green box represent the global-group. The red arrows indicate the flow of replication and the violet arrows represent client connections.

Setting up sharding

The official documentation about sharding with MySQL Fabric can be found here. We’ll be using the same example employees database and shard the salaries table.

As we said, to keep things simple for the introduction, we’ll create all the groups but only add one instance to each one of them. In our next post, we’ll use two instances per group to evaluate how MySQL Fabric can make our shards highly available, and how it can rearrange replication topologies automatically after a failure.

To start, let’s create three groups:

[vagrant@store ~]$ mysqlfabric group create salaries-global Procedure : { uuid = 390aa6c0-acda-40e2-ad52-8c0869613635, finished = True, success = True, return = True, activities = } [vagrant@store ~]$ for i in 1 2; do mysqlfabric group create salaries-$i; done Procedure : { uuid = 274742a2-5e84-49b8-8446-5a8fc55f1899, finished = True, success = True, return = True, activities = } Procedure : { uuid = 408cfd6a-ff3a-493e-b39b-a3241d83fda6, finished = True, success = True, return = True, activities = }


The global group will be used to propagate schema changes and to store unpartitioned data. Think of configuration tables that don’t need to be sharded, for example.

The other two groups will host shards, that is, tables that will have the same structure across all the nodes, but not the same data (and that will be empty in the global group’s nodes).

Now, let’s add one instance to each group:

[vagrant@store ~]$ mysqlfabric group add salaries-global node1:3306 Procedure : { uuid = 0d0f657c-9304-4e3f-bf5b-a63a5e2e4390, finished = True, success = True, return = True, activities = } [vagrant@store ~]$ mysqlfabric group add salaries-1 node2:3306 Procedure : { uuid = b0ee9a52-49a2-416e-bdfd-eda9a384f308, finished = True, success = True, return = True, activities = } [vagrant@store ~]$ mysqlfabric group add salaries-2 node3:3306 Procedure : { uuid = ea5d8fc5-d4f9-48b1-b349-49520aa74e41, finished = True, success = True, return = True, activities = }

We also need to promote the groups. Even though each group has a single node, MySQL Fabric sets up that node as SECONDARY, which means it can’t take writes.

[vagrant@store ~]$ mysqlfabric group promote salaries-global Procedure : { uuid = 5e764b97-281a-49f0-b486-25088a96d96b, finished = True, success = True, return = True, activities = } [vagrant@store ~]$ for i in 1 2; do mysqlfabric group promote salaries-$i; done Procedure : { uuid = 7814e96f-71d7-4865-a278-cb6ed32a2d11, finished = True, success = True, return = True, activities = } Procedure : { uuid = cd30e9a9-b9ea-4b2d-a8ae-5e70f22363d6, finished = True, success = True, return = True, activities = }

Finally, we are ready to create a shard definition and associate ranges to groups:

[vagrant@store ~]$ mysqlfabric sharding create_definition RANGE salaries-global Procedure : { uuid = fffcbb5f-24c6-47a2-9348-f1d810c8ef2f, finished = True, success = True, return = 1, activities = } [vagrant@store ~]$ mysqlfabric sharding add_table 1 employees.salaries emp_no Procedure : { uuid = 8d0a3c51-d543-49a6-b47a-36a4ab499ab4, finished = True, success = True, return = True, activities = } [vagrant@store ~]$ mysqlfabric sharding add_shard 1 "salaries-1/1, salaries-2/25000" --state=ENABLED Procedure : { uuid = 2585a5ea-a097-44a4-89fa-a948298d0595, finished = True, success = True, return = True, activities =

The integer after each shard group is the lower bound for emp_no values found on that shard.

After the last command, the shard groups should be replicating off the global one. We can verify that this is the case by checking salaries-3:

[vagrant@node3 ~]$ mysql -uroot -e 'show slave statusG' *************************** 1. row *************************** Slave_IO_State: Waiting for master to send event Master_Host: node1 Master_User: fabric Master_Port: 3306 Connect_Retry: 60 Master_Log_File: mysql-bin.000001 Read_Master_Log_Pos: 151 Relay_Log_File: mysqld-relay-bin.000002 Relay_Log_Pos: 361 Relay_Master_Log_File: mysql-bin.000001 Slave_IO_Running: Yes Slave_SQL_Running: Yes Replicate_Do_DB: Replicate_Ignore_DB: Replicate_Do_Table: Replicate_Ignore_Table: Replicate_Wild_Do_Table: Replicate_Wild_Ignore_Table: Last_Errno: 0 Last_Error: Skip_Counter: 0 Exec_Master_Log_Pos: 151 Relay_Log_Space: 566 Until_Condition: None Until_Log_File: Until_Log_Pos: 0 Master_SSL_Allowed: No Master_SSL_CA_File: Master_SSL_CA_Path: Master_SSL_Cert: Master_SSL_Cipher: Master_SSL_Key: Seconds_Behind_Master: 0 Master_SSL_Verify_Server_Cert: No Last_IO_Errno: 0 Last_IO_Error: Last_SQL_Errno: 0 Last_SQL_Error: Replicate_Ignore_Server_Ids: Master_Server_Id: 870101 Master_UUID: e34ab4cd-00b9-11e4-8ced-0800274fb806 Master_Info_File: /var/lib/mysql/master.info SQL_Delay: 0 SQL_Remaining_Delay: NULL Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it Master_Retry_Count: 86400 Master_Bind: Last_IO_Error_Timestamp: Last_SQL_Error_Timestamp: Master_SSL_Crl: Master_SSL_Crlpath: Retrieved_Gtid_Set: Executed_Gtid_Set: Auto_Position: 1

Looks good. Let’s go ahead and create the database schema. To avoid being too verbose, we’re only including the create statement for the salaries table in this example. Notice we run this on the PRIMARY node for the global group:

[vagrant@node1 ~]$ mysql -uroot Welcome to the MySQL monitor. Commands end with ; or g. Your MySQL connection id is 8 Server version: 5.6.19-log MySQL Community Server (GPL) Copyright (c) 2000, 2014, Oracle and/or its affiliates. All rights reserved. Oracle is a registered trademark of Oracle Corporation and/or its affiliates. Other names may be trademarks of their respective owners. Type 'help;' or 'h' for help. Type 'c' to clear the current input statement. mysql> CREATE DATABASE IF NOT EXISTS employees; Query OK, 1 row affected (0.01 sec) mysql> USE employees; Database changed mysql> CREATE TABLE salaries ( -> emp_no INT NOT NULL, -> salary INT NOT NULL, -> from_date DATE NOT NULL, -> to_date DATE NOT NULL, -> KEY (emp_no)); Query OK, 0 rows affected (0.06 sec)

And again, check that it made it to the shard groups:

[vagrant@node2 ~]$ mysql -uroot -e 'show databases' +--------------------+ | Database | +--------------------+ | information_schema | | employees | | mysql | | performance_schema | +--------------------+

Good. We’re now ready to use the Python connector and load some data into this table. We’ll be using the following script:

import mysql.connector from mysql.connector import fabric from mysql.connector import errors import time import random import datetime config = { 'fabric': { 'host': 'store', 'port': 8080, 'username': 'admin', 'password': 'admin', 'report_errors': True }, 'user': 'fabric', 'password': 'f4bric', 'database': 'employees', 'autocommit': 'true' } from_min = datetime.datetime(1980,1,1,00,00,00) to_max = datetime.datetime(2014,1,1,00,00,00) fcnx = None print "starting loop" while 1: if fcnx == None: print "connecting" fcnx = mysql.connector.connect(**config) fcnx.reset_cache() try: print "will run query" emp_no = random.randint(1,50000) salary = random.randint(1,200000) from_date = from_min + datetime.timedelta(seconds=random.randint(0, int((to_max - from_min).total_seconds()))) to_date = from_min + datetime.timedelta(seconds=random.randint(0, int((to_max - from_min).total_seconds()))) fcnx.set_property(tables=["employees.salaries"], key=emp_no, mode=fabric.MODE_READWRITE) cur = fcnx.cursor() cur.execute("insert into employees.salaries (emp_no,salary,from_date,to_date) values (%s, %s, %s, %s)",(emp_no,salary,from_date,to_date)) print "inserted", emp_no, ", will now sleep 1 second" time.sleep(1) except (errors.DatabaseError, errors.InterfaceError): print "sleeping 1 second and reconnecting" time.sleep(1) del fcnx fcnx = None

This is similar to the script we used in our HA post. It inserts rows with random data in an endless loop. The sleep on every iteration is there just to make it easier to cancel the script, and to keep row insert rate under control.

If you leave this running for a while, you should then be able to check the global server and individual shards, and confirm they have different data:

[vagrant@store ~]$ for i in 1 2 3; do mysql -ufabric -pf4bric -hnode$i -e "select count(emp_no),max(emp_no) from employees.salaries"; done Warning: Using a password on the command line interface can be insecure. +---------------+-------------+ | count(emp_no) | max(emp_no) | +---------------+-------------+ | 0 | NULL | +---------------+-------------+ Warning: Using a password on the command line interface can be insecure. +---------------+-------------+ | count(emp_no) | max(emp_no) | +---------------+-------------+ | 36 | 24982 | +---------------+-------------+ Warning: Using a password on the command line interface can be insecure. +---------------+-------------+ | count(emp_no) | max(emp_no) | +---------------+-------------+ | 43 | 49423 | +---------------+-------------+

As you can see, the global group’s server has no data for this table, and each shard’s server has data within the defined boundaries.

Querying data is done similarly (though with a READ_ONLY connection), and we can also lookup the group a row belongs to using the mysqlfabric utility directly:

[vagrant@store ~]$ mysqlfabric sharding lookup_servers employees.salaries 2045 Command : { success = True return = [['ecab7dd2-00b9-11e4-8cee-0800274fb806', 'node2:3306', True]] activities = } [vagrant@store ~]$ mysqlfabric sharding lookup_servers employees.salaries 142045 Command : { success = True return = [['f8a90096-00b9-11e4-8cee-0800274fb806', 'node3:3306', True]] activities = }

Bear in mind that this lookups only use the fabric store, which means they can tell you on which servers a given row may be, but can’t confirm if the row exists or not. You need to actually query the given servers for that. If you use the connector, both steps are done for you when you issue the query.

The following code snippets illustrate the point:

>>> fcnx = mysql.connector.connect(**config) >>> emp_no = random.randint(1,50000) >>> fcnx.set_property(tables=["employees.salaries"], key=emp_no, mode=fabric.MODE_READONLY) >>> >>> cur = fcnx.cursor() >>> cur.execute("select count(*) as cnt from employees.salaries where emp_no = %s", (emp_no,)) >>> >>> for row in cur: ... print row ... (0,) >>> fcnx.set_property(tables=["employees.salaries"], key=20734, mode=fabric.MODE_READONLY) >>> cur = fcnx.cursor() >>> cur.execute("select count(*) as cnt from employees.salaries where emp_no = 20734") >>> for row in cur: ... print row ... ... (1,) >>>

In our examples, we connected directly to the PRIMARY node of the global group in order to execute DDL statements, but the same can be done requesting a global connection to MySQL Fabric, like so:

fcnx.set_property(group="salaries-global",scope=fabric.SCOPE_GLOBAL,mode=fabric.MODE_READWRITE) cur = fcnx.cursor() cur.execute("USE employees") cur.execute("CREATE TABLE test (id int not null primary key)")

We can see that the table gets replicated as expected:

[vagrant@node3 ~]$ mysql -uroot -e 'show tables from employees' +---------------------+ | Tables_in_employees | +---------------------+ | salaries | | test | +---------------------+

Note that we’re explicitly indicating we want to connect to the global group here. When establishing a MySQL Fabric connection, we need to specify either a group name or a key and table pair (as in the insert example).


Today we’ve presented the basics of how MySQL Fabric can help you scale out by sharding, but we’ve intentionally left a few things out of the picture to keep this example simple.

In our next post, we’ll see how we can combine MySQL Fabric’s HA and sharding features, what support we have for shard operations and how HASH sharding works in MySQL Fabric.

The post Managing shards of MySQL databases with MySQL Fabric appeared first on MySQL Performance Blog.


Contact Us 24 Hours A Day
Support Contact us 24×7
Emergency? Contact us for help now!
Sales North America (888) 316-9775 or
(208) 473-2904
+44-208-133-0309 (UK)
0-800-051-8984 (UK Toll Free)
0-800-181-0665 (GER Toll Free)
More Numbers
Training (855) 55TRAIN or
(925) 271-5054


Share This