
Percona Live 2016: Running MongoRocks in Production

April 19, 2016 - 11:38pm

It’s been a long first day at Percona Live 2016, filled with awesome insights and exciting topics. I was able to get to one more lecture before calling it quits. For the final talk I saw today, I listened to Igor Canadi, Software Engineer at Facebook, discuss Running MongoRocks in Production.

Facebook has been running MongoDB 3.0 with the RocksDB storage engine (MongoRocks) at Parse since March of last year (2015). In this talk, they wanted to share some lessons learned from running MongoRocks in production. Igor provided some interesting war stories and talked about performance optimization, along with a bit about RocksDB internals and which counters are most important to watch.

RocksDB compares favorably to both the MMAP and WiredTiger storage engines when it comes to large write workloads.

The audience came away from the talk ready to get their feet wet with MongoRocks.

Below is a quick chat I had with Igor about RocksDB and MongoDB:


See the rest of the Percona Live schedule here.


Percona Live 2016: High Availability Using MySQL Group Replication

April 19, 2016 - 11:16pm

Percona Live 2016 had a great first day, with an impressive number of speakers and topics. I was able to attend a session in the afternoon with Luis Soares, Principal Software Engineer at Oracle, on High Availability Using MySQL Group Replication. MySQL Group Replication is a MySQL plugin under development that brings together group communication techniques and database replication, providing both high availability (HA) and a multi-master update everywhere replication solution.

At MySQL Group Replication’s core is a set of group communication primitives that act as the building blocks for creating reliable, consistent and dependable messaging between the servers in the group. This allows the set of MySQL servers to coordinate themselves and act as a consistently replicated state machine. As a consequence, the group itself is fault-tolerant, and so is the service it provides (i.e., the MySQL database service). The plugin also provides multi-master update-anywhere characteristics with automatic conflict detection and handling.

In this discussion, we learned about the technical details of the MySQL Group Replication plugin and discussed how it fits into the overall picture of MySQL HA: for instance, how it can be deployed together with MySQL Router to automate load balancing and failover procedures. We also discovered the newest enhancements and how to leverage them when deploying and experimenting with this plugin.

Listen to a brief chat I had with Luis on MySQL Group Replication:

Check out the rest of the conference schedule here.

Percona Live 2016: A quick chat with Bill Nye, the Science Guy!

April 19, 2016 - 5:18pm

Percona Live is humming along, and we had quite a whirlwind keynote session this morning. Bill Nye the Science Guy gave an amazing talk, Bill Nye’s Objective – Change the World, on how the process of science and critical thinking can help us not only be positive about the challenges we face in our world today, but also help us to come up with the big ideas we need to solve them. He discussed many topics, from how his parents met, their involvement in science (his mother worked on the Enigma Code in World War 2!), working at Boeing as an engineer, his involvement with Carl Sagan, and how he has worked to help harness renewable energy through solar panels, a solar water heater, and skylights at his own home in Studio City.

Bill Nye is also the CEO of The Planetary Society. The Planetary Society, founded in 1980 by Carl Sagan, Bruce Murray, and Louis Friedman, works to inspire and involve the world’s public in space exploration through advocacy, projects, and education. Today, The Planetary Society is the largest and most influential public space organization group on Earth.

After the talks, I was able to quickly catch Bill Nye and ask him a few questions.



Percona Live 2016: Let Robots Manage your Schema (without destroying all humans)!

April 19, 2016 - 4:55pm

We’re rapidly moving through day one of the Percona Live Data Performance Conference, and I’m surrounded by incredibly smart people all discussing amazing database techniques and technology. The depth of the solutions represented here, and the technical know-how needed to pull them off, is astounding!

This afternoon I was able to catch Jenni Snyder, MySQL DBA at Yelp, deliver her talk on Let Robots Manage your Schema (without destroying all humans). While vaguely frightening, it was a fascinating talk on how automating schema changes helped Yelp’s development.

You’re probably already using automation to build your application, manage configuration, and alert you in case of emergencies. Jenni asks what’s keeping you from doing the same with your MySQL schema changes? For Yelp, the answer was “lots of things”. Today, Yelp uses Liquibase to manage their schema changes, pt-online-schema-change to execute them, and Jenkins to ensure that they’re run in all of their environments. During this session, she explained the history of MySQL schema management at Yelp, and how hard it was for both developers and DBAs.

Below is a video of her summarizing her team’s efforts and outcomes.

Check out the Percona Live schedule to see what is coming up next!


Percona Live 2016: Performance of Percona Server for MySQL on Intel Server Systems using HDD, SATA SSD, and NVMe SSD as Different Storage Mediums

April 19, 2016 - 2:33pm

We’re moving along on the first day at Percona Live 2016, and I was able to attend a lecture from Ken LeTourneau, Solutions Architect at Intel, on Performance of Percona Server for MySQL on Intel Server Systems using HDD, SATA SSD, and NVMe SSD as Different Storage Mediums. In this talk, Ken reviewed benchmark testing he did using MySQL on various types of storage mediums. The talk looked at the performance of Percona Server for MySQL on Linux running on the same Intel system, but with three different storage configurations. We looked at and compared the performance of:

  1. a RAID of HDD,
  2. a RAID of SATA SSD, and
  3. a RAID of NVMe SSD

In the talk, Ken covered the hardware and system configuration, and then discussed the results of TPC-C and TPC-H benchmarks, as well as the overall system costs (including hardware and software) and the cost per transaction/query based on overall costs and benchmark results.

I got a chance to speak with Ken after his talk; check it out below!


Day One of the Percona Live Data Performance Conference 2016 is off and running!

April 19, 2016 - 1:35pm

Today was day one of the Percona Live Data Performance Conference! The day began with some excellent keynote speakers and exciting topics, and the packed room was eager to hear what our speakers had to say!

Peter Zaitsev, CEO, Percona
Percona Opening Keynote
Peter kicked it off today by thanking the sponsors, the speakers, the Percona Live committee, and the attendees for contributing and participating in this year’s event. It has grown and changed quite a bit since its initial creation. Peter emphasized how this is a gathering of members of a community, one that changes and adapts, and discusses and debates many different points of view and opinions. No longer is it just a conference about MySQL; it now includes MongoDB, Cassandra, and many other solutions and products that are all part of the open source community. The purpose of the conference is to provide open and diverse opinions, quality content, a technical focus, and useful and practical ideas and solutions.

Chad Jones, Chief Strategy Officer, Deep Information
Transcending database tuning problems: How machine learning helps DBAs play more ping pong

Next up was Chad Jones, discussing how, just as machine learning enables businesses to gain competitive advantage through predictive analytics, looking deeper into the data stack reveals the need for the same predictive capabilities in MySQL tuning. With over 10^13 possible tuning permutations, some requiring reboots or a rebuild, DBAs spend way too much time on MySQL tuning for a point-in-time situation that changes constantly. He demonstrated how unsupervised machine learning based on resource, workload and information modeling could predictively and continuously tune databases. DBAs can transcend the tuning game, saving precious time to work on important things, like improving your mad ping pong skills!

Bill Nye, The Planetary Society, CEO
Bill Nye’s Objective – Change the World
Finally this morning, we were treated to an outstanding lecture from world-renowned scientist and media personality Bill Nye the Science Guy. Bill spent his time discussing his life, how he came to love science, and the ability it brings to understand the world. His experiences as an engineer at Boeing helped him appreciate the value of investing time and money in excellent design strategies and processes. Through the power of critical thinking and science, we can embrace optimism in a world that has many tough challenges. Bill Nye fights to raise awareness of the value of science, critical thinking, and reason. He hopes that the data he brings will help inspire people everywhere to change the world!

Those were the morning lectures today! Such a great set of speakers, I can’t wait for tomorrow! Check out our schedule here.


Percona Server for MongoDB 3.2.4-1.0rc2 is now available

April 19, 2016 - 1:20pm

Percona is pleased to announce the release of Percona Server for MongoDB 3.2.4-1.0rc2 on April 19, 2016. Download the latest version from the Percona web site or the Percona Software Repositories.

Percona Server for MongoDB is an enhanced, open source, fully compatible, highly scalable, zero-maintenance downtime database supporting the MongoDB v3.0 protocol and drivers. This release candidate is based on MongoDB 3.2.4. It extends MongoDB with the MongoRocks and PerconaFT storage engines, and adds features like external authentication and audit logging. Percona Server for MongoDB requires no changes to MongoDB applications or code.

NOTE: The MongoRocks storage engine is still under development. There is currently no officially released version of MongoRocks that can be recommended for production. Percona Server for MongoDB 3.2.4-1.0rc2 includes MongoRocks 3.2.4, which is based on RocksDB 4.4.

This release includes all changes from MongoDB 3.2.4, and there are no additional improvements or new features on top of those upstream fixes.

NOTE: As of version 3.2, MongoDB uses WiredTiger as the default storage engine, instead of MMAPv1.

Percona Server for MongoDB 3.2.4-1.0rc2 release notes are available in the official documentation.


Percona Monitoring and Management

April 18, 2016 - 2:09pm

Percona is excited to announce the launch of Percona Monitoring and Management Beta!

Percona Monitoring and Management (PMM) is a fully open source solution for both managing MySQL platform performance and tuning query performance. It allows DBAs and application developers to optimize the performance of the Database Layer. PMM is an on-premises solution that keeps all of your performance and query data inside the confines of your environment, with no requirement for any data to cross the internet.

Assembled from a supported package of “best of breed” open source tools such as Prometheus, Grafana and Percona’s Query Analytics, PMM delivers results right out of the box.

With PMM, anyone with database maintenance responsibilities can get more visibility for actionable enhancements, realize faster issue resolution times, increase performance through focused optimization, and better manage resources. More information allows you to concentrate efforts on the areas that yield the highest value, rather than hunting and pecking for speed.

PMM monitors and provides performance data for Oracle’s MySQL Community and Enterprise Editions as well as Percona Server for MySQL and MariaDB.

Download Percona Monitoring and Management now.


(Screenshots: CPU and Load; Top 10 Queries; QAN Create Table; QAN per-query metrics; QAN table indexes)

How Percona XtraDB Cluster certification works

April 17, 2016 - 4:40pm
In this blog, we’ll discuss how Percona XtraDB Cluster certification works. Percona XtraDB Cluster replicates actions executed on one node to all other nodes in the cluster, and makes this fast enough to appear as if it is synchronous (aka virtually synchronous). Let’s look at everything involved in the process (without losing data integrity). There are two main types of actions: DDL and DML. DDL actions are executed using Total Order Isolation (let’s ignore Rolling Schema Upgrade for now), and DML using the normal Galera replication protocol. This blog assumes the reader is aware of Total Order Isolation and the MySQL replication protocol.
  • DML (Insert/Update/Delete) operations effectively change the state of the database, and all such operations are recorded in XtraDB by registering a unique object identifier (aka key) for each change (an update or a new addition). Let’s understand this key concept in a bit more detail.
    • A transaction can change “n” different data objects. Each such object change is recorded in XtraDB using a so-called append_key operation. The append_key operation registers the key of the data object that the transaction has changed. The key for rows can be represented in three parts: db_name, table_name, and pk_columns_for_table (if the pk is absent, a hash of the complete row is calculated). In short, this is quick, compact meta information about which rows the transaction has touched/modified. This information is passed on as part of the write-set, for certification, to all the nodes of the cluster while the transaction is in the commit phase.
    • For a transaction to commit, it has to pass XtraDB-Galera certification, ensuring that it doesn’t conflict with any other changes posted on the cluster group/channel. Certification adds the keys modified by the given transaction to its own central certification vector (CCV), represented by cert_index_ng. If a key is already part of the vector, conflict resolution checks are triggered.
    • Conflict resolution traces the reference transaction (the one that last modified this item in the cluster group). If the reference transaction is from some other node, that means the same data was modified by the other node, and that node’s changes have already been certified by the local node that is executing the check. In such cases, the transaction that arrived later fails to certify.
  • Changes made to DB objects are bin-logged. This is the same as how MySQL does it for replication with its Master-Slave eco-system, except that a packet of changes from a given transaction is created and named as a write-set.
  • Once the client/user issues a “COMMIT”, XtraDB Cluster will run a commit hook. Commit hooks ensure the following:
    • Flush the binlogs.
    • Check if the transaction needs replication (not needed for read-only transactions like SELECT).
    • If the transaction needs replication, it invokes a pre_commit hook in the Galera eco-system. During this pre-commit hook, the write-set is written to the group channel by a “replicate” operation. All nodes (including the one that executed the transaction) subscribe to this group-channel and read the write-set.
    • gcs_recv_thread is the first to receive the packet, which is then processed through different action handlers.
    • Each packet read from the group-channel is assigned an “id”, a counter maintained locally by each node in sync with the group. When a new node joins the group/cluster, its seed-id is initialized to the current active id of the group/cluster. (There is an inherent assumption/protocol enforcement that all nodes read the packets from the channel in the same order; that way, even though a packet doesn’t carry “id” information, the id is inherently established using the locally maintained “id” value.)
      /* Common situation -
       * increment and assign act_id only for totally ordered actions
       * and only in PRIM (skip messages while in state exchange)
       */
      rcvd->id = ++group->act_id_;

      [This is an amazing way to solve the problem of id coordination in a multi-master system; otherwise, a node would have to first get an id from a central system, or through a separate agreed protocol, and then use it for the packet, thereby doubling the round-trip time.]
  • What happens if two nodes get ready with their packets at the same time?
    • Both nodes will be allowed to put the packet on the channel. That means the channel will see packets from different nodes queued one behind another.
    • It is interesting to understand what happens if two nodes modify the same set of rows. Let’s take an example:

     create -> insert (1,2,3,4)   .... nodes are in sync till this point.
     node-1: update i = i + 10;
     node-2: update i = i + 100;

     Let's associate a transaction-id (trx-id) with the update transaction executed on node-1 and node-2 in parallel. (The real algorithm is a bit more involved (with uuid + seqno), but conceptually the same, so for ease I am using trx_id.)

     node-1: update action: trx-id=n1x
     node-2: update action: trx-id=n2x

     Both node packets are added to the channel, but the transactions conflict. Let's see which one succeeds. The protocol says: FIRST WRITE WINS. So in this case, whoever is first to write to the channel will get certified. Let's say node-2 is first to write the packet, and node-1 writes immediately after it. (NOTE: each node subscribes to all packets, including its own. See below for details.)

     Node-2:
     - Will see its own packet and will process it.
     - Then it will see the node-1 packet, which it tries to certify, but this fails. (We will talk about the certification protocol in a little while.)

     Node-1:
     - Will see the node-2 packet and will process it. (Note: InnoDB provides isolation, so node-1 can process node-2's packet independent of node-1's transaction changes.)
     - Then it will see its own node-1 packet, which it tries to certify, but this fails. (Note: even though the packet originated from node-1, it still undergoes certification, to catch cases like these. This is the beauty of listening to your own events: the processing path stays consistent even for locally generated events.)

  • Now let’s talk about the certification protocol, using the example cited above. As discussed above, the central certification vector (CCV) is updated to reflect the reference transaction.

Node-2:
- node-2 sees its own packet for certification, adds it to its local CCV and performs the certification checks. Once these checks pass, it updates the reference transaction, setting it to "n2x".
- node-2 then gets the node-1 packet for certification. The key is already present in the CCV with the reference transaction set to "n2x", whereas the write-set proposes setting it to "n1x". This is a conflict, which in turn causes the node-1-originated transaction to fail the certification test, and the node-1 packet is rejected.

Node-1:
- node-1 sees the node-2 packet for certification, which is then processed: the local CCV is updated and the reference transaction is set to "n2x".
- For the same reason explained above, node-1's certification also rejects the node-1 packet.

This suggests that a node doesn't need to wait for certification to complete, but just needs to ensure that the packet is written to the channel. The applier transaction will always win, and the local conflicting transaction will be rolled back.

  • What happens if one of the nodes has local changes that are not synced with the group?

create (id primary key) -> insert (1), (2), (3), (4);
node-1: wsrep_on=0; insert (5); wsrep_on=1
node-2: insert (5);

node-2's insert(5) generates a write-set that is then replicated to node-1. node-1 tries to apply it but fails with a duplicate-key error, as 5 already exists. XtraDB flags this as an error, which eventually causes node-1 to shut down.

  • With all that in place, how is the GTID incremented if all the packets are processed by all nodes (including ones that are rejected due to certification failure)? The GTID is incremented only when the transaction passes certification and is ready to commit. That way, errant packets don’t cause the GTID to increment. Also, don’t confuse the group packet “id” quoted above with the GTID. Without errant packets, you may see these two counters going hand-in-hand, but they are in no way related.
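
The certification flow described above can be condensed into a toy model. The following is a simplified sketch in Python, not Galera code: the class and method names (Certifier, certify) are mine, keys are plain (db_name, table_name, pk) tuples, and the uuid + seqno pair is collapsed into a single channel counter.

```python
class Certifier:
    """Toy model of the central certification vector (CCV)."""

    def __init__(self):
        self.ccv = {}    # key -> channel id of the last certified trx touching it
        self.seqno = 0   # totally ordered channel id, assigned to every packet

    def certify(self, write_set, depends_on):
        """Certify one write-set.

        write_set:  keys (db_name, table_name, pk) the transaction modified
        depends_on: last channel id the transaction had seen when it committed
        Returns the assigned channel id on success, or None on conflict.
        """
        self.seqno += 1  # every packet consumes a channel id, even if rejected
        for key in write_set:
            last = self.ccv.get(key)
            if last is not None and last > depends_on:
                return None  # another transaction wrote this key first: reject
        for key in write_set:
            self.ccv[key] = self.seqno
        return self.seqno
```

Replaying the node-1/node-2 example: after the shared insert, node-2's update reaches the channel first and certifies; node-1's update then conflicts on the same key and is rejected, identically on every node, because all nodes process packets in channel order.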

The final three Database Performance Team characters are . . .

April 16, 2016 - 8:48am

The last email introduced two members of the Percona Database Performance Team: The Maven and The Specialist. Now we’re ready to reveal the identity of the final three team members.

The Database Performance Team is made up of our services experts, who work tirelessly every day to guarantee the performance of your database. Percona’s support team is made up of superheroes that make sure your database is running at peak performance.

The third member is technical, possesses clairvoyant foresight, with the knowledge and statistics to account for all issues, and manages problems before they happen. Who is this champion?

The Clairvoyant
Percona Technical Account Manager
“Problems solved before they appear.”

The Clairvoyant predicts the future of technology and operations to head off major issues. Saves you before you even know there is a problem. With The Clairvoyant working with you, you know you’re going to be just fine.




The fourth member is remotely all-seeing, a director, good at multi-tasking, adapts-on-the-fly, and is cool in a crisis. Who is this champion?

The Maestro
Percona Remote DBA

“Just like that, optimized database.”

The Maestro is the head of the operation, a leader, a single-person think tank that controls all from their home base of operations. A cyborg, half-man half-machine. With the Maestro controlling your database, all your worries are through.




The fifth member is insanely strong, can’t be stopped, is hard to knock down, and the product of rigorous testing with unlimited endurance. Who is this champion?


The Powerhouse
Percona Software
“Performance Starts Here!”

Percona’s suite of MySQL and MongoDB software and toolkits are a powerhouse of performance, the backbone of the organization – they show unparalleled strength and dependability, with endurance to boot. As a product of the open source community, our software has been tested by fire and proven resilient.



Your Mission

Follow @Percona on Twitter, use the hashtags “#DatabasePerformanceTeam” and “#PerconaLive”, and join us at the Percona Live Data Performance Conference April 18-21 for chances to win Database Performance Team member T-shirts! Collect them all! Stay tuned, as we will have more fun games for the Database Performance Team over the coming weeks!

Percona Live Update!

We know! We get it! It’s hard to plan with everything going on, and now you have to register for Percona Live at the last minute! Well, for once it pays off! The Percona Live Data Performance Conference will be April 18-21 at the Hyatt Regency Santa Clara & The Santa Clara Convention Center.

We have a special procrastination discount for everyone who waited until almost too late! Use discount code “procrastination” to get 15% off your full conference pass.

If you can only make it to the keynotes, BOFs, or Community Networking Reception, use discount code “KEY” for a $5 Expo Only pass.

And if you cannot make it this year, watch our blog following the conference and we’ll announce when and where the keynote recordings and breakout session slide decks can be found.

Let’s Get Social

Join the ranks of the Database Street Team! Fun games, cool prizes – more info is coming soon!

Connect with us on Twitter, Facebook and LinkedIn for the latest news from Percona. And we invite you to report bugs, make suggestions or just get caught up on all of Percona’s projects on GitHub.

Webinar Q & A for Introduction to Troubleshooting Performance: What Affects Query Execution?

April 16, 2016 - 8:29am

In this blog, I will provide answers to the Webinar Q & A for Introduction to Troubleshooting Performance: What Affects Query Execution?

First, I want to thank you for attending the April 7 webinar. This webinar is the third in the “MySQL Troubleshooting” series and the last introductory webinar in the series. The recording and slides for the webinar are available here. Here is the list of questions that I wasn’t able to answer during the webinar, with responses:

Q: If we had some MyISAM tables, could we use clustering in MariaDB?

A: Do you mean Galera Cluster? You can use it, but keep in mind that MyISAM support in Galera is still experimental and not recommended for production use. You need to enable the wsrep_replicate_myisam variable to turn on MyISAM support.

Q: Is there a filesystem type that is better than another?

A: If you’re speaking about modern versus old file systems, modern is usually better. For example, ext3 or NTFS is certainly better than FAT32, which does not support files larger than 4GB. But that type of file system can be very helpful when you need to store data on a flash drive that can be read by any computer or mobile device. The same advice applies to ext2.

So the better answer to this question is: it depends on your purpose. You can start from these Percona Live slides by Ammon Sutherland, who describes the differences between file systems, their options, and how they can (or cannot) improve MySQL performance. Then check this blog post by Dimitri Kravtchuk, and this one by Yves Trudeau.

Certainly NFS is not a good choice, because it does not give MySQL storage engines (particularly InnoDB) a reliable answer as to whether data was really flushed to disk.

Q: Can I download this training somewhere on your website?

A: This webinar and all future webinars are available at the same place you registered. In this case, links to slides and recorded webinar are available here.

Q: What are good system level tools that are helpful?

A: In the webinar I presented a minimal set of tools that collect the essential information necessary for troubleshooting MySQL performance issues. Usually, you can start with them, then consult Brendan D. Gregg’s great picture. Regarding which tools we like at Percona: a favorite is certainly perf.

Q: I am planning to use Percona Toolkit to optimize MySQL query performance. When I tried to install the Percona binaries, there was a conflict with the MySQL binaries. Could you please help with how to install the Percona binaries, and what the prerequisites are?

A: A common reason for such conflicts is the client or shared libraries. You just need to replace them with Percona’s equivalent.

Q: How do you increase buffers for a high load case?

A: Buffers are linked to MySQL system variables. To increase them, try SET variable_name=NEW_VALUE first, then test in the same session. If you are not happy with the result, increase the global buffer: SET GLOBAL variable_name=NEW_VALUE, then test the performance of the whole database. If you are still not happy with the result, adjust the variable value in your configuration file. Of course, this session-first approach only works for variables that have session scope; for variables with only global scope, you have to test the global change directly. Sometimes you cannot change a variable value online. In this case, be especially careful: test first, be ready to roll back changes, choose a less busy time, etc.

Q: How do you handle deadlocks?

Q: Which are the best practices to fix performance caused by InnoDB deadlocks?

A: These two questions are about practically the same thing. Deadlocks are not 100% avoidable in InnoDB, therefore the best solution is to code the application in such a way that it can simply re-run transactions that were rolled back. But if you see deadlocks too often, this is not always possible. In such cases, you need to investigate which rows are locked by each transaction, find out why this pattern repeats, and fix your transactions or tables (sometimes, if you search without indexes, a query can lock more rows than needed to resolve the query; adding an appropriate index can solve the locking issue).
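
The "simply re-run transactions" advice can be sketched as a retry wrapper. This is pure Python pseudo-driver code for the pattern, not connector-specific: DeadlockError stands in for your connector's deadlock exception (MySQL error 1213), and txn is a callable that runs the whole transaction.

```python
class DeadlockError(Exception):
    """Stand-in for a driver-level deadlock error (MySQL error 1213)."""

def run_with_retries(txn, max_retries=3):
    """Re-run a transaction callable if it is rolled back by a deadlock."""
    for attempt in range(1, max_retries + 1):
        try:
            return txn()  # txn() should run BEGIN ... COMMIT as a whole
        except DeadlockError:
            if attempt == max_retries:
                raise     # persistent deadlocks need investigation, not retries
            # optionally sleep with a short randomized back-off before retrying
```

The key point is that the whole transaction body is re-run, since InnoDB rolled the first attempt back.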

Q: Is it important in MySQL 5.7 to separate logs and data onto different disks or partitions?

A: By separating data and logs onto different disk partitions, you gain performance because you can write more data in parallel, and it is also more robust: if the data disk breaks, you will still have the log files untouched and can restore data from them. But this only applies when the partitions are on different physical disks.

Q: When are we going to see something like Datastax OpsCenter for MariaDB, with nice performance monitoring and support for easy partitioning?

A: There are many monitoring and administration tools for MySQL and MariaDB, which include VividCortex, SolarWinds, Webyog, MySQL Enterprise Monitor, MySQL Workbench, etc. Please specify in comments which feature in Datastax OpsCenter you miss in these products. I can probably answer if there is an alternative. I don’t know about any plans for cloning Datastax OpsCenter for MariaDB or MySQL.

Q: If you are storing queries in stored procedures, and you make changes to those SP’s, how long will it take for them to be cached? The next time they are run, or after x amount of times?

A: What do you mean by the queries being cached? If you mean the MySQL Query Cache: the call of the SP will be in the cache, and the result will be reloaded the next time the procedure is called. If you mean the data retrieved by these queries: whether or not it is stored in the InnoDB buffer pool, the behavior is the same, and the next time you call the SP, the new data will be in the buffer pool.

Q: I have a very big table (up to 9GB data), and it is a very heavy read/write table. We actually store all our chat messages in that table: every record in the table is a row. What would be the best way to get out of this situation, NoSQL or partition? Will Percona be helpful for me in this situation?

A: This depends on your table definition and how you use it. It is hard to answer having only this description of your issue. Regarding Percona help, this looks like a case for our Support or Consulting.

Q: I want to archive data on a daily basis. I use INSERT INTO table_archive SELECT * FROM table. This takes about 45 minutes for 7,000,000 rows. Is that slow?

A: As I mentioned at the beginning of the webinar, there is no “yes” or “no” answer. It depends on the speed of your disk and how large the table is (retrieving 7,000,000 rows that contain only two integer columns would certainly be faster than retrieving 7,000,000 rows, each of which has maybe 42 columns). But what I can certainly say is that most likely this dataset does not fit into your memory, and this query requires creating disk-based temporary tables. This query most likely sets too many locks, and can slow down other queries on the “table”. If all this concerns you, consider copying data in chunks: for example, INSERT INTO table_archive SELECT * FROM table WHERE Primary_Key BETWEEN start AND end. You can use the pt-online-schema-change utility as a guide.
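
The chunking suggestion can be sketched like this: pure Python that only computes the primary-key ranges, with the SQL shown in a comment for illustration (the table and column names are from the question above).

```python
def chunk_ranges(min_pk, max_pk, chunk_size):
    """Yield inclusive (start, end) primary-key ranges of at most chunk_size values."""
    start = min_pk
    while start <= max_pk:
        end = min(start + chunk_size - 1, max_pk)
        yield (start, end)
        start = end + 1

# Each range then becomes one small copy, e.g.:
#   INSERT INTO table_archive SELECT * FROM `table`
#     WHERE Primary_Key BETWEEN {start} AND {end};
```

Smaller chunks keep the lock footprint and temporary-table usage of each individual statement bounded.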


Creating Geo-Enabled applications with MongoDB, GeoJSON and MySQL

April 15, 2016 - 11:58am

This blog post will discuss creating geo-enabled applications with MongoDB, GeoJSON and MySQL.


Recently I published a blog post about the new GIS features in MySQL 5.7. Today I’ve looked into how to use MongoDB (I’ve tested with 3.0 and 3.2, with 3.2 being much faster) for the same purpose. I will also talk about GIS in MySQL and MongoDB at Percona Live next week (together with my colleague Michael Benshoof).

MongoDB and GIS

MongoDB has a very useful feature called “geoNear.” There are other MongoDB spatial functions available to calculate the distance on a sphere (like the Earth), i.e. $nearSphere, $centerSphere, $near – but all of them have restrictions. The most important one is that they do not support sharding. The geoNear command in MongoDB, on the other hand, supports sharding. I will use geoNear in this post.

For this test, I exported Open Street Map data from MySQL to MongoDB (see the “Creating GEO-enabled applications with MySQL 5.6” post for more details on how to load this data to MySQL).

  1. Export the data to JSON. In MySQL 5.7, we can use JSON_OBJECT to generate the JSON file:
    mysql> SELECT JSON_OBJECT('name', replace(name, '"', ''),
        ->                    'other_tags', replace(other_tags, '"', ''),
        ->                    'geometry', st_asgeojson(shape)) as j
        -> FROM `points`
        -> INTO OUTFILE '/var/lib/mysql-files/points.json';
    Query OK, 13660667 rows affected (4 min 1.35 sec)
  2. Use mongoimport  to import JSON into MongoDB (I’m using 24 threads, -j 24, to use parallel import):
    mongoimport --db osm --collection points -j 24 --file /var/lib/mysql-files/points.json
    2016-04-11T22:38:10.029+0000    connected to: localhost
    2016-04-11T22:38:13.026+0000    [........................] osm.points 31.8 MB/2.2 GB (1.4%)
    2016-04-11T22:38:16.026+0000    [........................] osm.points 31.8 MB/2.2 GB (1.4%)
    2016-04-11T22:38:19.026+0000    [........................] osm.points 31.8 MB/2.2 GB (1.4%)
    …
    2016-04-11T23:12:13.447+0000    [########################] osm.points 2.2 GB/2.2 GB (100.0%)
    2016-04-11T23:12:15.614+0000    imported 13660667 documents
  3. Create a 2d index:
    mongo
    > use osm
    switched to db osm
    > db.points.createIndex({ geometry : "2dsphere" })
    {
        "createdCollectionAutomatically" : false,
        "numIndexesBefore" : 1,
        "numIndexesAfter" : 2,
        "ok" : 1
    }

Another option would be the osm2mongo Ruby script, which converts the OSM file and loads it directly into MongoDB.
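For reference, each exported line is a standalone JSON document of the shape mongoimport expects; a small sketch of building the same document in Python (the field layout is approximated from the JSON_OBJECT call above):

```python
import json

def osm_point_to_doc(name, other_tags, lon, lat):
    """Build one import line: name/other_tags plus a GeoJSON Point that a
    2dsphere index can later cover. Quote stripping mirrors the SQL replace()."""
    return json.dumps({
        "name": name.replace('"', ''),
        "other_tags": other_tags.replace('"', ''),
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
    })
```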

Now I can use the geoNear command to find all the restaurants near my location:

> db.runCommand({
...     geoNear: "points",
...     near: { type: "Point", coordinates: [ -78.9064543, 35.9975194 ] },
...     spherical: true,
...     query: { name: { $exists: true, $ne: null },
...              "other_tags": { $in: [ /.*amenity=>restaurant.*/, /.*amenity=>cafe.*/ ] } },
...     "limit": 5,
...     "maxDistance": 10000
... })
{
    "results" : [
        {
            "dis" : 127.30183814835166,
            "obj" : {
                "_id" : ObjectId("570329164f45f7f0d66f8f13"),
                "name" : "Pop's",
                "geometry" : { "type" : "Point", "coordinates" : [ -78.9071795, 35.998501 ] },
                "other_tags" : "addr:city=>Durham,addr:country=>US,addr:housenumber=>605,addr:street=>West Main Street,amenity=>restaurant,building=>yes"
            }
        },
        {
            "dis" : 240.82201047521244,
            "obj" : {
                "_id" : ObjectId("570329df4f45f7f0d68c16cb"),
                "name" : "toast",
                "geometry" : { "type" : "Point", "coordinates" : [ -78.9039761, 35.9967069 ] },
                "other_tags" : "addr:full=>345 West Main Street, Durham, NC 27701, US,amenity=>restaurant,website=>"
            }
        },
        ...
}

MongoDB 3.0 vs 3.2 with geoNear

MongoDB 3.2 features Geospatial Optimization:

MongoDB 3.2 introduces version 3 of 2dsphere indexes, which index GeoJSON geometries at a finer gradation. The new version improves performance of 2dsphere index queries over smaller regions. In addition, for both 2d indexes and 2dsphere indexes, the performance of geoNear queries has been improved for dense datasets.

I’ve tested the performance of the above geoNear query with MongoDB 3.0 and MongoDB 3.2 (both the old and new versions of 2dsphere index). All the results statistics are for a "limit": 5 and "maxDistance": 10000.

MongoDB 3.0, index version 2:

> db.points.getIndexes()
...
    {
        "v" : 1,
        "key" : { "geometry" : "2dsphere" },
        "name" : "geometry_2dsphere",
        "ns" : "osm.points",
        "2dsphereIndexVersion" : 2
    }
]
"stats" : {
    "nscanned" : 1728,
    "objectsLoaded" : 1139,
    "avgDistance" : 235.76379903759667,
    "maxDistance" : 280.2681226202938,
    "time" : 12
},

MongoDB 3.2, index version 2:

> db.points.getIndexes()
[
...
    {
        "v" : 1,
        "key" : { "geometry" : "2dsphere" },
        "name" : "geometry_2dsphere",
        "ns" : "osm.points",
        "2dsphereIndexVersion" : 2
    }
]
...
"stats" : {
    "nscanned" : 513,
    "objectsLoaded" : 535,
    "avgDistance" : 235.76379903759667,
    "maxDistance" : 280.2681226202938,
    "time" : 5
},

What is interesting here is that even with the "2dsphereIndexVersion" : 2, MongoDB 3.2 performs much faster and scans a much smaller number of documents.

MongoDB 3.2, index version 3:

> db.points.dropIndex("geometry_2dsphere")
{ "nIndexesWas" : 2, "ok" : 1 }
> db.points.createIndex({ geometry : "2dsphere" })
{
    "createdCollectionAutomatically" : false,
    "numIndexesBefore" : 1,
    "numIndexesAfter" : 2,
    "ok" : 1
}
> db.points.getIndexes()
[
    { "v" : 1, "key" : { "_id" : 1 }, "name" : "_id_", "ns" : "osm.points" },
    { "v" : 1, "key" : { "geometry" : "2dsphere" }, "name" : "geometry_2dsphere",
      "ns" : "osm.points", "2dsphereIndexVersion" : 3 }
]
"stats" : {
    "nscanned" : 144,
    "objectsLoaded" : 101,
    "avgDistance" : 235.76379903759667,
    "maxDistance" : 280.2681226202938,
    "time" : 1
},

That is significantly faster, 1ms for five results!

MySQL and GeoJSON revisited

To compare with the performance of the above query, I've created a similar query in MySQL. First of all, we need to use the good old bounding rectangle (envelope) trick to include only the points within a 10-mile radius (or so). If we don't, MySQL will not be able to use the spatial (RTREE) index. I've created the following function to generate the envelope:

DELIMITER //
CREATE DEFINER = current_user() FUNCTION create_envelope(lat decimal(20, 14), lon decimal(20, 14), dist int)
RETURNS geometry DETERMINISTIC
begin
  declare point_text varchar(255);
  declare l varchar(255);
  declare p geometry;
  declare env geometry;
  declare rlon1 double;
  declare rlon2 double;
  declare rlat1 double;
  declare rlat2 double;
  set point_text = concat('POINT(', lon, ' ', lat, ')');
  set p = ST_GeomFromText(point_text, 1);
  set rlon1 = lon - dist/abs(cos(radians(lat))*69);
  set rlon2 = lon + dist/abs(cos(radians(lat))*69);
  set rlat1 = lat - (dist/69);
  set rlat2 = lat + (dist/69);
  set l = concat('LineString(', rlon1, ' ', rlat1, ',', rlon2, ' ', rlat2, ')');
  set env = ST_Envelope(ST_GeomFromText(l, 1));
  return env;
end //
DELIMITER ;

mysql> set @lat = 35.9974043;
Query OK, 0 rows affected (0.00 sec)

mysql> set @lon = -78.9045615;
Query OK, 0 rows affected (0.00 sec)

mysql> select st_astext(create_envelope(@lat, @lon, 10));
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| st_astext(create_envelope(@lat, @lon, 10))                                                                                                                                 |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| POLYGON((-79.08369589058249 35.852476764,-78.72542710941751 35.852476764,-78.72542710941751 36.142331836,-79.08369589058249 36.142331836,-79.08369589058249 35.852476764)) |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
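The envelope math is easier to see outside of SQL; this sketch reproduces the same arithmetic (roughly 69 miles per degree of latitude, with longitude degrees shrunk by cos(lat)):

```python
from math import cos, radians

MILES_PER_DEGREE = 69  # approximate miles per degree of latitude

def envelope(lat, lon, dist_miles):
    """Bounding box (min_lon, min_lat, max_lon, max_lat) around a point,
    mirroring the create_envelope() SQL function above."""
    dlat = dist_miles / MILES_PER_DEGREE
    dlon = dist_miles / abs(cos(radians(lat)) * MILES_PER_DEGREE)
    return (lon - dlon, lat - dlat, lon + dlon, lat + dlat)
```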

Then we can use the following query (an update of the GeoJSON query from my previous post):

set @lat = 35.9974043;
set @lon = -78.9045615;
set @p = ST_GeomFromText(concat('POINT(', @lon, ' ', @lat, ')'), 1);
set group_concat_max_len = 100000000;
SELECT CONCAT('{ "type": "FeatureCollection", "features": [ ',
       GROUP_CONCAT('{ "type": "Feature", "geometry": ', ST_AsGeoJSON(shape),
                    ', "properties": {"distance":', st_distance_sphere(shape, @p),
                    ', "name":"', name, '"} }'
                    order by st_distance_sphere(shape, @p)),
       '] }') as j
FROM points_new
WHERE st_within(shape, create_envelope(@lat, @lon, 10))
  and (other_tags like '%"amenity"=>"cafe"%' or other_tags like '%"amenity"=>"restaurant"%')
  and name is not null
  and st_distance_sphere(shape, @p) < 1000;
...
1 row in set (0.04 sec)

MySQL is slower here: 40ms, compared to 1ms–12ms in MongoDB. (The test box is an AWS EC2 t2.medium instance.)

To recap the difference between MongoDB geoNear and MySQL st_distance_sphere:

  • MongoDB geoNear uses 2dsphere index, so it is fast; however, it can’t just calculate the distance between two arbitrary points
  • MySQL st_distance_sphere is a helper function and will only calculate the distance between two points; it will not use an index – we will have to use the create_envelope function to restrict the search so MySQL will use an index

Time-wise, this is not an apples to apples comparison as the query is quite different and uses a different technique.
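For completeness, the distance that st_distance_sphere computes between two arbitrary points can be sketched with the haversine formula; 6,370,986 m is the sphere radius MySQL documents as its default, but treat the exact constant as an assumption here:

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6370986  # MySQL's documented default sphere radius

def distance_sphere(lon1, lat1, lon2, lat2):
    """Great-circle (haversine) distance in meters between two (lon, lat) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    a = (sin((phi2 - phi1) / 2) ** 2
         + cos(phi1) * cos(phi2) * sin(radians(lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))
```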

Visualizing the results

Results for GeoJSON for Google Maps API:

{ "type": "FeatureCollection", "features": [ { "type": "Feature", "geometry": {"type": "Point", "coordinates": [-78.9036457, 35.997125]}, "properties": {"distance":87.67869122893659, "name":"Pizzeria Toro"} },{ "type": "Feature", "geometry": {"type": "Point", "coordinates": [-78.9039761, 35.9967069]}, "properties": {"distance":93.80064086575564, "name":"toast"} },{ "type": "Feature", "geometry": {"type": "Point", "coordinates": [-78.9031929, 35.9962871]}, "properties": {"distance":174.8300018385443, "name":"Dame's Chicken and Waffles"} }, ... }

Now we can add those on a map:

Back to MongoDB: pluses and minuses

MongoDB uses Google’s S2 library to perform GIS calculations. The geoNear command is fast and easy to use for finding points of interests near you (which is the most common operation). However, full GIS support does not natively exist.

Another issue I came across when creating a 2dsphere index: MongoDB is very strict when checking the lines and polygons. For example:

> db.lines.createIndex({ geometry : "2dsphere" })
{
    "createdCollectionAutomatically" : false,
    "numIndexesBefore" : 1,
    "errmsg" : "exception: Can't extract geo keys: { _id: ObjectId('570308864f45f7f0d6dfbed2'),
        name: \"75 North\", geometry: { type: \"LineString\",
        coordinates: [ [ -85.808852, 41.245582 ], [ -85.808852, 41.245582 ] ] },
        other_tags: \"tiger:cfcc=>A41,tiger:county=>Kosciusko, IN,tiger:name_base=>75,tiger:name_direction_suffix=>N,tiger:reviewed=>no\" }
        GeoJSON LineString must have at least 2 vertices: [ [ -85.808852, 41.245582 ], [ -85.808852, 41.245582 ] ]",
    "code" : 16755,
    "ok" : 0
}

MongoDB complains about this: type: “LineString”, coordinates: [ [ -85.808852, 41.245582 ], [ -85.808852, 41.245582 ] ]

This is a “bad” line string, as the starting point and ending point are the same. I had to remove the bad data from my MongoDB imported dataset, which was tricky in itself. MongoDB (as opposed to MySQL) does not have a native way to compare values inside a JSON document, so I had to use the $where construct – which is slow and acquires a global lock:

> db.lines.remove({
...     "geometry.type": "LineString",
...     "geometry.coordinates": { $size: 2 },
...     $where: "this.geometry.coordinates[0][0] == this.geometry.coordinates[1][0] && this.geometry.coordinates[0][1] == this.geometry.coordinates[1][1]"
... })
WriteResult({ "nRemoved" : 22 })

After that, I was able to add the 2dsphere index.
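Since the only offending records here were two-vertex LineStrings with identical endpoints, an alternative is to filter them out of the JSON before mongoimport instead of using $where afterwards; a minimal sketch (the function is mine, not a MongoDB API):

```python
def is_degenerate_linestring(doc):
    """True for a GeoJSON LineString whose two vertices coincide: the
    shape that the 2dsphere index builder rejected above."""
    geometry = doc.get("geometry", {})
    if geometry.get("type") != "LineString":
        return False
    coords = geometry.get("coordinates", [])
    return len(coords) == 2 and coords[0] == coords[1]
```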


MongoDB looks good, and is fast and easy for geo-proximity search queries – until you step outside that one function and need full GIS support (which does not natively exist). It may be trickier to implement other GIS functions like st_contains or st_within.

Update: as pointed out, MongoDB actually supports $geoWithin and $geoIntersects GIS functions.

Update 2: I was asked about MySQL and GeoJSON: why not use the ST_MakeEnvelope function? One of the issues with ST_MakeEnvelope is that it only works with SRID 0 (it requires point geometry arguments with an SRID of 0), and the OSM data is stored with SRID 1. I also need to “add” 10 miles to my point, and the only way to do that is to calculate a new point 10 miles away from my location, which requires a custom function to manipulate the lat/lon pair.

The explain plan for the MySQL GeoJSON query shows that MySQL uses SHAPE (Spatial) index:

mysql> explain SELECT CONCAT('{ "type": "FeatureCollection", "features": [ ',
    ->        GROUP_CONCAT('{ "type": "Feature", "geometry": ', ST_AsGeoJSON(shape),
    ->                     ', "properties": {"distance":', st_distance_sphere(shape, @p),
    ->                     ', "name":"', name, '"} }' order by st_distance_sphere(shape, @p)),
    ->        '] }') as j
    -> FROM points_new
    -> WHERE st_within(shape, create_envelope(@lat, @lon, 10))
    ->   and (other_tags like '%"amenity"=>"cafe"%' or other_tags like '%"amenity"=>"restaurant"%')
    ->   and name is not null
    ->   and st_distance_sphere(shape, @p) < 1000\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: points_new
   partitions: NULL
         type: range
possible_keys: SHAPE
          key: SHAPE
      key_len: 34
          ref: NULL
         rows: 665
     filtered: 18.89
        Extra: Using where
1 row in set, 1 warning (0.00 sec)

And if I remove “st_within(shape, create_envelope(@lat, @lon, 10))” from the query it will show the full table scan:

mysql> explain SELECT CONCAT('{ "type": "FeatureCollection", "features": [ ',
    ->        GROUP_CONCAT('{ "type": "Feature", "geometry": ', ST_AsGeoJSON(shape),
    ->                     ', "properties": {"distance":', st_distance_sphere(shape, @p),
    ->                     ', "name":"', name, '"} }' order by st_distance_sphere(shape, @p)),
    ->        '] }') as j
    -> FROM points_new
    -> WHERE (other_tags like '%"amenity"=>"cafe"%' or other_tags like '%"amenity"=>"restaurant"%')
    ->   and name is not null
    ->   and st_distance_sphere(shape, @p) < 1000\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: points_new
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 11368798
     filtered: 18.89
        Extra: Using where
1 row in set, 1 warning (0.00 sec)


MySQL Document Store Developments

April 15, 2016 - 10:04am

This blog will discuss some recent developments with MySQL document store.

Starting with MySQL 5.7.12, MySQL can be used as a real document store. This is great news!

In this blog post, I am going to look into the history of making MySQL work better for “NoSQL” workloads, and into the details of what the MySQL document store offers at this point.

First, the idea of using reliable and high-performance MySQL storage engines for storing or accessing non-relational data through SQL is not new.

Previous Efforts

MyCached (Memcache protocol support for MySQL) was published back in 2009. In 2010 we got the HandlerSocket plugin, providing better performance and a more powerful interface. 2011 introduced both MySQL Cluster (NDB) support for the MemcacheD protocol and MemcacheD access to InnoDB tables as part of MySQL 5.6.

Those efforts were good, but they looked backward rather than forward. They provided a basic (though high-performance) key-value interface, but many developers needed both the flexibility of unstructured data and the richness inherent in structured data (as seen in document store engines like MongoDB).

When the MySQL team understood these needs, MySQL 5.7 (the next GA release after 5.6) shipped with excellent features like JSON document support, allowing you to mix structured and unstructured data in the same application. This support includes indexes on JSON fields as well as an easy way to reference fields “inside” the document from applications.

MariaDB 5.3 attempted to support JSON functionality with dynamic columns. More JSON functions were added in MariaDB 10.1, but both these implementations were not as well done or as integrated as in MySQL 5.7 – they have a rushed feel to them. The plan is for MariaDB 10.2 to catch up with MySQL 5.7.  

JSON in SQL databases is still a work in progress, and there is no official standard yet. As of right now different DBMSs implement it differently, and we’ve yet to see how a standard MySQL implementation will look.

MySQL as a Document Store

Just as we thought we would have to wait for MySQL 5.8 for further “NoSQL” improvements, the MySQL team surprised us by releasing MySQL 5.7.12 with a new “X Plugin.” This plugin allows us to use MySQL as a document store and avoid using SQL when a different protocol would be a better fit.

Time will tell whether the stability and performance of this very new plugin are any good – but it’s definitely a step in the right direction!

Unlike Microsoft DocumentDB, the MySQL team chose not to support the MongoDB protocol at this time. Their protocol, however, looks substantially inspired by MongoDB and other document store databases. There are benefits and drawbacks to this approach. On the plus side, going with your own syntax and protocol allows you to support a wealth of built-in MySQL functions and transactions that are not part of the MongoDB protocol. On the other hand, it also means you can’t just point your MongoDB application at MySQL and have it work.

In reality, protocol level compatibility at this level usually ends up working only for relatively simple applications. Complex applications often end up relying on not-well-documented side effects or specific performance properties, requiring some application changes anyway.

The great thing about MySQL document store is that it supports transactions from the session start. This is important for users who want to use document-based API, but don’t want to give up the safety of data consistency and ACID transactions.

The new MySQL 5.7 shell provides a convenient command line interface for working with document objects, and supports scripting with SQL, JavaScript and Python.

The overall upshot of this effort is that developers familiar with MySQL, who also need document store functionality, will be able to continue using MySQL instead of adding MongoDB (or some other document store database) to the mix in their environments.

Make no mistake though: this is an early effort in the MySQL ecosystem! MongoDB and other companies have had a head start of years! Their APIs are richer (in places), supported by more products and frameworks, better documented and more understood by the community in general,  and are generally more mature.

The big question is when will the MySQL team be able to focus their efforts on making document-based APIs a “first-class citizen” in the MySQL ecosystem? As an example, they need to ensure stable drivers exist for a wide variety of languages (currently, the choice is pretty limited).

It would also be great to see MySQL go further by taking on other areas that drive the adoption of NoSQL systems – such as the ease with which they achieve high availability and scale. MySQL’s replication and manual sharding were great in the early 2000s, but they are well behind modern ease-of-use and dynamic scalability requirements.

Want to learn more about this exciting new development in MySQL 5.7? Join us at Percona Live! Jan Kneschke, Alfredo Kojima, and Mike Frank will provide an overview of the MySQL document store as well as share internal implementation details.

Orchestrator-agent: How to recover a MySQL database

April 13, 2016 - 5:42pm

In our previous post, we showed how Orchestrator can handle complex replication topologies. Today we will discuss how Orchestrator-agent complements Orchestrator by monitoring our servers and providing snapshot and recovery abilities if there are problems.

Please be aware that the following scripts and settings in this post are not production ready (missing error handling, etc.) –  this post is just a proof of concept.

What is Orchestrator-agent?

Orchestrator-agent is a sub-project of Orchestrator. It is a service that runs on the MySQL servers, and it gives us the seeding/deploying capability.

In this context, “seeding” means copying MySQL data files from a donor server to the target machine. Afterward, MySQL can start on the target machine and use the new data files.

Functionalities (list from Github):
  • Detection of the MySQL service, starting and stopping (start/stop/status commands provided via configuration)
  • Detection of MySQL port, data directory (assumes configuration is /etc/my.cnf)
  • Calculation of disk usage on data directory mount point
  • Tailing the error log file
  • Discovery (the mere existence of the orchestrator-agent service on a host may suggest the existence or need of existence of a MySQL service)
  • Detection of LVM snapshots on MySQL host (snapshots that are MySQL specific)
  • Creation of new snapshots
  • Mounting/umounting of LVM snapshots
  • Detection of DC-local and DC-agnostic snapshots available for a given cluster
  • Transmitting/receiving seed data

The following image shows us an overview of a specific host (click on an image to see a larger version):

How does it work?

The Orchestrator-agent runs on the MySQL server as a service and connects to Orchestrator through an HTTP API. Orchestrator-agent is controlled by Orchestrator. It is based on LVM and LVM snapshots: without them, it cannot work.

The agent relies on external scripts/commands to, for example:

  • Detect where in the local and remote DCs it can find an appropriate snapshot
  • Find said snapshot on server, mount it
  • Stop MySQL on target host, clear data on MySQL data directory
  • Initiate send/receive process
  • Cleanup data after copy

If these external commands are configured, a snapshot can be created through the Orchestrator web interface. The agent gets the task through the HTTP API and calls an external script, which creates a consistent snapshot.

Orchestrator-agent configuration settings

There are many configuration options, some of which we’ll list here:

  • SnapshotMountPoint – Where should the agent mount the snapshot.
  • AgentsServer  – The address of the agent server, for example: "" .
  • CreateSnapshotCommand  – Creating a consistent snapshot.
  • AvailableLocalSnapshotHostsCommand  – Shows us the available snapshots on localhost.
  • AvailableSnapshotHostsCommand  – Shows us the available snapshots on remote hosts.
  • SnapshotVolumesFilter  – Free text which identifies MySQL data snapshots.
  • ReceiveSeedDataCommand  – Command that receives the data.
  • SendSeedDataCommand  – Command that sends the data.
  • PostCopyCommand  – Command to be executed after the seed is complete.
Example external scripts

As we mentioned before, these scripts are not production ready.

"CreateSnapshotCommand": "/usr/local/orchestrator-agent/",

#!/bin/bash
donorName='MySQL'
snapName='my-snapshot'
lvName=`lvdisplay | grep "LV Path" | awk '{print $3}' | grep $donorName`
size='500M'
dt=$(date '+%d_%m_%Y_%H_%M_%S')
# Note: the read lock is released as soon as this mysql client exits,
# so it is not actually held during lvcreate; one more reason this
# script is a proof of concept only.
mysql -e "STOP SLAVE; FLUSH TABLES WITH READ LOCK;"
lvcreate --size $size --snapshot --name orc-$snapName-$dt $lvName
mysql -e "UNLOCK TABLES; START SLAVE;"

This small script creates a consistent snapshot that the agent can use later.

"AvailableLocalSnapshotHostsCommand": "lvdisplay | grep "LV Path" | awk '{print $3}'|grep my-snapshot",

We can filter the available snapshots based on the “SnapshotVolumesFilter” string.

"AvailableSnapshotHostsCommand": "echo rep4",

You can define a command that shows where the available snapshots in your topology are, or you can use a dedicated slave. In our test, we simply used a dedicated server.

"SnapshotVolumesFilter": "-my-snapshot",

“-my-snapshot” is the filter here.

"ReceiveSeedDataCommand": "/usr/local/orchestrator-agent/",

#!/bin/bash
directory=$1
SeedTransferPort=$2
echo "delete $directory"
rm -rf $directory/*
cd $directory/
echo "Start nc on port $SeedTransferPort"
/bin/nc -l -q -1 -p $SeedTransferPort | tar xz
rm -f $directory/auto.cnf
echo "run chown on $directory"
chown -R mysql:mysql $directory

The agent passes two parameters to the script, then it calls the script like this:

/usr/local/orchestrator-agent/ /var/lib/mysql/ 21234

The script cleans the target directory (you cannot seed while mysqld is running; first stop it through the web interface or the command line), listens on the specified port, and waits for the compressed input. It then removes “auto.cnf” so that MySQL generates a new server UUID at start time. Finally, it makes sure every file has the right owner.

"SendSeedDataCommand": "/usr/local/orchestrator-agent/",

#!/bin/bash
directory=$1
targetHostname=$2
SeedTransferPort=$3
cd $directory
echo "start nc"
/bin/tar -czf - -C $directory . | /bin/nc $targetHostname $SeedTransferPort

The agent passes three parameters to the script:

/usr/local/orchestrator-agent/ /tmp/MySQLSnapshot rep5 21234

The first parameter is the mount point of the snapshot, the second is the destination host, and the third is the port number. The script simply compresses the data and sends it through “nc”.
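The tar-over-nc pipeline boils down to streaming a tar archive across a TCP connection. A rough Python equivalent of the two scripts, covering only the transfer step (no MySQL or LVM handling), might look like this:

```python
import socket
import tarfile

def send_dir(directory, host, port):
    """Roughly: tar -czf - -C $directory . | nc $host $port"""
    with socket.create_connection((host, port)) as conn:
        with conn.makefile("wb") as stream:
            with tarfile.open(fileobj=stream, mode="w|gz") as tar:
                tar.add(directory, arcname=".")

def receive_dir(directory, port, ready=None):
    """Roughly: nc -l -p $port | tar xz (run in the target directory)"""
    with socket.socket() as server:
        server.bind(("127.0.0.1", port))
        server.listen(1)
        if ready is not None:
            ready.set()  # signal a caller that we are listening
        conn, _ = server.accept()
        with conn, conn.makefile("rb") as stream:
            with tarfile.open(fileobj=stream, mode="r|gz") as tar:
                tar.extractall(directory)
```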

Job details

A detailed log is available for every seed. These logs can be really helpful in diagnosing any problems during the seed.

Why do we need Orchestrator-agent?

If you have a larger MySQL topology where you frequently have to provide new servers, or if you have a dev/staging replica set where you want to easily go back to a previous production stage, Orchestrator-agent can be really helpful and save you a lot of time. Finally, you’ll have the time for other fun and awesome stuff!

Features requests

Orchestrator-agent does its job, but adding a few extra abilities could make it even better:

  • Adding XtraBackup support.
  • Adding mysqlpump/mysqldump/mydumper support.
  • Implementing some kind of scheduling.
  • Batch seeding (seeding more than one server in a single job).

Shlomi did a great job again, just like with Orchestrator.

Orchestrator and Orchestrator-agent together give us a useful platform to manage our topology and deploy MySQL data files to the new servers, or re-sync old ones.

Evaluating Database Compression Methods: Update

April 13, 2016 - 1:12pm

This blog post is an update to our last post discussing database compression methods, and how they stack up against each other. 

When Vadim and I wrote about Evaluating Database Compression Methods last month, we claimed that evaluating database compression algorithms was easy these days because there are ready-to-use benchmark suites such as lzbench.

As easy as it was to do an evaluation with this tool, it turned out it was also easy to make a mistake. Due to a bug in the benchmark we got incorrect results for the LZ4 compression algorithm, and as such made some incorrect claims and observations in the original article. A big thank you to Yann Collet for reporting the issue!

In this post, we will restate and correct the important observations and recommendations that were incorrect in the last post. You can view the fully updated results in this document.

As you can see above, there was little change in compression performance. LZ4 is still the fastest, though not as fast after correcting the issue.

The compression ratio is where our results changed substantially. We originally reported LZ4 achieving a compression ratio of only 1.89 — by far the lowest among the compression engines we compared. In fact, after our correction, the ratio is 3.89 — better than Snappy and on par with QuickLZ (while also having much better performance).

LZ4 is a superior engine in terms of the compression ratio achieved versus the CPU spent.

The compression versus decompression graph now shows LZ4 has the highest ratio between compression and decompression performance of the compression engines we looked at.

The compression speed was not significantly affected by the LZ4 block size, which makes it great for compressing both large and small objects. The highest compression speed was achieved with a block size of 64KB (not the largest, but not the smallest, of the sizes tested).

We saw some positive impact on the compression ratio from increasing the block size. However, increasing the block size over 64K did not substantially improve the compression ratio, making 64K an excellent block size for LZ4: it had the best compression speed and about as-good-as-it-gets compression. A 64K block size worked well for other data too, though we can’t say how universal that is.
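The block-size effect is easy to reproduce with any block-based compressor. A small sketch using zlib (standing in for LZ4, which is not in the Python standard library):

```python
import zlib

def block_compression_ratio(data, block_size, level=1):
    """Compress data in independent blocks, as block-based storage engines
    do, and return the overall compression ratio (original/compressed)."""
    compressed = 0
    for offset in range(0, len(data), block_size):
        compressed += len(zlib.compress(data[offset:offset + block_size], level))
    return len(data) / compressed
```

Larger blocks give the compressor more context per block and less per-block overhead, so the ratio typically rises with block size and then flattens out.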

Updated Recommendations

Most of our recommendations still stand after reviewing the updated results, with one important change. If you’re looking for a fast compression algorithm that has decent compression, consider LZ4.  It offers better performance as well as a better compression ratio, at least on the data sets we tested.


Percona Live featured talk with Ying Qiang Zhang — What’s new in AliSQL: Alibaba’s branch of MySQL

April 12, 2016 - 1:04pm

Welcome to the next Percona Live featured talk with Percona Live Data Performance Conference 2016 speakers! In this series of blogs, we’ll highlight some of the speakers that will be at this year’s conference, as well as discuss the technologies and outlooks of the speakers themselves. Make sure to read to the end to get a special Percona Live registration bonus!

In this Percona Live featured talk, we’ll meet Ying Qiang Zhang, Database Kernel Expert for the Alibaba Group. His talk will be What’s new in AliSQL — Alibaba’s branch of MySQL, a session that introduces AliSQL, the Alibaba Group’s branch of Oracle MySQL, and shows how AliSQL can support 140,000 order creations per second. I had a chance to speak with Ying and learn a bit more about AliSQL:

Percona: Give me a brief history of yourself: how you got into database development, where you work, what you love about it.

Ying: My first step in my MySQL journey began in my graduate student period. I participated in lots of projects using MySQL as storage. At that time, I thought MySQL was a masterpiece of computer science theory and a well-demonstrated engineering implementation of a real RDBMS. By referencing MySQL source code, I solved many problems in my project.

Before joining the Alibaba group, I was a MySQL Kernel developer at Baidu Co., Ltd. I joined the Alibaba group in 2014 as the developer and maintainer of AliSQL, a MySQL fork of Alibaba.

The dramatic growth of Alibaba’s E-Commerce business puts extremely harsh demands on our database system. AliSQL faces bigger and bigger challenges. I like the challenges, and have tried my best to make AliSQL faster, safer and more scalable – which in turn makes our OLTP system more efficient and smooth.

Percona: Your talk is going to be on “What’s new in AliSQL – Alibaba’s branch of MySQL” So this was a version of MySQL that was put together specifically for the Alibaba group’s online commerce? What prompted that need and why a special MySQL implementation?

Ying: AliSQL is a fork of MySQL integrated with Alibaba’s business characteristics and requirements (based on a community version). The primary incentive of maintaining this fork are:

  1. As the largest E-Commerce platform in the world, the throughput of Alibaba’s online transaction processing system is huge, especially on days like Alibaba Singles’ Day shopping festival (China’s version of “Cyber Monday” or “Black Friday”). The databases behind the OLTP system faces the challenge of high throughput, high concurrency and low latency at the same time (requirements the community version of MySQL cannot meet).
  2. Under high-stress scenarios, we found some MySQL bugs that impact system stability. We couldn’t wait for new releases of the community version to fix these bugs. Usually we have to fix the bugs in very limited time, and then we report the bugs, as well as the patches, to the community.
  3. In Alibaba, the differences between the responsibilities of an application developer and database administrator are significant. We have a very professional DBA team, and DBAs are well aware of the database system and need more features to manipulate MySQL: flow control, changing/controlling execution plan, controlling the watermark, setting blacklist without the involvement of application developer, etc. And the community version of MySQL lacks these features. The private cloud user needs these features even more than a public cloud user.

Percona: Are there differences in the online processing experience in China that are different than other countries? Especially for the Singles’ Day event?

Ying: China has a huge population and a huge netizen base. With the rapid growth of China’s economy, the purchasing power of Chinese netizens grows stronger and stronger. According to published data, Alibaba’s sales during the 2014 Singles’ Day shopping festival were 8.99 billion USD, almost five times more than Cyber Monday or Black Friday online sales in the United States for the same year. In 2015, the amount reached 14.3 billion USD. Alibaba’s E-Commerce platform sales reached one billion RMB in the first 1 minute and 12 seconds of Singles’ Day 2015. Millions of people usually try to buy the same commodity at the same time on that day. This is a huge challenge for Alibaba’s online transaction processing system and, of course, for the databases sitting in the backend.

Percona: What do you see as an issue that we the database community needs to be on top of with AliSQL? What keeps you up at night with regard to the future of MySQL?

Ying: In my opinion, as an open source project MySQL has focused on single users or public cloud users (the “M” of “LAMP”). But as MySQL grows, we need to pay more attention to enterprise and big private cloud users. Features such as administration, scalable cluster solutions, and performance optimizations for extreme scenarios are essential to enterprise users.

Percona: What are you most looking forward to at Percona Live Data Performance Conference 2016?

Ying: I am looking forward to communicating with MySQL users from all over the world, to see if we can help the community grow even bigger with my two cents. I am also looking forward to making more friends in the MySQL world.

You can read more about AliSQL at Ying’s website:

Want to find out more about Ying and AliSQL? Register for Percona Live Data Performance Conference 2016, and see his talk What’s new in AliSQL — Alibaba’s branch of MySQL. Use the code “FeaturedTalk” and receive $100 off the current registration price!

Percona Live Data Performance Conference 2016 is the premier open source event for the data performance ecosystem. It is the place to be for the open source community as well as businesses that thrive in the MySQL, NoSQL, cloud, big data and Internet of Things (IoT) marketplaces. Attendees include DBAs, sysadmins, developers, architects, CTOs, CEOs, and vendors from around the world.

The Percona Live Data Performance Conference will be April 18-21 at the Hyatt Regency Santa Clara & The Santa Clara Convention Center.

Is Adaptive Hash Index in InnoDB right for my workload?

April 12, 2016 - 7:26am

This blog post will discuss what the Adaptive Hash Index in InnoDB is used for, and whether it is a good fit for your workload.

Adaptive Hash Index (AHI) is one of the least understood features in InnoDB. In theory, it magically determines when it is worth supplementing InnoDB B-Tree-based indexes with fast hash lookup tables and then builds them automatically without a prompt from the user.

Since AHI is supposed to work “like magic,” it has very little configuration available. Early versions had no configuration options at all. Later versions added innodb_adaptive_hash_index to disable AHI if required (by setting it to “0” or “OFF”). MySQL 5.7 added innodb_adaptive_hash_index_parts, which partitions AHI to reduce contention. (FYI, this feature existed in Percona Server as innodb_adaptive_hash_index_partitions since version 5.5.)

To understand AHI’s impact on performance, think about it as if it were a cache. If an AHI “Hit” happens, we have much better lookup performance; if it is an AHI “Miss,” then performance gets slightly worse (as checking a hash table for matches is fast, but not free).

This is not the only part of the equation, though. In addition to the lookup cost, there is also the cost of AHI maintenance. We can compare maintenance costs – measured as rows added to and removed from AHI – to successful lookups. A high ratio means a lot of lookups were sped up at low cost. A low ratio means the opposite: we’re probably paying too much in maintenance for little benefit.

Finally, there is also the cost of extra contention. If your workload consists of lookups to a large number of indexes or tables, you can probably reduce the impact by setting innodb_adaptive_hash_index_parts appropriately. If there is a hot index, however, AHI could become a bottleneck at high concurrency and might need to be disabled.

To determine if AHI is likely to help your workload, verify that both the AHI hit ratio and the ratio of successful lookups to maintenance operations are as high as possible.
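These ratios can be computed from the server’s AHI counters (in MySQL 5.7, information_schema.innodb_metrics exposes adaptive_hash_searches, adaptive_hash_searches_btree, adaptive_hash_rows_added and adaptive_hash_rows_removed; some of them may need to be enabled via innodb_monitor_enable). A minimal sketch of the arithmetic, using made-up sample values rather than real server output:

```python
# Sketch: compute the two AHI health ratios discussed above from
# counter deltas sampled over an interval. The counter names mirror
# information_schema.innodb_metrics in MySQL 5.7; the sample values
# below are illustrative, not real measurements.

def ahi_ratios(hash_searches, btree_searches, rows_added, rows_removed):
    """Return (hit_ratio, maintenance_ops_per_hit) for one interval."""
    total_lookups = hash_searches + btree_searches
    hit_ratio = hash_searches / total_lookups if total_lookups else 0.0
    maintenance_ops = rows_added + rows_removed
    maint_per_hit = (maintenance_ops / hash_searches) if hash_searches else float("inf")
    return hit_ratio, maint_per_hit

# A healthy interval: almost every lookup hits AHI, with little churn.
hit, churn = ahi_ratios(hash_searches=99_000, btree_searches=1_000,
                        rows_added=16, rows_removed=0)
print(f"hit ratio: {hit:.1%}, maintenance ops per hit: {churn:.5f}")
```

A hit ratio near 100% with near-zero maintenance ops per hit suggests AHI is paying off; a low hit ratio with heavy churn suggests disabling it.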

Let’s investigate what really happens for some simple workloads. I will use a basic sysbench lookup by primary key – the simplest workload possible. We’ll find that even in this case AHI exhibits a number of different behaviors.

For this test, I am using MySQL 5.7.11 with a 16GB buffer pool. The base command line for sysbench is:

sysbench --test=/usr/share/doc/sysbench/tests/db/select.lua   --report-interval=1 --oltp-table-size=1 --max-time=0 --oltp-read-only=off --max-requests=0 --num-threads=1 --rand-type=uniform --db-driver=mysql --mysql-password=password --mysql-db=test_innodb  run

Looking up a single row

Notice oltp-table-size=1 from above; this is not a mistake, but tests how AHI behaves in the most basic case:

And it works perfectly: there is a 100% hit ratio with no AHI maintenance operations to speak of.

10000 rows in the table

When we change the OLTP table setting to oltp-table-size=10000 , we get the following picture:

Again, we see almost no overhead. There is a rare incident of 16 rows or so being added to AHI (probably due to an AHI hash collision). Otherwise, it’s almost perfect.

10M rows in the table

If we change the setting to oltp-table-size=10000000, we now have more data (but still much less than buffer pool size):

In this case, there is clearly a warm-up period before we get close to the 100% hit ratio – and it never quite hits 100% (even after a longer run). In this case, maintenance operations appear to keep going without showing signs of asymptotically reaching zero. My take on this is that with 10M rows there is a higher chance of hash collisions – causing more AHI rebuilding.

500M rows in the table, uniform distribution

Let’s now set the OLTP table size as follows: oltp-table-size=500000000. This will push the data size beyond the InnoDB buffer pool size.

Here we see a lot of buffer pool misses, causing a very poor AHI hit ratio (never reaching 1%). We can also see a large overhead of tens of thousands of rows added to and removed from AHI. Obviously, AHI is not adding any value in this case.

500M rows, Pareto distribution

Finally, let’s keep the setting oltp-table-size=500000000 and add --rand-type=pareto. The --rand-type=pareto setting enables a skewed distribution, a more typical scenario for many real-life data access patterns.

In this case, we see the AHI hit ratio gradually improving, reaching close to 50%. The AHI maintenance overhead goes down, but never reaches a level that suggests AHI is worth it.

It is important to note in both this and the previous case that AHI has not reached a “steady state” yet. A steady state condition shows the number of rows added and removed becoming close to equal.

As you can see from the math in the workloads shown above, the Adaptive Hash Index in InnoDB “magic” doesn’t always happen! There are cases when AHI is indeed helpful, and others when it adds a lot of data structure maintenance overhead and takes memory away from the buffer pool – not to mention the contention overhead. In those cases, it’s better to disable AHI.

Unfortunately, AHI does not seem to have the logic built-in to detect if there is too much “churn” going on to make maintaining AHI worthwhile.

I suggest using these numbers as a general guide to decide whether AHI is likely to benefit your workload. Make sure to run a test/benchmark to be sure.
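If the benchmark shows AHI hurting your workload, it can be switched off (and back on) at runtime without a restart:

```sql
-- Disable the Adaptive Hash Index dynamically; no restart required.
SET GLOBAL innodb_adaptive_hash_index = OFF;
-- Re-enable it if the benchmark says otherwise:
SET GLOBAL innodb_adaptive_hash_index = ON;
```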

Interested in learning more about other InnoDB Internals? Please join me for the Innodb Architecture and Performance Optimization Tutorial at Percona Live!

Dealing with Jumbo Chunks in MongoDB

April 11, 2016 - 3:39pm

In this blog post, we will discuss how to deal with jumbo chunks in MongoDB.

You are a MongoDB DBA, and your first task of the day is to remove a shard from your cluster. It sounds scary at first, but you know it is pretty easy. You can do it with a simple command:

db.runCommand( { removeShard: "server1_set6" } )

MongoDB then does its magic. It finds the chunks and databases and balances them across all other servers. You can go to sleep without any worry.

The next morning when you wake up, you check the status of that particular shard and you find the process is stuck:

"msg" : "draining ongoing", "state" : "ongoing", "remaining" : { "chunks" : NumberLong(3), "dbs" : NumberLong(0)

There are three chunks that for some reason haven’t been migrated, so the removeShard command is stalled! Now what do you do?

Find chunks that cannot be moved

We need to connect to mongos and check the catalog:

mongos> use config
switched to db config
mongos> db.chunks.find({shard:"server1_set6"})

The output will show three chunks, with minimum and maximum _id keys, along with the namespace where they belong. But the last part of the output is what we really need to check:

{ [...] "min" : { "_id" : "17zx3j9i60180" }, "max" : { "_id" : "30td24p9sx9j0" }, "shard" : "server1_set6", "jumbo" : true }

So, the chunk is marked as “jumbo.” We have found the reason the balancer cannot move the chunk!

Jumbo chunks and how to deal with them

So, what is a “jumbo chunk”? It is a chunk whose size exceeds the maximum specified by the chunk size configuration parameter (64 MB by default). When a chunk grows past that limit, the balancer won’t move it.
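The balancer’s rule is simple enough to sketch (the helper below is hypothetical, written for illustration – it is not MongoDB’s internal API):

```python
# Sketch of the jumbo rule described above: a chunk larger than the
# configured chunk size (64 MB by default) is flagged jumbo and is
# skipped by the balancer. Illustrative helper, not MongoDB internals.

DEFAULT_CHUNK_SIZE_MB = 64

def is_jumbo(chunk_bytes, chunk_size_mb=DEFAULT_CHUNK_SIZE_MB):
    return chunk_bytes > chunk_size_mb * 1024 * 1024

print(is_jumbo(63 * 1024 * 1024))   # a movable chunk
print(is_jumbo(200 * 1024 * 1024))  # stuck until it is split
```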

The way to remove the flag from those chunks is to split them manually. There are two ways to do it:

  1. You can specify at what point to split the chunk, specifying the corresponding _id value. To do this, you really need to understand how your data is distributed and what the settings are for min and max in order to select a good splitting point.
  2. You can just tell MongoDB to split it by half, letting it decide which is the best possible _id. This is easier and less error prone.

To do it manually, you need to use sh.splitAt(). For example:

sh.splitAt("dbname.collection", { _id: "19fr21z5sfg2j0" })

In this command, you are telling MongoDB to split the chunk in two using that _id as the cut point.

If you want MongoDB to find the best split point for you, use the sh.splitFind() command. In this particular case, you only need to specify a key (any key) that is part of the chunk you want to split. MongoDB will use that key to find that particular chunk, and then divide it into two parts using the _id that sits in the middle of the list.

sh.splitFind("dbname.collection", { _id : "30td24p9sx9j0" })

Once the three chunks have been split, the jumbo flag is removed and the balancer can move them to a different server. removeShard will complete the process, and you can drink a well-deserved coffee.

Downloading MariaDB MaxScale binaries

April 11, 2016 - 12:13pm

In this blog post we’ll discuss a caveat when downloading MariaDB MaxScale binaries.

Following the performance results in my last two posts on sysbench and primary keys, I wanted to measure the overhead of proxy servers like ProxySQL and MaxScale.

Unfortunately, I found that MaxScale binaries are not available without registering on the portal. That in itself isn’t a bad thing, but to complete the registration you need to accept an Evaluation Agreement. The agreement requires you to comply with the MariaDB Enterprise Terms and Conditions (you can find the text of the agreement here: MariaDB_Enterprise_Subscription_Agreement_US_v14_0).

Personally, I don’t agree with MariaDB’s “Evaluation Agreement” or the “MariaDB Enterprise Terms and Conditions,” so it left me without binaries!

In general, I strongly advise you to carefully read both documents – or, even better, ask your legal team if you can accept MariaDB’s “Evaluation Agreement.”

Fortunately, MaxScale’s source code is available, so I built the binaries myself, which I will share with you in this post! You can get the MaxScale 1.4.1 binaries here. No “Evaluation Agreement” needed!

I will follow up in a future post with my proxies testing results.

MySQL Data at Rest Encryption

April 8, 2016 - 12:29pm

This blog post will discuss the issues and solutions for MySQL Data at Rest encryption.

Data at Rest Encryption is not only a good-to-have feature, but it is also a requirement for HIPAA, PCI and other regulations.

There are three major ways to solve data encryption at rest:

  1. Full-disk encryption
  2. Database-level (table) encryption
  3. Application-level encryption, where data is encrypted before being inserted into the database

I consider full disk encryption to be the weakest method, as it only protects against someone physically removing the disks from the server. Application-level encryption, on the other hand, is the best: it is the most flexible method with almost no overhead, and it also solves in-flight encryption. Unfortunately, it is not always possible to change the application code to support application-level encryption, so database-level encryption can be a valuable alternative.

Sergei Golubchik, Chief Architect at MariaDB, outlined the pluses and minuses of database level encryption during his session at Percona Live Amsterdam:


Pluses:

  • Full power of DBMS is available
  • Easy to implement
  • Only the database can see the data
  • Per-table encryption, per-table keys, performance

Minuses:

  • Cannot be done per-user
  • Does not protect against a malicious root user

Data at Rest Encryption: Database-Level Options

Currently, there are two options for data at rest encryption at the database level:

  • MariaDB’s data at rest encryption (MariaDB 10.1)
  • MySQL 5.7.11 InnoDB tablespace encryption (also in Percona Server 5.7.11)

MariaDB’s implementation is different from MySQL 5.7.11’s. MySQL 5.7.11 only encrypts InnoDB tablespaces, while MariaDB can also encrypt undo/redo logs, binary logs/relay logs, etc. However, there are some limitations (especially when used together with Galera Cluster):

  • No key rotation in the open source plugin version (MySQL 5.7.11 has key rotation)
  • mysqlbinlog does not work with encrypted binlogs (bug reported)
  • Percona XtraBackup does not work, so we are limited to rsync as the SST method for Galera Cluster, which is a blocking method (one node will not be available for writes during the SST); the latest Percona XtraBackup does work with MySQL 5.7.11 tablespace encryption
  • The following data is not encrypted (bug reported)
    • Galera gcache + Galera replication data
    • General log / slow query log

Database level encryption also has its weakness:

  1. The root and mysql OS users can read the keyring file, which defeats the purpose. However, it is possible to place the key on a mounted drive and unmount it after MySQL starts (this can be scripted). The downside is that if MySQL crashes, it will not restart automatically without human intervention.
  2. Both the MariaDB and MySQL versions only encrypt data when writing to disk – data is not encrypted in RAM, so a root user can potentially attach to MySQL with gdb/strace or other tools and read the server’s memory. In addition, with gdb it is possible to change the root user’s password structure and then use mysqldump to copy the data. Another potential method is to kill MySQL and start it with skip-grant-tables. However, if the key is unmounted (i.e., on a USB drive), MySQL will either not start or will not be able to read the encrypted tablespaces.

MariaDB Encryption Example

To enable full encryption, we can add the following options to my.cnf:

[mysqld]
file_key_management_filekey = FILE:/mount/keys/mysql.key
file-key-management-filename = /mount/keys/mysql.enc
innodb-encrypt-tables = ON
innodb-encrypt-log = 1
innodb-encryption-threads = 1
encrypt-tmp-disk-tables = 1
encrypt-tmp-files = 1
encrypt-binlog = 1
file_key_management_encryption_algorithm = AES_CTR

After starting MariaDB with those settings, it will start encrypting the database in the background. The file_key_management plugin is used; unfortunately, it does not support key rotation. The actual keys are encrypted with:

# openssl enc -aes-256-cbc -md sha1 -k <key> -in keys.txt -out mysql.enc

The encryption <key> is placed in /mount/keys/mysql.key.
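For reference, the keys.txt file that the openssl command above encrypts follows the file_key_management plugin’s expected format: one <key id>;<hex-encoded key> entry per line. The key below is a made-up example, not a real key:

```
# keys.txt (plaintext, before encryption into mysql.enc)
# <encryption_key_id>;<hex-encoded AES key>
1;770A8A65DA156D24EE2A093277530142770A8A65DA156D24EE2A093277530142
```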

After starting MariaDB, we can unmount the “/mount/keys” partition. In this case, the key will not be available, and a potential hacker will not be able to restart MySQL with the --skip-grant-tables option (and thereby bypass passwords). However, it also prevents normal restarts, especially SSTs (cluster full sync).

Additional notes:

  1. Encryption will affect the compression ratio, especially for physical backups (logical backups made with mysqldump are not affected, as the retrieved data is not encrypted). If your original compressed backup was only 10% of the database size, that will not be the case for encrypted tables.
  2. Data is not encrypted in flight, and will not be encrypted on the replication slaves unless you enable the same options on the slaves. Encryption is also local to the server, so when encryption has just been enabled on a server, some tables may not be encrypted yet (but they will be eventually).
  3. To check which tables are encrypted, use the Information Schema INNODB_TABLESPACES_ENCRYPTION table, which contains encryption information. To find all tables that are encrypted, use this query:
    select * from information_schema.INNODB_TABLESPACES_ENCRYPTION where ENCRYPTION_SCHEME=1

MySQL 5.7 Encryption Example

To enable encryption, add the following option to my.cnf:

[mysqld]
keyring_file_data = /mount/mysql-keyring/keyring

Again, after starting MySQL we can unmount the “/mount/mysql-keyring/” partition.

To start encrypting the tables, we will need to run alter table table_name encryption='Y' , as MySQL will not encrypt tables by default.

The latest Percona XtraBackup also supports encryption, and can back up encrypted tables.

To find all encrypted tablespaces in MySQL/Percona Server 5.7.11, we can use information_schema.INNODB_SYS_TABLESPACES and the flag field. For example, to find normally encrypted tables, use the following query:

mysql> select * from information_schema.INNODB_SYS_TABLESPACES where flag = 8225\G
*************************** 1. row ***************************
         SPACE: 4688
          NAME: test/t1
          FLAG: 8225
   FILE_FORMAT: Barracuda
    ROW_FORMAT: Dynamic
     PAGE_SIZE: 16384
 ZIP_PAGE_SIZE: 0
    SPACE_TYPE: Single
 FS_BLOCK_SIZE: 4096
     FILE_SIZE: 98304
ALLOCATED_SIZE: 98304
*************************** 2. row ***************************
         SPACE: 4697
          NAME: sbtest/sbtest1_enc
          FLAG: 8225
   FILE_FORMAT: Barracuda
    ROW_FORMAT: Dynamic
     PAGE_SIZE: 16384
 ZIP_PAGE_SIZE: 0
    SPACE_TYPE: Single
 FS_BLOCK_SIZE: 4096
     FILE_SIZE: 255852544
ALLOCATED_SIZE: 255856640
2 rows in set (0.00 sec)

You can also use this query instead: select * from information_schema.tables where CREATE_OPTIONS like '%ENCRYPTION="Y"%';.
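Where does the magic number 8225 come from? It can be decoded from InnoDB’s tablespace flag bits. The bit positions below follow my reading of fsp0types.h in MySQL 5.7 – treat them as an assumption to verify against your server’s source:

```python
# Decode a MySQL 5.7 InnoDB tablespace FLAG value. Bit positions are
# taken from storage/innobase/include/fsp0types.h (5.7); verify them
# against your source tree before relying on this.
POST_ANTELOPE = 1 << 0   # file format newer than Antelope
ATOMIC_BLOBS  = 1 << 5   # Barracuda / Dynamic row format
ENCRYPTION    = 1 << 13  # tablespace encryption bit

def is_encrypted(flag):
    return bool(flag & ENCRYPTION)

# 8225 = 0x2021 = POST_ANTELOPE | ATOMIC_BLOBS | ENCRYPTION
print(is_encrypted(8225))  # the encrypted tablespaces from the query above
print(is_encrypted(33))    # same row format, without encryption
```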

Performance overhead

This is a debatable topic, especially for the MariaDB implementation when everything is configured to be encrypted. During my tests I’ve seen ~10% of overhead for the standalone MySQL instance, and ~20% with Galera Cluster.

The MySQL 5.7/Percona Server 5.7 tablespace-level encryption shows extremely low overhead; however, that needs to be tested under different conditions.


Even with all the above limitations, database-level encryption can be a better option than filesystem-level encryption if the application cannot be changed. However, it is a new feature (especially the MySQL 5.7.11 version), and I expect a number of bugs.
