
OpenSSH CVE-2016-0777: Details and Mitigation

January 14, 2016 - 12:04pm

Earlier today, advisories were sent out regarding OpenSSH versions 5.4 through 7.1, informing users about a security bug in the software. In essence, the advisory instructed people to add the UseRoaming no option to their ssh_config file, with a promise that further information would be made available shortly.

 

Add undocumented "UseRoaming no" to ssh_config or use "-oUseRoaming=no" to prevent upcoming #openssh client bug CVE-2016-0777. More later.

— markus (@msfriedl) January 14, 2016

The post on the security issue at OpenBSD Journal can be seen here: http://undeadly.org/cgi?action=article&sid=20160114142733

This information was then later released detailing the issue and the implications here: http://www.openssh.com/txt/release-7.1p2

The statement below summarized the main issue:

“The matching server code has never been shipped, but the client code was enabled by default and could be tricked by a malicious server into leaking client memory to the server, including private client user keys.”

So what does this all mean? Simply speaking, it means a malicious or compromised server could potentially retrieve the user's private SSH keys from memory. The stolen keys could then be used to authenticate against servers.

(2FA helps to protect servers from the use of stolen keys, however this is not in as widespread use as it should be.)

Until your OpenSSH packages are updated, you can use the following mitigation options to protect yourself (a quick shell sketch follows the list):

  1. In your ~/.ssh/config:
     Host *
         UseRoaming no
  2. In your system-wide ssh_config (Linux: /etc/ssh/ssh_config, OS X: /private/etc/ssh/ssh_config), adding the same option.
  3. On each CLI execution:
     ssh -oUseRoaming=no <hostname>
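For example, a minimal shell sketch of checking and applying the mitigation (this assumes appending to ~/.ssh/config is appropriate for your setup and the option is not already managed elsewhere):

ssh -V                                          # confirm the client version (5.4 through 7.1 are affected)
printf 'Host *\n    UseRoaming no\n' >> ~/.ssh/config
ssh -oUseRoaming=no <hostname>                  # per-invocation alternative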

Personally, I’ve used a combination of 1 and 2, as often a ~/.ssh/config cleanup is required. Make sure to check that your OpenSSH is correctly configured, and keep watching for updates.

Prometheus as an Engine for MySQL Monitoring

January 14, 2016 - 7:23am

When I first discovered Graphite years ago, I was very impressed with its monitoring capabilities.  Compared to many RRD-based tools that were popular at the time (like Cacti), Graphite separated the captured data and graphs, allowing you to do all kinds of math and transformations while visualizing data. For example, I could plot the relationship between system queries and disk IO, and capture how the system was getting more IO bound over time. It also had reasonably high performance, allowing me to capture high-resolution data for medium-sized systems.

Just last year I discovered Prometheus, and it also impressed me. I think it has the potential to take Graphite’s flexibility to the next level. Though I am in no way a Prometheus expert, I  want to share my understanding and thoughts on it so far.

Data Model

The data model is perhaps what attracted me to Prometheus the most. While it’s not obvious at first, once you figure it out it offers fantastic flexibility.

In the data model used by Whisper and Carbon in Graphite, you will use something like this to store MySQL data:

myapp.store.mysql.db01.status.questions = 5000

You can set up any hierarchy you like, but it has to have a hierarchy.

What Prometheus does instead is allow you to use a set of key-value pairs. The same data shown above could be presented like this:

questions_total{app=”myapp”,subsystem=”store”,engine=”mysql”,host=”db01”, source=”status”} = 5000

(You most likely wouldn’t use this exact structure, but it’s good for illustration.)

The difference between these approaches is that Prometheus provides you multiple dimensions on which you can filter and aggregate, plus you can add those dimensions later as you need them (without needing to redesign your tree hierarchy).

These labels are very dynamic, and I can change them in a second. For example, a MySQL server reporting as a “Master” might start reporting as a “Slave” in the very next second, and its data will be aggregated differently.

This is especially important in the modern, often cloud-based and virtualized world. For example, using Prometheus it is very easy to tag servers by their region or availability zones. I can also do things like compute MySQL space usage by both the database and storage engine. The possibilities are endless.
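To make this concrete, here is a rough PromQL sketch reusing the illustrative questions_total metric above (the az label and the mysql_size_bytes metric are assumptions for illustration):

sum(rate(questions_total[5m])) by (engine)    # query rate aggregated per storage engine
sum(rate(questions_total[5m])) by (az)        # the same data, sliced per availability zone
sum(mysql_size_bytes) by (db, engine)         # hypothetical size metric, grouped by database and engine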

Data Capture

Unlike Graphite – where the main model is push, and the hosts themselves choose what kind of information they want to push to the monitoring system and at which intervals – with Prometheus you set up “Exporters” that have the ability to export the data. It is up to the Prometheus server configuration to choose what data to sample and how frequently.
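For illustration, a minimal sketch of such a server-side scrape configuration (field names roughly as in the 0.16-era YAML config; the job name, port and hostnames are assumptions):

global:
  scrape_interval: 5s
scrape_configs:
  - job_name: mysql
    target_groups:
      - targets: ['db01:9104', 'db02:9104']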

The clear benefit of Prometheus’ approach is that you can have as many servers as you like pulling the data, so it is very easy to create a development version of your system and play around with it – without affecting production. It also provides a simple pathway to high availability.

(Both the push and pull approaches have their benefits and drawbacks. Brian Brazil wrote an excellent article advertising the pull model of monitoring.)

Prometheus does create a few challenges for me. Unless I want to set up Service Discovery, it is a hassle to monitor any development/test VMs I might spin up (that would otherwise not be open to external access at all). While this isn’t the main use case for Prometheus, it is helpful for me to test the dashboard’s behavior with different operating systems, workloads, etc.

A more significant issue I discovered is dealing with data that can't be captured by multiple servers, because the data capture itself causes the data to change.

Here is a specific example: if I look at the events_statements_summary_by_digest table in PERFORMANCE_SCHEMA, there is a MAX_TIMER_WAIT field that shows me the maximum query execution time for each query pattern. If I want to get the maximum query execution time for every minute, for example, I need to “truncate” the table to reset the statistics and let the maximum value be computed again. If I don't perform that operation, the data becomes meaningless. If I make the exporter reset the statistics during the poll, however, I can't pull the data from two Prometheus servers.

This is one instance where the PERFORMANCE_SCHEMA design could be better. I could set up a cron job or a MySQL Event to clear out the statistics regularly and get a proper maximum value for every five minutes, but that isn't an overly convenient solution.
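For illustration, a minimal SQL sketch of that workaround (the event name and the five-minute interval are assumptions, and the event scheduler must be enabled):

-- inspect the per-digest maximums
SELECT DIGEST_TEXT, MAX_TIMER_WAIT
  FROM performance_schema.events_statements_summary_by_digest
 ORDER BY MAX_TIMER_WAIT DESC
 LIMIT 10;

-- reset the statistics on a schedule (requires event_scheduler=ON)
CREATE EVENT reset_statement_digests
  ON SCHEDULE EVERY 5 MINUTE
  DO TRUNCATE TABLE performance_schema.events_statements_summary_by_digest;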

Another issue I discovered is that Prometheus doesn't have any protection from bad (long) samples, or a very good method of detecting them. Let's imagine that I have a MySQL server and I'm sampling status data every second. For some reason the call to SHOW GLOBAL STATUS took five seconds to execute. The truth is, we don't really know which point in those five seconds the SHOW GLOBAL STATUS output corresponds to – it might be the very start, it might be the very end. As such, you don't really know how to process the counters. Whatever you do, you're likely to be wrong. My preference in this case is to simply discard such samples, because even missing one percent of the samples is unlikely to change the whole picture. Constantly questioning whether you really had a couple of seconds where the QPS spiked to ten times the normal rate, or whether it was an invalid sample, is not something on which I want to waste a lot of time!

My preferred approach is to configure the SHOW GLOBAL STATUS capture so that if it takes more than ten percent of the capture interval, it is discarded. For example, with a one-second capture I would allow 100ms for the capture. If the system can't keep up with this, it would be better not to fool myself, and instead reduce the capture resolution to around five seconds.

The only protection Prometheus allows is to configure the scrape_timeout, but unfortunately it is limited to one-second resolution at this point. This is too coarse for any high-resolution capture.

Finally, it is also inconvenient to specify different resolutions for different data. In MySQL there is often a lot of data that I want to capture, but the resolution needed differs. For example, SHOW GLOBAL STATUS with one-second resolution is a must. At the same time, capturing table size information from INFORMATION_SCHEMA with a one-second resolution would put too much load on MySQL, especially if there are a lot of tables. That level of resolution isn't really needed in this case.
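In practice, the main lever available is per-job scrape settings, roughly like the sketch below (job names, ports and intervals are assumptions, and splitting metrics across separate exporter endpoints is itself an assumption; syntax per the 0.16-era YAML config):

scrape_configs:
  - job_name: mysql-status            # high-resolution counters (SHOW GLOBAL STATUS)
    scrape_interval: 1s
    scrape_timeout: 1s                # one second is the finest timeout granularity available
    target_groups:
      - targets: ['db01:9104']
  - job_name: mysql-table-sizes       # expensive INFORMATION_SCHEMA data, sampled rarely
    scrape_interval: 5m
    target_groups:
      - targets: ['db01:9105']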

An attractive thing about Prometheus is that the Prometheus development team uses it a lot for MySQL monitoring, so the MySQL Exporter is really good. Most MySQL monitoring plugins I find resort to reporting just a few basic statistics, which is not nearly enough for advanced diagnostics. The Prometheus MySQL exporter captures tons of data and has been adding more with every version.

I also very much like that the Prometheus Exporters are built on the HTTP protocol. This makes it very easy to debug or see what kind of data they capture: you can view their output simply with a web browser.

Computational Model

I think the basic operations in Prometheus are pretty intuitive, but if you look at some of the advanced behaviors you’re going to find some inconveniences and some things that are likely to surprise you.

One inconvenience is that Prometheus is mainly designed for working with high-resolution data. If there are holes of more than five minutes (by default) in a time series, it can disappear from the graphs. As I mentioned, for MySQL there is quite a lot of information that makes sense to capture at a resolution lower than five minutes.

Prometheus functions look into the “past,” and are designed so that the value of a function at any time (T) at which it could be computed is not going to change. It all looks clean and logical, but it causes issues when there are holes in the data capture.

As an example, let’s imagine the following five seconds, during which the total number of questions (counted from the start) was successfully scraped in some seconds but not others (due to a network issue, overload, etc.):

1 - 10
2 - 20
3 - X
4 - X
5 - 200

When we capture data of the “counter” type, the most important value is not the actual counter value at a given time, but the rate of change of the counter over different time intervals. In this case, for example, the query rate was ten QPS for the one-to-two-second interval, which can be computed directly. But what was the query rate in the three-to-four-second interval? We don't have exact data, but that is fine: we know there were 180 queries during the two-to-five-second interval, giving us 60 QPS (which we can use for the three-to-four-second interval).

This is NOT, however, how Prometheus will compute it if you use the irate() function (which is supposed to give you the highest resolution possible). When you evaluate irate() at T=4, it doesn't use the T=5 value, even if it is already in the database. Instead, it looks back, finds the most recent complete interval (one to two) and uses the corresponding value of ten QPS.

I find this pretty puzzling and inconvenient.

There is also the rate() function, which can be used to get the average rate for a period. Unfortunately, it can't estimate the rate for a smaller period based on the available data for a longer period. So, for example, if I ask the rate() function to compute the query rate at T=4, looking one second back, it will return no data. This isn't a big deal when you're working with data manually, but if you're building zoomable dashboards it means you can zoom in to the point where the data disappears (rather than stopping at the best possible value available).
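To make this concrete, here is a rough PromQL sketch evaluated at T=4 against the sample series above (questions_total is the illustrative counter name used earlier):

irate(questions_total[5s])   # only looks back: uses the t=1 and t=2 samples, giving 10 QPS
rate(questions_total[1s])    # fewer than two samples fall inside the one-second window: no data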

Storage

Prometheus has its own high-performance storage system, based in part on LevelDB. It is highly optimized for time series and can achieve a very high level of compression. Be ready, though: each of your label combinations creates a separate time series at the low level, and will require a lot of files. This isn't really a problem with SSD drives and modern file systems, but it is something to look out for.

The capture engine and storage systems are rather efficient. Even though Prometheus does not have built in clustering for “scaling out,” you can reportedly get more than 300K metrics per second captured on a single node. You can also use multiple Prometheus servers as needed.

The problem I found with Prometheus' storage is that it is very self-contained: you can only use it from Prometheus or access it through the HTTP API. There are no tools at this point to export it for advanced analysis with R, or to dump the whole database into something like JSON format so it can be loaded into a different database engine. Some of these features might already be on the roadmap.

Purging and Aggregation

Retention configuration in Prometheus is pretty spartan. You can set storage.local.retention to the length you want to store the data, but that’s it. You can’t configure it to purge different data at different times. You can run multiple Prometheus instances to achieve this, but it’s quite a hassle.  It’s also not possible to instruct Prometheus to automatically build summaries in order to execute low resolution queries faster.

For example, if I have MySQL's query rate captured every second, but I want to view the data over a long time period (e.g., how it changed over the last three months to estimate growth trends), data aggregated at one-hour intervals would be enough for that purpose.

There is support for recording rules to help achieve some of this, but it is not explicit or convenient in my opinion.
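Recording rules look roughly like the sketch below (the rule name is an assumption, using the pre-2.0 rule file syntax), pre-computing an hourly rate that longer-range queries can then read cheaply:

mysql:questions:rate1h = rate(questions_total[1h])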

Looking at the Prometheus roadmap, some of these issues might not be fixed in Prometheus but achieved through integrating other systems such as InfluxDB (where experimental support already exists).

Purpose

A lot of these limitations make sense if you look at the purpose for which Prometheus was created: getting high-resolution data and being able to provide as much troubleshooting information as possible to its alerting engine. It is not really designed for storing extensive history. Too bad! I would very much like to get both of those properties in a single system!

Visualization

Prometheus comes with a built-in Expression Browser, which is great for debugging and interactive analysis. It also allows you to see what data you actually have in the database. It will disappoint you, however, if you're looking for beautiful graphs!

This shows I have the information about MySQL query rate from two servers, as well as the available and configured labels.

If I want to pick one server and look at the average rate of queries per five minutes, I can do this:

There are some tools available in the graph to choose the time range and resolution.

You should be aware that visualizing data with the rate() function often shows you things that do not exist. In this case, it looks like the number of queries was gradually creeping up. In reality, I just started the benchmark, so the number of queries jumped almost immediately. This is what the real situation looks like (using irate()):

As I explained before, irate() does not handle missing data points very well, plus it behaves somewhat bizarrely when you “zoom out” – providing instant rate information at sparse intervals (e.g., the instant rate computed every one second over 60 seconds) rather than smoothing things to averages.

There is also the PromDash tool available for Prometheus, which gives you nicer looking dashboards and supports a lot of Prometheus’ features. Now that Grafana has official support for Prometheus, it is my preferred tool to build dashboards – especially since it supports multiple data sources besides Prometheus.

Summary

I’m very excited about Prometheus. It allows me to get a lot of information easily and use it for performance analysis in benchmarking or troubleshooting. It would be great if it also had a simple integrated solution for long-term storage and trending. I am also looking forward to better integration with Grafana and better documentation on how to create Prometheus-based dashboards – especially with some Prometheus-based examples!

Note: All of the above was written about Prometheus 0.16.1. Prometheus is rapidly evolving and details may change in newer versions.

Play the Percona Powerball Pool!!

January 12, 2016 - 10:25am
The Only Sure Thing is Percona Powerball Pool

Everyone is talking about the upcoming Powerball lottery draw. 1.4 BILLION dollars!! And growing! Millions of people are imagining what they would do IF they win. It’s the stuff of dreams.

That is literally true. The chances of winning the Powerball Lottery are 1 in 292.2 million. Or roughly speaking, the chances of picking the right combination of numbers is like flipping a coin and getting heads 28 times in a row. You’re more likely to get struck by lightning (twice) or bitten by a shark.

Sorry.

You know what is a sure thing? Percona’s ability to optimize your database performance and increase application performance. Our Support and Percona Care consultants will give you a 1 in 1 chance of making your database run better, solving your data performance issues, and improving the performance of your applications.

However, in the spirit of the moment, Percona has bought 10 sets of Powerball numbers and posted them on Facebook, Twitter and LinkedIn. It’s the Percona Powerball Pool! Like any of the posts and share it, and you qualify for one (1) equal share of the winnings! Use #perconapowerball when you share.

Here are the numbers:

We at Percona can’t promise a huge Powerball windfall (in fact, as data experts we’re pretty sure you won’t win!), but we can promise that our consultants are experts at helping you with your full LAMP stack environments. Anything affecting your data performance – on that we can guarantee you a win!

Full rules are here.

Percona Server 5.6.28-76.1 is now available

January 12, 2016 - 6:30am

Percona is glad to announce the release of Percona Server 5.6.28-76.1 on January 12, 2016. Download the latest version from the Percona web site or from the Percona Software Repositories.

Based on MySQL 5.6.28, including all the bug fixes in it, Percona Server 5.6.28-76.1 is the current GA release in the Percona Server 5.6 series. Percona Server is open-source and free – and this is the latest release of our enhanced, drop-in replacement for MySQL. Complete details of this release can be found in the 5.6.28-76.1 milestone on Launchpad.

Bugs Fixed:

  • Clustering secondary index could not be created on a partitioned TokuDB table. Bug fixed #1527730 (DB-720).
  • When enabled, super-read-only option could break statement-based replication while executing a multi-table update statement on a slave. Bug fixed #1441259.
  • Running OPTIMIZE TABLE or ALTER TABLE without the ENGINE clause would silently change table engine if enforce_storage_engine variable was active. This could also result in system tables being changed to incompatible storage engines, breaking server operation. Bug fixed #1488055.
  • Setting the innodb_sched_priority_purge variable (available only in debug builds) while purge threads were stopped would cause a server crash. Bug fixed #1368552.
  • Small buffer pool size could cause XtraDB buffer flush thread to spin at 100% CPU. Bug fixed #1433432.
  • Enabling TokuDB with the ps_tokudb_admin script inside a Docker container would cause an error due to insufficient privileges, even when running as root. In order for this script to be used inside Docker containers, this error has been changed to a warning that the check is impossible. Bug fixed #1520890.
  • InnoDB status would print negative values for spin rounds per wait if the wait count, although accounted as a signed 64-bit integer, did not fit into a signed 32-bit integer. Bug fixed #1527160 (upstream #79703).

Other bugs fixed: #1384595 (upstream #74579), #1384658 (upstream #74619), #1471141 (upstream #77705), #1179451, #1524763 and #1530102.

Release notes for Percona Server 5.6.28-76.1 are available in the online documentation. Please report any bugs on the launchpad bug tracker.

Percona Server 5.5.47-37.7 is now available

January 12, 2016 - 6:10am


Percona is glad to announce the release of Percona Server 5.5.47-37.7 on January 12, 2016. Based on MySQL 5.5.47, including all the bug fixes in it, Percona Server 5.5.47-37.7 is now the current stable release in the 5.5 series.

Percona Server is open-source and free. Details of the release can be found in the 5.5.47-37.7 milestone on Launchpad. Downloads are available here and from the Percona Software Repositories.

Bugs Fixed:

  • Running OPTIMIZE TABLE or ALTER TABLE without the ENGINE clause would silently change table engine if enforce_storage_engine variable was active. This could also result in system tables being changed to incompatible storage engines, breaking server operation. Bug fixed #1488055.

Other bugs fixed: #1179451, #1524763, and #1530102.

Release notes for Percona Server 5.5.47-37.7 are available in our online documentation. Bugs can be reported on the launchpad bug tracker.

Percona Monitoring Plugins 1.1.6 release

January 11, 2016 - 10:01pm

Percona is glad to announce the release of Percona Monitoring Plugins 1.1.6.

Changelog:

  • Added new RDS instance classes to RDS scripts.
  • Added boto profile support to RDS scripts.
  • Added AWS region support and ability to specify all regions to RDS scripts.
  • Added ability to set AWS region and boto profile on data source level in Cacti.
  • Added period, average time and debug options to pmp-check-aws-rds.py.
  • Added ability to override Nginx server status URL path on data source level in Cacti.
  • Made Memcached and Redis host configurable for Cacti script.
  • Added the ability to lookup the master’s server_id when using pt-heartbeat with pmp-check-mysql-replication-delay.
  • Changed how memory stats are collected by the Cacti script and pmp-check-unix-memory: /proc/meminfo is now parsed instead of running the free command. This also fixes pmp-check-unix-memory for EL7.
  • Set default MySQL connect timeout to 5s for the Cacti script. Can be overridden in the config.
  • Fixed innodb transactions count on the Cacti graph for MySQL 5.6 and higher.
  • Fixed --login-path option in Nagios scripts when using it along with other credential options.

Thanks to contributors: David Andruczyk, Denis Baklikov, Mischa ter Smitten, Mitch Hagstrand.

The project is now fully hosted on GitHub, including issues; the Launchpad project is discontinued.

A new tarball is available from the downloads area or in packages from our software repositories. The plugins are fully supported for customers with a Percona Support contract, and free installation services are provided as part of some contracts. You can find links to the documentation, forums and more at the project homepage.

About Percona Monitoring Plugins
Percona Monitoring Plugins are monitoring and graphing components designed to integrate seamlessly with widely deployed solutions such as Nagios, Cacti and Zabbix.

Bare-metal servers for button-push database-as-a-service

January 11, 2016 - 9:50am

Enterprises demand flexibility, scalability and efficiency to keep up with the demands of their customers – while maintaining the bottom line. To solve this, they’re turning to cloud infrastructure services to both cut costs and take advantage of cutting-edge technology innovations. Clouds have brought simplicity and ease of use to infrastructure management. However, with this ease of use often comes some sacrifice: namely, performance.

Performance degradation often stems from the introduction of virtualization and a hypervisor layer. While the hypervisor enables the flexibility and management capabilities needed to orchestrate multiple virtual machines on a single box, it also creates additional processing overhead.

Regardless, cloud servers also have huge advantages: they deploy at lightning speed and enable hassle-free private networking without the need for a private VLAN from the datacenter. They also allow the customer near instantaneous scalability without the burden of risky capital expenditures.

Bare-metal servers are one solution to this trade-off. A bare metal server is all about plain hardware. It is a single-tenant physical server that is completely dedicated to a single data-intensive workload. It prioritizes performance and reliability. A bare-metal server provides a way to enable cloud services that eliminates the overhead of virtualization, but retains the flexibility, scalability and efficiency.

On certain CPU-bound workloads, bare metal servers can outperform a cloud server of the same configuration by four times. Database management systems, being very sensitive to both CPU performance and IO speed, can obviously benefit from access to a bare metal environment.

Combine a bare metal server accessible via a cloud service with a high performance MySQL solution and you get all benefits of the cloud without sacrificing performance. This is an ideal solution for startups, side projects or even production applications.

In fact this is just what we’ve done with a partnership between Percona and Servers.com, where you can automatically provision Percona Server for MySQL on one of their bare metal servers. You can learn more about this service here.

MongoDB revs you up: What storage engine is right for you? (Part 2)

January 11, 2016 - 9:48am
Differentiating Between MongoDB Storage Engines: WiredTiger

In our last post, we discussed what a storage engine is, and how you can determine the characteristics of one versus the other. From that post:

“A database storage engine is the underlying software that a DBMS uses to create, read, update and delete data from a database. The storage engine should be thought of as a “bolt on” to the database (server daemon), which controls the database’s interaction with memory and storage subsystems.”

Check out the full post here.

Generally speaking, it’s important to understand what type of work environment the database is going to interact with, and to select a storage engine that is tailored to that environment.

The last post looked at MMAPv1, the original default engine for MongoDB (through release 3.0). This post will examine the new default MongoDB engine: WiredTiger.

WiredTiger

Find it in: MongoDB or Percona builds

MongoDB, Inc. introduced WiredTiger in MongoDB v3.0 to provide document-level concurrency control for write operations. As a result, multiple clients can now modify different documents of a collection at the same time. WiredTiger in MongoDB currently supports only B-trees as the data structure; the engine itself can also use LSM trees, but that capability is not implemented in the MongoDB version of the engine.

WiredTiger has a few interesting features, most notably compression, document-level locking, and index prefix compression. B-trees, due to their rigidity in disk interaction and chattiness with storage, are not typically known for their performance when used with compression. However, WiredTiger has done an excellent job of maintaining good performance with compression and gives a decent performance/compression ratio with the “snappy” compression algorithm. Be that as it may, if deeper compression is necessary, you may want to evaluate another storage engine. Index prefix compression is a unique feature that should improve the usefulness of the cache by decreasing the size of indexes in memory (especially very repetitive indexes).
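As a rough illustration of selecting the engine and its compression at startup (flag names per the MongoDB 3.x mongod options; verify against your version, and treat zlib as an assumed alternative when deeper compression matters more than CPU):

mongod --storageEngine wiredTiger --wiredTigerCollectionBlockCompressor snappy
mongod --storageEngine wiredTiger --wiredTigerCollectionBlockCompressor zlib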

WiredTiger’s ideal use cases include data that are likely to stay within a few multiples of cache size. One can also expect good performance from TTL-like workloads, especially when data is within the limit previously mentioned.

Conclusion

Most people don’t know that they have a choice when it comes to storage engines, and that the choice should be based on what the database workload will look like. Percona’s Vadim Tkachenko performed an excellent benchmark test comparing the performances of RocksDB, PerconaFT and WiredTiger to help specifically differentiate between these engines.

In the next post, we’ll take a closer look at Percona’s MongoDB storage engine: PerconaFT.

 

Percona XtraDB Cluster 5.6.27-25.13 is now available

January 11, 2016 - 4:16am

Percona is glad to announce the new release of Percona XtraDB Cluster 5.6 on January 11, 2016. Binaries are available from the downloads area or from our software repositories.

Percona XtraDB Cluster 5.6.27-25.13 is now the current release, based on the following:

All Percona software is open-source and free, and all the details of the release can be found in the 5.6.26-25.12 milestone at Launchpad.

For more information about relevant Codership releases, see this announcement.

NOTE: Due to new dependency on libnuma1 package in Debian/Ubuntu, please run one of the following commands to upgrade the percona-xtradb-cluster-server-56 package:

  • aptitude safe-upgrade
  • apt-get dist-upgrade
  • apt-get install percona-xtradb-cluster-server-5.6

New Features:

  • There is a new script for building Percona XtraDB Cluster from source. For more information, see Compiling and Installing from Source Code.
  • wsrep_on is now a session-only variable, meaning that toggling it will not affect other clients connected to the node – only the session/client modifying it is affected. Trying to toggle wsrep_on in the middle of a transaction will now result in an error. A transaction captures the state of wsrep_on when it starts (that is, when the first data-changing statement is executed within the transaction context) and continues to use it; see the sketch below.
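A minimal SQL sketch of the session-only behavior (the table is hypothetical, and exact error details may differ by version):

SET SESSION wsrep_on = OFF;   -- affects only this client session, not other clients on the node
INSERT INTO t1 VALUES (1);    -- writes in this session are not replicated while wsrep_on is OFF
SET SESSION wsrep_on = ON;    -- replication resumes for this session
-- toggling wsrep_on inside an open transaction now returns an error instead of taking effect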

Bugs Fixed:

  • #1261688 and #1292842: Fixed race condition when two skipped replication transactions were rolled back, which caused [ERROR] WSREP: FSM: no such a transition ROLLED_BACK ->ROLLED_BACK with LOAD DATA INFILE
  • #1362830: Corrected xtrabackup-v2 script to consider only the last specified log_bin directive in my.cnf. Multiple log_bin directives caused SST to fail.
  • #1370532: Toggling wsrep_desync while node is paused is now blocked.
  • #1404168: Removed support for innodb_fake_changes variable.
  • #1455098: Fixed failure of LOAD DATA INFILE on a partitioned table. This was caused by the partitioned table handler disabling binary logging, so the native handler (InnoDB) failed to generate the needed binlog events, eventually causing the statement's replication to be skipped.
  • #1503349: garbd now uses default port number if it is not specified in sysconfig.
  • #1505184: Corrected wsrep_sst_auth script to ensure that user name and password for SST is passed to XtraBackup through internal command-line invocation. ps -ef doesn’t list these credentials so passing it internally is fine, too.
  • #1520491: FLUSH TABLES statements are no longer replicated, because replicating them led to a deadlock error (an upstream fix for this is pending). This change also takes care of the original fix to avoid incrementing the local GTID.
  • #1528020: Fixed async slave thread failure caused by redundant updates of mysql.event table with the same value. Redundant updates are now avoided and will not be bin-logged.
  • Fixed garbd init script causing new UUIDs to be generated every time it runs. This error was due to missing base_dir configuration when garbd didn't have write access to the current working directory. garbd will now try to use the current working directory, then /var/lib/galera (like most Linux daemons); if it fails to use or create /var/lib/galera, it will throw a fatal error.
  • Fixed replication of DROP TABLE statements with a mix of temporary and non-temporary tables (for example, DROP TABLE temp_t1, non_temp_t2), which caused an erroneous DROP TEMPORARY TABLE statement on the replicated node. This is corrected by detecting such scenarios and creating the temporary table on the replicated node, which is then dropped by the follow-up DROP statement. All of this workload should be part of the same unit, as temporary tables are session-specific.
  • Fixed error when a wsrep_cluster_name value over 32 characters long caused the gmcast message to exceed the maximum length. A limit of 32 characters is now imposed on wsrep_cluster_name.
  • Added code to properly handle default values for wsrep_* variables, which caused an error/crash.
  • Fixed error when a CREATE TABLE AS SELECT (CTAS) statement still tried to certify a transaction on a table without primary key even if certification of tables without primary key was disabled. This error was caused by CTAS setting trx_id (fake_trx_id) to execute SELECT and failing to reset it back to -1 during INSERT as certification is disabled.
  • Fixed crashing of INSERT .... SELECT for MyISAM table with wsrep_replicate_myisam set to ON. This was caused by TOI being invoked twice when source and destination tables were MyISAM.
  • Fixed crash when caching write-set data beyond configured limit. This was caused by TOI flow failing to consider/check error resulting from limit enforcement.
  • Fixed error when loading a MyISAM table from a schema temporary table (with wsrep_replicate_myisam set to ON). This was caused by the temporary table lookup being done using get_table_name(), which can be misleading because table_name for temporary tables is set to a temporarily generated name; the original name of the table is part of table_alias. The fix corrects the condition to consider both table_name and alias_name.
  • Fixed error when changing wsrep_provider in the middle of a transaction or as part of a procedure/trigger. This is now blocked to avoid inconsistency.
  • Fixed TOI state inconsistency caused by DELAYED_INSERT on MyISAM table (TOI_END was not called). Now the DELAYED_ qualifier will be ignored and statement will be interpreted as normal INSERT.
  • Corrected locking semantics for FLUSH TABLES WITH READ LOCK (FTWRL). It now avoids freeing an inherited lock if a follow-up FLUSH TABLES statement fails, and only frees self-acquired locks.
  • Fixed crash caused by GET_LOCK + wsrep_drupal_282555_workaround. GET_LOCK path failed to free all instances of user-level locks after it inherited multiple-user-locks from Percona Server. The cleanup code now removes all possible references of locks.
  • Fixed cluster node getting stuck in Donor/Desync state after a hard recovery, because of an erroneous type cast in source code.
  • Corrected DDL and DML semantics for MyISAM:
    • DDL (CREATE/DROP/TRUNCATE) on MyISAM will be replicated irrespective of the wsrep_replicate_myisam value
    • DML (INSERT/UPDATE/DELETE) on MyISAM will be replicated only if wsrep_replicate_myisam is enabled
    • SST will get full transfer irrespective of wsrep_replicate_myisam value (it will get MyISAM tables from donor if any)
    • A difference in enforce_storage_engine configuration between cluster nodes may result in a different engine being picked for the same table on different nodes
    • CREATE TABLE AS SELECT (CTAS) statements use non-TOI replication and are replicated only if there is involvement of InnoDB table that needs trx (involvement of MyISAM table will cause CTAS statement to skip replication)

Known Issues:

  • 1330941: Conflict between wsrep_OSU_method set to RSU and wsrep_desync set to ON was not considered a bug.
  • 1443755: Causal reads introduces surprising latency in single node clusters.
  • 1522385: Holes are introduced in Master-Slave GTID eco-system on replicated nodes if any of the cluster nodes are acting as asynchronous slaves to an independent master.
  • SST fails with innodb_data_home_dir/innodb_log_home_dir. This is a bug in Percona XtraBackup. It should be fixed in the next 2.3.2 release. Until then, please use 2.2.12 that doesn’t have this issue.
  • Enabling wsrep_desync (from previous OFF state) will wait until previous wsrep_desync=OFF operation is completed.

Help us improve our software quality by reporting any bugs you encounter using our bug tracking system. As always, thanks for your continued support of Percona!

ordering_operation: EXPLAIN FORMAT=JSON knows everything about ORDER BY processing

January 8, 2016 - 10:25pm

We’ve already discussed using the ORDER BY clause with subqueries. You can also, however, use the ORDER BY clause to sort the results by one of the columns. Actually, this is the most common way to use this clause.

Sometimes such queries require temporary tables or a filesort, and a regular EXPLAIN provides this information. But it doesn't show whether this work is needed for ORDER BY or for optimizing another part of the query.

For example, if we take a pretty simple query ( select distinct last_name from employees order by last_name asc) and run EXPLAIN  on it, we can see that both the temporary table and filesort were used. However, we can’t identify if these were applied to DISTINCT, or to ORDER BY, or to any other part of the query.

mysql> explain select distinct last_name from employees order by last_name asc\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: employees
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 299379
     filtered: 100.00
        Extra: Using temporary; Using filesort
1 row in set, 1 warning (0.00 sec)
Note (Code 1003): /* select#1 */ select distinct `employees`.`employees`.`last_name` AS `last_name` from `employees`.`employees` order by `employees`.`employees`.`last_name`

EXPLAIN FORMAT=JSON tells us exactly what happened:

mysql> explain format=json select distinct last_name from employees order by last_name ascG *************************** 1. row *************************** EXPLAIN: { "query_block": { "select_id": 1, "cost_info": { "query_cost": "360183.80" }, "ordering_operation": { "using_filesort": false, "duplicates_removal": { "using_temporary_table": true, "using_filesort": true, "cost_info": { "sort_cost": "299379.00" }, "table": { "table_name": "employees", "access_type": "ALL", "rows_examined_per_scan": 299379, "rows_produced_per_join": 299379, "filtered": "100.00", "cost_info": { "read_cost": "929.00", "eval_cost": "59875.80", "prefix_cost": "60804.80", "data_read_per_join": "13M" }, "used_columns": [ "emp_no", "last_name" ] } } } } } 1 row in set, 1 warning (0.00 sec) Note (Code 1003): /* select#1 */ select distinct `employees`.`employees`.`last_name` AS `last_name` from `employees`.`employees` order by `employees`.`employees`.`last_name`

In the output above you can see that ordering_operation does not use filesort:

"ordering_operation": { "using_filesort": false,

But DISTINCT does:

"duplicates_removal": { "using_temporary_table": true, "using_filesort": true,

If we remove the DISTINCT clause, we will find that ORDER BY started using filesort, but does not need to create a temporary table:

mysql> explain format=json select last_name from employees order by last_name ascG *************************** 1. row *************************** EXPLAIN: { "query_block": { "select_id": 1, "cost_info": { "query_cost": "360183.80" }, "ordering_operation": { "using_filesort": true, "cost_info": { "sort_cost": "299379.00" }, <rest of the output skipped>

This means that in the case of the first query, a sorting operation proceeded in parallel with the duplicate keys removal.

Conclusion: EXPLAIN FORMAT=JSON  provides details about ORDER BY  optimization which cannot be seen with a regular EXPLAIN operation.

Apache Spark with Air ontime performance data

January 7, 2016 - 5:28pm

There is a growing interest in Apache Spark, so I wanted to play with it (especially after Alexander Rubin’s Using Apache Spark post).

To start, I used the recently released Apache Spark 1.6.0 for this experiment, and I will play with “Airlines On-Time Performance” database from
http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time. You can find the scripts I used here https://github.com/Percona-Lab/ontime-airline-performance. The uncompressed dataset is about 70GB, which is not really that huge overall, but quite convenient to play with.

As a first step, I converted it to Parquet format. It’s a column based format, suitable for parallel processing, and it supports partitioning.

The script I used was the following:

# bin/spark-shell --packages com.databricks:spark-csv_2.11:1.3.0
val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("/data/opt/otp/On_Time_On_Time_Performance_*.csv")
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
df.write.partitionBy("Year").parquet("/data/flash/spark/otp")

Conveniently, by using just two commands (three if you count setting the compression codec, “snappy” in this case) we can convert ALL of the .csv files into Parquet (doing it in parallel).

The datasize after compression is only 3.5GB, which is a quite impressive compression factor of 20x. I’m guessing the column format with repeatable data allows this compression level.

In general, Apache Spark makes it very easy to handle the Extract, Transform and Load (ETL) process.

Another one of Spark’s attractive features is that it automatically uses all CPU cores and executes complex queries in parallel (something MySQL still can’t do). So I wanted to understand how fast it can execute a query compared to MySQL, and how efficient it is in using multiple cores.

For this I decided to use a query such as:
"SELECT avg(cnt) FROM (SELECT Year,Month,COUNT(*) FROM otp WHERE DepDel15=1 GROUP BY Year,Month) t1"

Which translates to the following Spark DataFrame manipulation:

(pFile.filter("DepDel15=1").groupBy("Year","Month").count()).agg(avg("count")).show()

I should note that Spark is perfectly capable of executing this as a SQL query, but I wanted to learn more about DataFrame manipulation.
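For reference, a rough sketch of the equivalent SQL path in Spark 1.6, using the pFile DataFrame defined in the full script below (the temporary table name is an assumption):

pFile.registerTempTable("otp")
sqlContext.sql("SELECT avg(cnt) FROM (SELECT Year, Month, COUNT(*) AS cnt FROM otp WHERE DepDel15=1 GROUP BY Year, Month) t1").show()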

The full script I executed is:

val pFile = sqlContext.read.parquet("/mnt/i3600/spark/otp1").cache();
for( a <- 1 to 6){
  println("Try: " +a )
  val t1=System.currentTimeMillis;
  (pFile.filter("DepDel15=1").groupBy("Year","Month").count()).agg(avg("count")).show();
  val t2=System.currentTimeMillis;
  println("Try: "+a+" Time: " + (t2-t1))
}
exit

And I used the following command line to call the script:

for i in `seq 2 2 48` ; do bin/spark-shell --executor-cores $i -i run.scala | tee -a $i.schema.res ; done

which basically tells it to use from 2 to 48 cores (the server I use has 48 CPU cores) in steps of two.

I executed this same query six times. The first time is a cold run, and data is read from the disk. The rest are hot runs, and the query should be executed from memory (this server has 128GB of RAM, and I allocated 100GB to the Spark executor).

I measured the execution time in cold and hot runs, and how it changed as more cores were added.

There was a lot of variance in the execution time of the hot runs, so I show all the results to demonstrate any trends.

Cold runs:

More cores seem to help, but after a certain point – not so much.

Hot runs:

The best execution time was when 14-22 cores were used. Adding more cores after that, seems to actually make things worse. I would guess that the datasize is small enough so that the communication and coordination overhead cost exceeded the benefits of more parallel processing cores.

Comparing to MySQL

Just to have some points for comparison, I executed the same query in MySQL 5.7 using the following table schema: https://github.com/Percona-Lab/ontime-airline-performance/blob/master/mysql/create_table.sql

The hot execution time for the same query in MySQL (MySQL can use only one CPU core to execute one query) is 350 seconds (or 350,000ms to compare with the data on charts) when using the table without indexes. This is about 11 times worse than the best execution time in Spark.

If we use a small trick and create a covering index in MySQL designed for this query:

"ALTER TABLE ontime ADD KEY (Year,Month,DepDel15)"

then we can improve the execution time to 90 seconds. This is still worse than Spark, but the difference is not as big. We can't, however, create an index for each ad-hoc query, while Spark is capable of processing a variety of queries.
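For completeness, the MySQL form of the query I timed looks roughly like this (table name per the schema linked above; the cnt alias is spelled out for clarity):

SELECT AVG(cnt)
FROM (SELECT Year, Month, COUNT(*) AS cnt
      FROM ontime
      WHERE DepDel15 = 1
      GROUP BY Year, Month) t1;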

In conclusion, I can say that Spark is indeed an attractive option for data analytics queries (and in fact it can do much more). It is worth keeping in mind, however, that in this experiment it did not scale well with multiple CPU cores. I wonder if the same problem appears when we use multiple server nodes.

If you have recommendations on how I can improve the results, please post them in the comments.

Spark configuration I used (in Standalone cluster setup):

export MASTER=spark://`hostname`:7077
export SPARK_MEM=100g
export SPARK_DAEMON_MEMORY=2g
export SPARK_LOCAL_DIRS=/mnt/i3600/spark/tmp
export SPARK_WORKER_DIR=/mnt/i3600/spark/tmp

MongoDB revs you up: What storage engine is right for you? (Part 1)

January 6, 2016 - 10:31am
Differentiating Between MongoDB Storage Engines

The tremendous data growth of the last decade has affected almost all aspects of applications and application use. Since nearly all applications interact with a database at some point, this means databases needed to adapt to the change in usage conditions as well. Database technology has grown significantly in the last decade to meet the needs of constantly changing applications. Enterprises often need to scale, modify, or replace their databases in order to meet new business demands.

Within a database management system (DBMS), there are many levels that can affect performance, including the choice of your database storage engine. Surprisingly, many enterprises don’t know they have a choice of storage engines, or that specific storage engine types are architected to handle specific scenarios. Often the best option depends on what function the database in question is designed to fulfill.

With Percona’s acquisition of Tokutek, we’ve moved from a mostly-MySQL company to having several MongoDB-based software options available.

MongoDB is a cross-platform, NoSQL, document-oriented database. It doesn’t use the traditional table-based relational database structure, and instead employs JSON-type documents with dynamic schemas. The intention is making the integration of certain application data types easier and faster.

This blog (the first in a series) will briefly review some of the available options for a MongoDB database storage engine, and the pros and cons of each. Hopefully it will help database administrators, IT staff, and enterprises realize that when it comes to MongoDB, you aren’t limited to a single storage engine choice.

What is a Storage Engine?

A database storage engine is the underlying software that a DBMS uses to create, read, update and delete data from a database. The storage engine should be thought of as a “bolt on” to the database (server daemon), which controls the database’s interaction with memory and storage subsystems. Thus, the storage engine is not actually the database, but a service that the database consumes for the storage and retrieval of information. Given that the storage engine is responsible for managing the information stored in the database, it greatly affects the overall performance of the database (or lack thereof, if the wrong engine is chosen).

Most storage engines are organized using one of the following structures: a Log-Structured Merge (LSM) tree, B-Tree or Fractal tree.

  • LSM Tree. An LSM tree has performance characteristics that make it attractive for providing indexed access to files with high insert volume. LSM trees seek to provide the excellent insertion performance of log type storage engines, while minimizing the impact of searches in a data structure that is “sorted” strictly on insertion order. LSMs buffer inserts, updates and deletes by using layers of logs that increase in size, and then get merged in sorted order to increase the efficiency of searches.
  • B-Tree. B-Trees are the most commonly implemented data structure in databases. Having been around since the early 1970’s, they are one of the most time-tested storage engine “methodologies.” B-Trees method of data maintenance makes searches very efficient. However, the need to maintain a well-ordered data structure can have a detrimental effect on insertion performance.
  • Fractal Tree. A Fractal Tree index is a tree data structure much like that of a B-tree (designed for efficient searches), but also ingests data into log-like structures for efficient memory usage in order to facilitate high-insertion performance. Fractal Trees were designed to ingest data at high rates of speed in order to interact efficiently with the storage for high bandwidth applications.

Fractal Trees and the LSM trees sound very similar. The main differentiating factor, however, is the manner in which they sort the data into the tree for efficient searches. LSM trees merge data into a tree from a series of logs as the logs fill up. Fractal Trees sort data into log-like structures (message buffers) along the proper data path in the tree.

What storage engine is best?

That question is not a simple one. In order to decide which engine to choose, it's necessary to determine the core functionality provided by each engine. Core functionality can generally be aggregated into three areas:

  • Locking types. Locking within database engines defines how access and updates to information are controlled. When an object in the database is locked for updating, other processes cannot modify (or in some cases read) the data until the update has completed. Locking not only affects how many different applications can update the information in the database, it can also affect queries on that data. It is important to monitor how queries access data, as the data could be altered or updated as it is being accessed. In general, such delays are minimal. The bulk of the locking mechanism is devoted to preventing multiple processes updating the same data. Since both additions (INSERT statements) and alterations (UPDATE statements) to the data require locking, you can imagine that multiple applications using the same database can have a significant impact. Thus, the “granularity” of the locking mechanism can drastically affect the throughput of the database in “multi-user” (or “highly-concurrent”) environments.
  • Indexing. The indexing method can dramatically increase database performance when searching and recovering data. Different storage engines provide different indexing techniques, and some may be better suited for the type of data you are storing. Typically, every index defined on a collection is another data structure of the particular type the engine uses (B-tree for WiredTiger, Fractal Tree for PerconaFT, and so forth). The efficiency of that data structure in relation to your workload is very important. An easy way of thinking about it is viewing every extra index as having performance overhead. A data structure that is write-optimized will have lower overhead for every index in a high-insert application environment than a non-write optimized data structure would. For use cases that require a large number of indexes, choosing an appropriate storage engine can have a dramatic impact.
  • Transactions. Transactions provide data reliability during the update or insert of information by enabling you to add data to the database, but only to commit that data when other conditions and stages in the application execution have completed successfully. For example, when transferring information (like a monetary credit) from one account to another, you would use transactions to ensure that both the debit from one account and the credit to the other completed successfully. Often, you will hear this referred to as “atomicity.” This means the operations that are bundled together are an immutable unit: either all operations complete successfully, or none do. Despite the ability of RocksDB, PerconaFT and WiredTiger to support transactions, as of version 3.2 this functionality is not available in the MongoDB storage engine API. Multi-document transactions cannot be used in MongoDB. However, atomicity can be achieved at the single document level. According to statements from MongoDB, Inc., multi-document transactions will be supported in the future, but a firm date has not been set as of this writing.

Now that we’ve established a general framework, we’ll move onto discussing engines. For the first blog in this series, we’ll look at MMAPv1 (the default storage engine that comes with MongoDB up until the release 3.0).

MMAPv1

Find it in: MongoDB or Percona builds

MMAPv1 is MongoDB's original storage engine, and was the default engine in MongoDB 3.0 and earlier. It is a B-tree based system that offloads much of the work of storage interaction and memory management to the operating system, and it is based on memory-mapped files.

The MMAP storage engine uses a process called “record allocation” to grab disk space for document storage. All records are contiguously located on disk, and when a document becomes larger than its allocated record, a new record must be allocated. New allocations require moving the document and updating all indexes that refer to it, which takes more time than in-place updates and leads to storage fragmentation. Furthermore, MMAPv1 in its current iterations usually leads to high space utilization on your filesystem, due to over-allocation of record space and its lack of support for compression.

As mentioned previously, a storage engine’s locking scheme is one of the most important factors in overall database performance. MMAPv1 has collection-level locking – meaning only one insert, update or delete operation can use a collection at a time. This type of locking scheme creates a very common scenario in concurrent workloads, where update/delete/insert operations are always waiting for the operation(s) in front of them to complete. Furthermore, oftentimes those operations are flowing in more quickly than they can be completed in serial fashion by the storage engine. To put it in context, imagine a giant supermarket on Sunday afternoon that only has one checkout line open: plenty of customers, but low throughput!

Given the storage engine choices brought about by the storage engine API in MongoDB 3.0, it is hard to imagine an application that demands the MMAPv1 storage engine for optimized performance. If you read between the lines, you could conclude that MongoDB, Inc. would agree, given that the default engine was switched to WiredTiger in v3.2.
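For reference, picking an engine explicitly at startup is a single flag (a minimal sketch; verify the flag values against your MongoDB 3.x version):

mongod --storageEngine mmapv1        # the engine discussed in this post
mongod --storageEngine wiredTiger    # the new 3.2 default, covered in the next post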

Conclusion

Most people don’t know that they have a choice when it comes to storage engines, and that the choice should be based on what the database workload will look like. Percona’s Vadim Tkachenko performed an excellent benchmark test comparing the performances of RocksDB, PerconaFT and WiredTiger to help specifically differentiate between these engines.

In the next post, we’ll examine the ins and outs of MongoDB’s new default storage engine, WiredTiger.

 

 


grouping_operation, duplicates_removal: EXPLAIN FORMAT=JSON has all details about GROUP BY

January 4, 2016 - 5:45pm

In the previous EXPLAIN FORMAT=JSON is Cool! series blog post, we discussed the group_by_subqueries member (which is a child of grouping_operation). Let’s now focus on grouping_operation and the other details of GROUP BY processing.

grouping_operation simply shows the details of what happens when the GROUP BY clause is run:

mysql> explain format=json select dept_no from dept_emp group by dept_noG *************************** 1. row *************************** EXPLAIN: { "query_block": { "select_id": 1, "cost_info": { "query_cost": "14.40" }, "grouping_operation": { "using_filesort": false, "table": { "table_name": "dept_emp", "access_type": "range", "possible_keys": [ "PRIMARY", "emp_no", "dept_no" ], "key": "dept_no", "used_key_parts": [ "dept_no" ], "key_length": "4", "rows_examined_per_scan": 9, "rows_produced_per_join": 9, "filtered": "100.00", "using_index_for_group_by": true, "cost_info": { "read_cost": "12.60", "eval_cost": "1.80", "prefix_cost": "14.40", "data_read_per_join": "144" }, "used_columns": [ "emp_no", "dept_no" ] } } } } 1 row in set, 1 warning (0.00 sec) Note (Code 1003): /* select#1 */ select `employees`.`dept_emp`.`dept_no` AS `dept_no` from `employees`.`dept_emp` group by `employees`.`dept_emp`.`dept_no`

In the listing above, you can see which table was accessed by the GROUP BY operation, the access type, and if an index for GROUP BY was used.

In the case of a simple JOIN of two tables, grouping_operation is usually the parent of the nested_loop object (which provides details on how the JOIN proceeded):

mysql> explain format=json select de.dept_no, count(dm.emp_no) from dept_emp de join dept_manager dm using(emp_no) group by de.dept_noG *************************** 1. row *************************** EXPLAIN: { "query_block": { "select_id": 1, "cost_info": { "query_cost": "61.50" }, "grouping_operation": { "using_temporary_table": true, "using_filesort": true, "cost_info": { "sort_cost": "26.41" }, "nested_loop": [ { "table": { "table_name": "dm", "access_type": "index", "possible_keys": [ "PRIMARY", "emp_no" ], "key": "emp_no", "used_key_parts": [ "emp_no" ], "key_length": "4", "rows_examined_per_scan": 24, "rows_produced_per_join": 24, "filtered": "100.00", "using_index": true, "cost_info": { "read_cost": "1.00", "eval_cost": "4.80", "prefix_cost": "5.80", "data_read_per_join": "384" }, "used_columns": [ "dept_no", "emp_no" ] } }, { "table": { "table_name": "de", "access_type": "ref", "possible_keys": [ "PRIMARY", "emp_no", "dept_no" ], "key": "emp_no", "used_key_parts": [ "emp_no" ], "key_length": "4", "ref": [ "employees.dm.emp_no" ], "rows_examined_per_scan": 1, "rows_produced_per_join": 26, "filtered": "100.00", "using_index": true, "cost_info": { "read_cost": "24.00", "eval_cost": "5.28", "prefix_cost": "35.09", "data_read_per_join": "422" }, "used_columns": [ "emp_no", "dept_no" ] } } ] } } } 1 row in set, 1 warning (0.00 sec) Note (Code 1003): /* select#1 */ select `employees`.`de`.`dept_no` AS `dept_no`,count(`employees`.`dm`.`emp_no`) AS `count(dm.emp_no)` from `employees`.`dept_emp` `de` join `employees`.`dept_manager` `dm` where (`employees`.`de`.`emp_no` = `employees`.`dm`.`emp_no`) group by `employees`.`de`.`dept_no`

Surprisingly, while many DISTINCT queries can be converted into equivalent queries using the GROUP BY clause, there is a separate member (duplicates_removal) for processing DISTINCT. Let’s see how it works with a simple query that performs the same job as the first one in this blog post:

mysql> explain format=json select distinct dept_no from dept_empG *************************** 1. row *************************** EXPLAIN: { "query_block": { "select_id": 1, "cost_info": { "query_cost": "14.40" }, "duplicates_removal": { "using_filesort": false, "table": { "table_name": "dept_emp", "access_type": "range", "possible_keys": [ "PRIMARY", "emp_no", "dept_no" ], "key": "dept_no", "used_key_parts": [ "dept_no" ], "key_length": "4", "rows_examined_per_scan": 9, "rows_produced_per_join": 9, "filtered": "100.00", "using_index_for_group_by": true, "cost_info": { "read_cost": "12.60", "eval_cost": "1.80", "prefix_cost": "14.40", "data_read_per_join": "144" }, "used_columns": [ "emp_no", "dept_no" ] } } } } 1 row in set, 1 warning (0.00 sec) Note (Code 1003): /* select#1 */ select distinct `employees`.`dept_emp`.`dept_no` AS `dept_no` from `employees`.`dept_emp`

You can see that the plan is almost the same, but the parent element for the plan is duplicates_removal.

The difference between these members becomes clear if we change the second, more complicated query to use DISTINCT in place of GROUP BY:

mysql> explain format=json select distinct de.dept_no, count(dm.emp_no) from dept_emp de join dept_manager dm using(emp_no)G ERROR 1140 (42000): In aggregated query without GROUP BY, expression #1 of SELECT list contains nonaggregated column 'employees.de.dept_no'; this is incompatible with sql_mode=only_full_group_by

This example shows that DISTINCT is not exactly the same as GROUP BY; the two can be used together if, for example, we want to count the number of managers in each department, grouped by the year in which the manager started working in the department. In this case, however, we are interested only in unique pairs in the result and don’t want to see duplicates. Duplicates would appear if one person managed the same department for more than two years.

mysql> explain format=json select distinct de.dept_no, count(dm.emp_no) from dept_emp de join dept_manager dm using(emp_no) group by de.dept_no, year(de.from_date)G *************************** 1. row *************************** EXPLAIN: { "query_block": { "select_id": 1, "cost_info": { "query_cost": "61.63" }, "duplicates_removal": { "using_temporary_table": true, "using_filesort": false, "grouping_operation": { "using_temporary_table": true, "using_filesort": true, "cost_info": { "sort_cost": "26.53" }, "nested_loop": [ { "table": { "table_name": "dm", "access_type": "index", "possible_keys": [ "PRIMARY", "emp_no" ], "key": "emp_no", "used_key_parts": [ "emp_no" ], "key_length": "4", "rows_examined_per_scan": 24, "rows_produced_per_join": 24, "filtered": "100.00", "using_index": true, "cost_info": { "read_cost": "1.00", "eval_cost": "4.80", "prefix_cost": "5.80", "data_read_per_join": "384" }, "used_columns": [ "dept_no", "emp_no" ] } }, { "table": { "table_name": "de", "access_type": "ref", "possible_keys": [ "PRIMARY", "emp_no" ], "key": "PRIMARY", "used_key_parts": [ "emp_no" ], "key_length": "4", "ref": [ "employees.dm.emp_no" ], "rows_examined_per_scan": 1, "rows_produced_per_join": 26, "filtered": "100.00", "cost_info": { "read_cost": "24.00", "eval_cost": "5.31", "prefix_cost": "35.11", "data_read_per_join": "424" }, "used_columns": [ "emp_no", "dept_no", "from_date" ] } } ] } } } } 1 row in set, 1 warning (0.00 sec) Note (Code 1003): /* select#1 */ select distinct `employees`.`de`.`dept_no` AS `dept_no`,count(`employees`.`dm`.`emp_no`) AS `count(dm.emp_no)` from `employees`.`dept_emp` `de` join `employees`.`dept_manager` `dm` where (`employees`.`de`.`emp_no` = `employees`.`dm`.`emp_no`) group by `employees`.`de`.`dept_no`,year(`employees`.`de`.`from_date`)

In this case, the grouping_operation member is a child of duplicates_removal, and a temporary table was used to store the result of GROUP BY before the duplicates were removed. A temporary table and a filesort were also used for the grouping operation itself.

Compare this with the regular EXPLAIN output. EXPLAIN shows only that a temporary table was used, but provides no insight into which operations it was used for:

mysql> explain select distinct de.dept_no, count(dm.emp_no) from dept_emp de join dept_manager dm using(emp_no) group by de.dept_no, year(de.from_date)G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: dm partitions: NULL type: index possible_keys: PRIMARY,emp_no key: emp_no key_len: 4 ref: NULL rows: 24 filtered: 100.00 Extra: Using index; Using temporary; Using filesort *************************** 2. row *************************** id: 1 select_type: SIMPLE table: de partitions: NULL type: ref possible_keys: PRIMARY,emp_no key: PRIMARY key_len: 4 ref: employees.dm.emp_no rows: 1 filtered: 100.00 Extra: NULL 2 rows in set, 1 warning (0.01 sec) Note (Code 1003): /* select#1 */ select distinct `employees`.`de`.`dept_no` AS `dept_no`,count(`employees`.`dm`.`emp_no`) AS `count(dm.emp_no)` from `employees`.`dept_emp` `de` join `employees`.`dept_manager` `dm` where (`employees`.`de`.`emp_no` = `employees`.`dm`.`emp_no`) group by `employees`.`de`.`dept_no`,year(`employees`.`de`.`from_date`)
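
As an aside, if you don't need the implicit sorting that GROUP BY performs (the filesort reported for the grouping operation above), you can suppress it with ORDER BY NULL. The following is a hedged sketch using the earlier two-table query: with this clause you would expect using_filesort under grouping_operation to become false and the sort_cost member to disappear, while the temporary table remains:

mysql> explain format=json select de.dept_no, count(dm.emp_no) from dept_emp de join dept_manager dm using(emp_no) group by de.dept_no order by null\G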

Conclusion:  EXPLAIN FORMAT=JSON contains all the details about the GROUP BY and DISTINCT  optimizations.

The post grouping_operation, duplicates_removal: EXPLAIN FORMAT=JSON has all details about GROUP BY appeared first on MySQL Performance Blog.

Percona Server for MongoDB 3.0.8-1.2 is now available

January 4, 2016 - 11:37am

Percona is pleased to announce the release of Percona Server for MongoDB 3.0.8-1.2 on January 4, 2016. Download the latest version from the Percona web site or from the Percona Software Repositories.

Percona Server for MongoDB is an enhanced, open source, fully compatible, highly scalable, zero-maintenance downtime database supporting the MongoDB v3.0 protocol and drivers. Based on MongoDB 3.0.8, it extends MongoDB with MongoRocks and PerconaFT storage engines, as well as features like external authentication and audit logging. Percona Server for MongoDB requires no changes to MongoDB applications or code.

New Features:

  • Added support for Ubuntu 15.10 (Wily Werewolf)
  • Contains all changes and fixes from MongoDB 3.0.8

Percona Server for MongoDB 3.0.8-1.2 release notes are available in the official documentation.

The post Percona Server for MongoDB 3.0.8-1.2 is now available appeared first on MySQL Performance Blog.

Database Performance Webinar: Tired of MySQL Making You Wait?

December 30, 2015 - 12:05pm

Too often, developers and DBAs struggle to pinpoint the root cause of MySQL database performance issues, and then spend too much time trying to fix them. Wouldn’t it be great to bypass the wasted guesswork and get right to the issue?

In our upcoming webinar Tired of MySQL Making You Wait? we’re going to help you discover how to significantly increase the performance of your applications and reduce database response time.

In this webinar, Principal Architect Alexander Rubin and Database Evangelist Janis Griffin will provide the key steps needed to identify, prioritize, and improve query performance.

They will discuss the following topics:

  • Wait time analytics using Performance / Information schemas
  • Monitoring for performance using DPA
  • Explaining plan operations focusing on temporary tables and filesort
  • Using indexes to optimize your queries
  • Using loose and tight index scans in MySQL

WHEN:

Thursday, January 7, 2016 10:00am Pacific Standard Time (UTC – 8)

PRESENTERS:

Alexander Rubin, Principal Consultant, Percona

Janis Griffin, Database Evangelist, SolarWinds

Register now!

Percona is the only company that delivers enterprise-class software, support, consulting and managed services solutions for both MySQL and MongoDB® across traditional and cloud-based platforms that maximize application performance while streamlining database efficiencies.

Percona’s industry-recognized performance experts can maximize your database, server and application performance, lower infrastructure costs, and provide capacity and scalability planning for future growth.

The post Database Performance Webinar: Tired of MySQL Making You Wait? appeared first on MySQL Performance Blog.

2016 Percona Live Tutorials Schedule is UP!

December 29, 2015 - 1:24pm

We are excited to announce that the tutorial schedule for the Percona Live Data Performance Conference 2016 is up!

The schedule shows all the details for each of our informative and enlightening Percona Live tutorial sessions, including insights into InnoDB, MySQL 5.7, MongoDB 3.2 and RocksDB. These tutorials are a must for any data performance professional!

The Percona Live Data Performance Conference is the premier open source event for the data performance ecosystem. It is the place to be for the open source community as well as businesses that thrive in the MySQL, NoSQL, cloud, big data and Internet of Things (IoT) marketplaces. Attendees include DBAs, sysadmins, developers, architects, CTOs, CEOs, and vendors from around the world.

The sneak peek schedule for Percona Live 2016 has also been posted! The Conference will feature a variety of formal tracks and sessions related to MySQL, NoSQL and Data in the Cloud. With over 150 slots to fill, there will be no shortage of great content this year.

The Percona Live Data Performance Conference will be April 18-21 at the Hyatt Regency Santa Clara & The Santa Clara Convention Center.

Just a reminder to everyone out there: our Super Saver discount rate for the Percona Live Data Performance Conference and Expo 2016 is only available until December 31 at 11:30pm PST! This rate gets you all the excellent opportunities that Percona Live offers, at the lowest price possible!

Become a conference sponsor! We have sponsorship opportunities available for this annual MySQL, NoSQL and Data in the Cloud event. Sponsors become a part of a dynamic and growing ecosystem and interact with more than 1,000 DBAs, sysadmins, developers, CTOs, CEOs, business managers, technology evangelists, solutions vendors, and entrepreneurs who attend the event.

Click through to the tutorial link right now, look them over, and pick which sessions you want to attend!

The post 2016 Percona Live Tutorials Schedule is UP! appeared first on MySQL Performance Blog.

EXPLAIN FORMAT=JSON: order_by_subqueries, group_by_subqueries details on subqueries in ORDER BY and GROUP BY

December 29, 2015 - 12:39pm

Another post in the EXPLAIN FORMAT=JSON is Cool! series! In this post, we’ll discuss how EXPLAIN FORMAT=JSON describes the optimization of subqueries in ORDER BY and GROUP BY clauses, using the order_by_subqueries and group_by_subqueries members.

EXPLAIN FORMAT=JSON can print details on how a subquery in ORDER BY is optimized:

mysql> explain format=json select emp_no, concat(first_name, ' ', last_name) f2 from employees order by (select emp_no limit 1)G *************************** 1. row *************************** EXPLAIN: { "query_block": { "select_id": 1, "cost_info": { "query_cost": "60833.60" }, "ordering_operation": { "using_filesort": true, "table": { "table_name": "employees", "access_type": "ALL", "rows_examined_per_scan": 299843, "rows_produced_per_join": 299843, "filtered": "100.00", "cost_info": { "read_cost": "865.00", "eval_cost": "59968.60", "prefix_cost": "60833.60", "data_read_per_join": "13M" }, "used_columns": [ "emp_no", "first_name", "last_name" ] }, "order_by_subqueries": [ { "dependent": true, "cacheable": false, "query_block": { "select_id": 2, "message": "No tables used" } } ] } } } 1 row in set, 2 warnings (0.00 sec) Note (Code 1276): Field or reference 'employees.employees.emp_no' of SELECT #2 was resolved in SELECT #1 Note (Code 1003): /* select#1 */ select `employees`.`employees`.`emp_no` AS `emp_no`,concat(`employees`.`employees`.`first_name`,' ',`employees`.`employees`.`last_name`) AS `f2` from `employees`.`employees` order by (/* select#2 */ select `employees`.`employees`.`emp_no` limit 1)

The output above shows the ordering_operation member of query_block (which includes the order_by_subqueries array), with information on how the subquery in ORDER BY was optimized.

This is a simple example. In real life you can have larger subqueries in the ORDER BY  clause. For example, take this more complicated and slightly crazy query:

select emp_no, concat(first_name, ' ', last_name) f2 from employees order by (select dept_no as c from salaries join dept_emp using (emp_no) group by dept_no)

Run a regular EXPLAIN on it. From that output alone, we can’t tell whether the subquery can be cached or whether it would be executed for each sorted row.

mysql> explain select emp_no, concat(first_name, ' ', last_name) f2 from employees order by (select dept_no as c from salaries join dept_emp using (emp_no) group by dept_no)G *************************** 1. row *************************** id: 1 select_type: PRIMARY table: employees partitions: NULL type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 299843 filtered: 100.00 Extra: NULL *************************** 2. row *************************** id: 2 select_type: SUBQUERY table: dept_emp partitions: NULL type: index possible_keys: PRIMARY,emp_no,dept_no key: dept_no key_len: 4 ref: NULL rows: 331215 filtered: 100.00 Extra: Using index *************************** 3. row *************************** id: 2 select_type: SUBQUERY table: salaries partitions: NULL type: ref possible_keys: PRIMARY,emp_no key: emp_no key_len: 4 ref: employees.dept_emp.emp_no rows: 10 filtered: 100.00 Extra: Using index 3 rows in set, 1 warning (0.00 sec) Note (Code 1003): /* select#1 */ select `employees`.`employees`.`emp_no` AS `emp_no`,concat(`employees`.`employees`.`first_name`,' ',`employees`.`employees`.`last_name`) AS `f2` from `employees`.`employees` order by (/* select#2 */ select `employees`.`dept_emp`.`dept_no` AS `c` from `employees`.`salaries` join `employees`.`dept_emp` where (`employees`.`salaries`.`emp_no` = `employees`.`dept_emp`.`emp_no`) group by `employees`.`dept_emp`.`dept_no`)

EXPLAIN FORMAT=JSON  provides a completely different picture:

mysql> explain format=json select emp_no, concat(first_name, ' ', last_name) f2 from employees order by (select dept_no as c from salaries join dept_emp using (emp_no) group by dept_no)G *************************** 1. row *************************** EXPLAIN: { "query_block": { "select_id": 1, "cost_info": { "query_cost": "60833.60" }, "ordering_operation": { "using_filesort": false, "table": { "table_name": "employees", "access_type": "ALL", "rows_examined_per_scan": 299843, "rows_produced_per_join": 299843, "filtered": "100.00", "cost_info": { "read_cost": "865.00", "eval_cost": "59968.60", "prefix_cost": "60833.60", "data_read_per_join": "13M" }, "used_columns": [ "emp_no", "first_name", "last_name" ] }, "optimized_away_subqueries": [ { "dependent": false, "cacheable": true, "query_block": { "select_id": 2, "cost_info": { "query_cost": "1082124.21" }, "grouping_operation": { "using_filesort": false, "nested_loop": [ { "table": { "table_name": "dept_emp", "access_type": "index", "possible_keys": [ "PRIMARY", "emp_no", "dept_no" ], "key": "dept_no", "used_key_parts": [ "dept_no" ], "key_length": "4", "rows_examined_per_scan": 331215, "rows_produced_per_join": 331215, "filtered": "100.00", "using_index": true, "cost_info": { "read_cost": "673.00", "eval_cost": "66243.00", "prefix_cost": "66916.00", "data_read_per_join": "5M" }, "used_columns": [ "emp_no", "dept_no" ] } }, { "table": { "table_name": "salaries", "access_type": "ref", "possible_keys": [ "PRIMARY", "emp_no" ], "key": "emp_no", "used_key_parts": [ "emp_no" ], "key_length": "4", "ref": [ "employees.dept_emp.emp_no" ], "rows_examined_per_scan": 10, "rows_produced_per_join": 3399374, "filtered": "100.00", "using_index": true, "cost_info": { "read_cost": "335333.33", "eval_cost": "679874.87", "prefix_cost": "1082124.21", "data_read_per_join": "51M" }, "used_columns": [ "emp_no", "from_date" ] } } ] } } } ] } } } 1 row in set, 1 warning (0.00 sec) Note (Code 1003): /* select#1 */ select `employees`.`employees`.`emp_no` AS `emp_no`,concat(`employees`.`employees`.`first_name`,' ',`employees`.`employees`.`last_name`) AS `f2` from `employees`.`employees` order by (/* select#2 */ select `employees`.`dept_emp`.`dept_no` AS `c` from `employees`.`salaries` join `employees`.`dept_emp` where (`employees`.`salaries`.`emp_no` = `employees`.`dept_emp`.`emp_no`) group by `employees`.`dept_emp`.`dept_no`)

We see that the subquery was optimized away: the optimized_away_subqueries member exists, but there is no order_by_subqueries in the ordering_operation object. We can also see that the subquery was cached: "cacheable": true.
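
By contrast, a correlated subquery in ORDER BY cannot be optimized away. The following is a hedged sketch (exact costs will differ): because the inner query refers to the outer row, you would expect it to appear in the order_by_subqueries array with "dependent": true and "cacheable": false instead of under optimized_away_subqueries:

mysql> explain format=json select emp_no, first_name from employees order by (select max(salary) from salaries where salaries.emp_no = employees.emp_no)\G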

EXPLAIN FORMAT=JSON also provides information about subqueries in the GROUP BY clause. It uses the group_by_subqueries array in the grouping_operation  member for this purpose.

mysql> explain format=json select count(emp_no) from salaries group by salary > ALL (select s/c as avg_salary from (select dept_no, sum(salary) as s, count(emp_no) as c from salaries join dept_emp using (emp_no) group by dept_no) t)G *************************** 1. row *************************** EXPLAIN: { "query_block": { "select_id": 1, "cost_info": { "query_cost": "3412037.60" }, "grouping_operation": { "using_temporary_table": true, "using_filesort": true, "cost_info": { "sort_cost": "2838638.00" }, "table": { "table_name": "salaries", "access_type": "ALL", "rows_examined_per_scan": 2838638, "rows_produced_per_join": 2838638, "filtered": "100.00", "cost_info": { "read_cost": "5672.00", "eval_cost": "567727.60", "prefix_cost": "573399.60", "data_read_per_join": "43M" }, "used_columns": [ "emp_no", "salary", "from_date" ] }, "group_by_subqueries": [ { "dependent": true, "cacheable": false, "query_block": { "select_id": 2, "cost_info": { "query_cost": "881731.00" }, "table": { "table_name": "t", "access_type": "ALL", "rows_examined_per_scan": 3526884, "rows_produced_per_join": 3526884, "filtered": "100.00", "cost_info": { "read_cost": "176354.20", "eval_cost": "705376.80", "prefix_cost": "881731.00", "data_read_per_join": "134M" }, "used_columns": [ "dept_no", "s", "c" ], "attached_condition": "((<cache>(`employees`.`salaries`.`salary`) <= (`t`.`s` / `t`.`c`)) or isnull((`t`.`s` / `t`.`c`)))", "materialized_from_subquery": { "using_temporary_table": true, "dependent": false, "cacheable": true, "query_block": { "select_id": 3, "cost_info": { "query_cost": "1106758.94" }, "grouping_operation": { "using_filesort": false, "nested_loop": [ { "table": { "table_name": "dept_emp", "access_type": "index", "possible_keys": [ "PRIMARY", "emp_no", "dept_no" ], "key": "dept_no", "used_key_parts": [ "dept_no" ], "key_length": "4", "rows_examined_per_scan": 331215, "rows_produced_per_join": 331215, "filtered": "100.00", "using_index": true, "cost_info": { "read_cost": "673.00", "eval_cost": "66243.00", "prefix_cost": "66916.00", "data_read_per_join": "5M" }, "used_columns": [ "emp_no", "dept_no" ] } }, { "table": { "table_name": "salaries", "access_type": "ref", "possible_keys": [ "PRIMARY", "emp_no" ], "key": "PRIMARY", "used_key_parts": [ "emp_no" ], "key_length": "4", "ref": [ "employees.dept_emp.emp_no" ], "rows_examined_per_scan": 10, "rows_produced_per_join": 3526884, "filtered": "100.00", "cost_info": { "read_cost": "334466.14", "eval_cost": "705376.80", "prefix_cost": "1106758.95", "data_read_per_join": "53M" }, "used_columns": [ "emp_no", "salary", "from_date" ] } } ] } } } } } } ] } } } 1 row in set, 1 warning (0.01 sec) Note (Code 1003): /* select#1 */ select count(`employees`.`salaries`.`emp_no`) AS `count(emp_no)` from `employees`.`salaries` group by <not>(<in_optimizer>(`employees`.`salaries`.`salary`,<exists>(/* select#2 */ select 1 from (/* select#3 */ select `employees`.`dept_emp`.`dept_no` AS `dept_no`,sum(`employees`.`salaries`.`salary`) AS `s`,count(`employees`.`salaries`.`emp_no`) AS `c` from `employees`.`salaries` join `employees`.`dept_emp` where (`employees`.`salaries`.`emp_no` = `employees`.`dept_emp`.`emp_no`) group by `employees`.`dept_emp`.`dept_no`) `t` where ((<cache>(`employees`.`salaries`.`salary`) <= (`t`.`s` / `t`.`c`)) or isnull((`t`.`s` / `t`.`c`))) having <is_not_null_test>((`t`.`s` / `t`.`c`)))))

Again, this output gives a clear view of the query optimization: the subquery in GROUP BY itself cannot be optimized away, cached or converted into a temporary table, but the subquery inside the subquery (select dept_no, sum(salary) as s, count(emp_no) as c from salaries join dept_emp using (emp_no) group by dept_no) could be materialized into a temporary table and cached.

A regular EXPLAIN command does not provide such details:

mysql> explain select count(emp_no) from salaries group by salary > ALL (select s/c as avg_salary from (select dept_no, sum(salary) as s, count(emp_no) as c from salaries join dept_emp using (emp_no) group by dept_no) t)G *************************** 1. row *************************** id: 1 select_type: PRIMARY table: salaries partitions: NULL type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 2838638 filtered: 100.00 Extra: Using temporary; Using filesort *************************** 2. row *************************** id: 2 select_type: DEPENDENT SUBQUERY table: <derived3> partitions: NULL type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 3526884 filtered: 100.00 Extra: Using where *************************** 3. row *************************** id: 3 select_type: DERIVED table: dept_emp partitions: NULL type: index possible_keys: PRIMARY,emp_no,dept_no key: dept_no key_len: 4 ref: NULL rows: 331215 filtered: 100.00 Extra: Using index *************************** 4. row *************************** id: 3 select_type: DERIVED table: salaries partitions: NULL type: ref possible_keys: PRIMARY,emp_no key: PRIMARY key_len: 4 ref: employees.dept_emp.emp_no rows: 10 filtered: 100.00 Extra: NULL 4 rows in set, 1 warning (0.01 sec) Note (Code 1003): /* select#1 */ select count(`employees`.`salaries`.`emp_no`) AS `count(emp_no)` from `employees`.`salaries` group by <not>(<in_optimizer>(`employees`.`salaries`.`salary`,<exists>(/* select#2 */ select 1 from (/* select#3 */ select `employees`.`dept_emp`.`dept_no` AS `dept_no`,sum(`employees`.`salaries`.`salary`) AS `s`,count(`employees`.`salaries`.`emp_no`) AS `c` from `employees`.`salaries` join `employees`.`dept_emp` where (`employees`.`salaries`.`emp_no` = `employees`.`dept_emp`.`emp_no`) group by `employees`.`dept_emp`.`dept_no`) `t` where ((<cache>(`employees`.`salaries`.`salary`) <= (`t`.`s` / `t`.`c`)) or isnull((`t`.`s` / `t`.`c`))) having <is_not_null_test>((`t`.`s` / `t`.`c`)))))

Most importantly, we cannot guess from the output if the DERIVED subquery can be cached.

Conclusion: EXPLAIN FORMAT=JSON provides details on how subqueries in ORDER BY and GROUP BY clauses are optimized.

The post EXPLAIN FORMAT=JSON: order_by_subqueries, group_by_subqueries details on subqueries in ORDER BY and GROUP BY appeared first on MySQL Performance Blog.

EXPLAIN FORMAT=JSON provides insights on optimizer_switch effectiveness

December 28, 2015 - 11:20am

The previous post in the EXPLAIN FORMAT=JSON is Cool! series showed an example of the query select dept_name from departments where dept_no in (select dept_no from dept_manager where to_date is not null), where the subquery was materialized into a temporary table and then joined with the outer query. This is known as a semi-join optimization. But what happens if we turn off this optimization?

EXPLAIN FORMAT=JSON can help us with this investigation too.

First, let’s look at the original output again:

mysql> explain format=json select dept_name from departments where dept_no in (select dept_no from dept_manager where to_date is not null)G *************************** 1. row *************************** EXPLAIN: {   "query_block": {     "select_id": 1,     "cost_info": {       "query_cost": "16.72"     },     "nested_loop": [       {         "table": {           "table_name": "departments",           <skipped>       },       {         "table": {           "table_name": "<subquery2>",           "access_type": "eq_ref",           "key": "<auto_key>",           "key_length": "4",           "ref": [             "employees.departments.dept_no"           ],           "rows_examined_per_scan": 1,           "materialized_from_subquery": {             "using_temporary_table": true,             "query_block": {               "table": {                 "table_name": "dept_manager",                 "access_type": "ALL",                 "possible_keys": [                   "dept_no"                 ],                 "rows_examined_per_scan": 24,                 "rows_produced_per_join": 21,                 "filtered": "90.00",                 "cost_info": {                   "read_cost": "1.48",                   "eval_cost": "4.32",                   "prefix_cost": "5.80",                   "data_read_per_join": "345"                 },                 "used_columns": [                   "dept_no",                   "to_date"                 ],                 "attached_condition": "(`employees`.`dept_manager`.`to_date` is not null)"               }             }           }         }       }     ]   } } 1 row in set, 1 warning (0.00 sec) Note (Code 1003): /* select#1 */ select `employees`.`departments`.`dept_name` AS `dept_name` from `employees`.`departments` semi join (`employees`.`dept_manager`) where ((`<subquery2>`.`dept_no` = `employees`.`departments`.`dept_no`) and (`employees`.`dept_manager`.`to_date` is not null))

To repeat what happened here: the subquery was materialized into a temporary table, then joined with the departments table. Semi-join optimization is ON by default, so this is what you would most likely see without any intervention.

What happens if we temporarily turn semi-join optimization OFF?

mysql> set optimizer_switch="semijoin=off"; Query OK, 0 rows affected (0.00 sec)

And then execute EXPLAIN one more time:

mysql> explain format=json select dept_name from departments where dept_no in (select dept_no from dept_manager where to_date is not null) G *************************** 1. row *************************** EXPLAIN: { "query_block": { "select_id": 1, "cost_info": { "query_cost": "2.80" }, "table": { "table_name": "departments", "access_type": "index", "key": "dept_name", "used_key_parts": [ "dept_name" ], "key_length": "42", "rows_examined_per_scan": 9, "rows_produced_per_join": 9, "filtered": "100.00", "using_index": true, "cost_info": { "read_cost": "1.00", "eval_cost": "1.80", "prefix_cost": "2.80", "data_read_per_join": "432" }, "used_columns": [ "dept_no", "dept_name" ], "attached_condition": "<in_optimizer>(`employees`.`departments`.`dept_no`,`employees`.`departments`.`dept_no` in ( <materialize> (/* select#2 */ select `employees`.`dept_manager`.`dept_no` from `employees`.`dept_manager` where (`employees`.`dept_manager`.`to_date` is not null) ), <primary_index_lookup>(`employees`.`departments`.`dept_no` in <temporary table> on <auto_key> where ((`employees`.`departments`.`dept_no` = `materialized-subquery`.`dept_no`)))))", "attached_subqueries": [ { "table": { "table_name": "<materialized_subquery>", "access_type": "eq_ref", "key": "<auto_key>", "key_length": "4", "rows_examined_per_scan": 1, "materialized_from_subquery": { "using_temporary_table": true, "dependent": true, "cacheable": false, "query_block": { "select_id": 2, "cost_info": { "query_cost": "5.80" }, "table": { "table_name": "dept_manager", "access_type": "ALL", "possible_keys": [ "dept_no" ], "rows_examined_per_scan": 24, "rows_produced_per_join": 21, "filtered": "90.00", "cost_info": { "read_cost": "1.48", "eval_cost": "4.32", "prefix_cost": "5.80", "data_read_per_join": "345" }, "used_columns": [ "dept_no", "to_date" ], "attached_condition": "(`employees`.`dept_manager`.`to_date` is not null)" } } } } } ] } } } 1 row in set, 1 warning (0.00 sec) Note (Code 1003): /* select#1 */ select `employees`.`departments`.`dept_name` AS `dept_name` from `employees`.`departments` where <in_optimizer>(`employees`.`departments`.`dept_no`,`employees`.`departments`.`dept_no` in ( <materialize> (/* select#2 */ select `employees`.`dept_manager`.`dept_no` from `employees`.`dept_manager` where (`employees`.`dept_manager`.`to_date` is not null) ), <primary_index_lookup>(`employees`.`departments`.`dept_no` in <temporary table> on <auto_key> where ((`employees`.`departments`.`dept_no` = `materialized-subquery`.`dept_no`)))))

Now the picture is completely different. There is no nested_loop member, and instead there is an attached_subqueries array containing a single member: the temporary table materialized from the subquery select dept_no from dept_manager where to_date is not null (including all the details of this materialization).
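
Since SET without a scope modifier changes only the current session, the server default is untouched. Still, it is worth checking the flag and putting it back once you finish experimenting; a minimal sketch (setting a flag to "default" should restore its built-in value, ON in this case):

mysql> select @@optimizer_switch like '%semijoin=on%' as semijoin_on;
mysql> set optimizer_switch="semijoin=default";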

Conclusion: We can experiment with the value of optimizer_switch and use EXPLAIN FORMAT=JSON to examine how a particular optimization affects our queries.

The post EXPLAIN FORMAT=JSON provides insights on optimizer_switch effectiveness appeared first on MySQL Performance Blog.

Percona Server for MongoDB storage engines in iiBench insert workload

December 23, 2015 - 9:39am

We recently released the GA version of Percona Server for MongoDB, which comes with a variety of storage engines: RocksDB, PerconaFT and WiredTiger.

Both RocksDB and PerconaFT are write-optimized engines, so I wanted to compare all the engines in a workload oriented toward data ingestion.

For the benchmark I used iiBench-mongo (https://github.com/mdcallag/iibench-mongodb), inserting one billion (bln) rows into a collection with three indexes. Inserts were done in ten parallel threads.

For memory limits, I used 10GB as the cache size, with a total limit of 20GB available for the mongod process, enforced with cgroups (so an extra 10GB of memory was available for engine memory allocation and the OS cache).

For the storage I used a single Crucial M500 960GB SSD. This is a consumer grade SATA SSD. It does not provide the best performance, but it is a great option price/performance wise.

Every time I mention WiredTiger, someone in the comments asks about the LSM option for WiredTiger. So even though LSM is still not an officially supported mode in MongoDB 3.2, I added WiredTiger-LSM from MongoDB 3.2 into the mix. It probably doesn’t have optimal settings, as there is no documentation on how to use LSM in WiredTiger.

First, let me show a combined graph for all engines:

And now, let’s zoom in on the individual engines.

WiredTiger:

RocksDB + PerconaFT:

What conclusions can we make?

  1. While the data fit in memory (roughly the first one million rows), WiredTiger performed extremely well, achieving over 100,000 inserts/sec. As the data grew beyond the memory size, WiredTiger behaved like a traditional B-Tree engine (which is no surprise).
  2. PerconaFT and RocksDB showed closer-to-constant throughput, with RocksDB being better overall. However, as the data grew, both engines started to experience challenges: PerconaFT’s throughput varied more, and RocksDB showed more stalls (which I think are related to the compaction process).
  3. WiredTiger LSM didn’t show as much variance as the B-Tree mode, but it still showed a decline related to data size, which in general should not happen for an LSM (as we see with RocksDB, which is also LSM-based).

Inserting data is only one part of the equation. Now we also need to retrieve data from the database (which we’ll cover in another blog post).

Configuration for PerconaFT:

numactl --interleave=all ./mongod --dbpath=/mnt/m500/perconaft --storageEngine=PerconaFT --PerconaFTEngineCacheSize=$(( 10*1024*1024*1024 )) --syncdelay=900 --PerconaFTIndexFanout=128 --PerconaFTCollectionFanout=128 --PerconaFTIndexCompression=quicklz --PerconaFTCollectionCompression=quicklz --PerconaFTIndexReadPageSize=16384 --PerconaFTCollectionReadPageSize=16384

Configuration for RocksDB:

storage.rocksdb.configString: "bytes_per_sync=16m;max_background_flushes=3;max_background_compactions=12;max_write_buffer_number=4;max_bytes_for_level_base=1500m;target_file_size_base=200m;level0_slowdown_writes_trigger=12;write_buffer_size=400m;compression_per_level=kSnappyCompression:kSnappyCompression:kSnappyCompression:kSnappyCompression:kSnappyCompression:kSnappyCompression:kSnappyCompression;optimize_filters_for_hits=true"

Configuration for WiredTiger-3.2 LSM:

storage.wiredTiger.collectionConfig.configString: "type=lsm" storage.wiredTiger.indexConfig.configString: "type=lsm"

Load parameters for iibench:

TEST_RUN_ARGS_LOAD="1000000000 6000 1000 999999 10 256 3 0"

The post Percona Server for MongoDB storage engines in iiBench insert workload appeared first on MySQL Performance Blog.

EXPLAIN FORMAT=JSON: everything about attached_subqueries, optimized_away_subqueries, materialized_from_subquery

December 22, 2015 - 2:41pm

The regular EXPLAIN command already provides some information about subquery optimization. For example, you can find out if the subquery is dependent or not, and (since version 5.6) if it was materialized:

mysql> explain select dept_name from departments where dept_no in (select dept_no from dept_manager where to_date is not null)G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: departments partitions: NULL type: index possible_keys: PRIMARY key: dept_name key_len: 42 ref: NULL rows: 9 filtered: 100.00 Extra: Using where; Using index *************************** 2. row *************************** id: 1 select_type: SIMPLE table: <subquery2> partitions: NULL type: eq_ref possible_keys: <auto_key> key: <auto_key> key_len: 4 ref: employees.departments.dept_no rows: 1 filtered: 100.00 Extra: NULL *************************** 3. row *************************** id: 2 select_type: MATERIALIZED table: dept_manager partitions: NULL type: ALL possible_keys: dept_no key: NULL key_len: NULL ref: NULL rows: 24 filtered: 90.00 Extra: Using where 3 rows in set, 1 warning (0.00 sec) Note (Code 1003): /* select#1 */ select `employees`.`departments`.`dept_name` AS `dept_name` from `employees`.`departments` semi join (`employees`.`dept_manager`) where ((`<subquery2>`.`dept_no` = `employees`.`departments`.`dept_no`) and (`employees`.`dept_manager`.`to_date` is not null))

However, you can’t find details on exactly how this subquery was materialized. To find out more, use EXPLAIN FORMAT=JSON:

mysql> explain format=json select dept_name from departments where dept_no in (select dept_no from dept_manager where to_date is not null)G *************************** 1. row *************************** EXPLAIN: { "query_block": { "select_id": 1, "cost_info": { "query_cost": "16.72" }, "nested_loop": [ { "table": { "table_name": "departments", <skipped> }, { "table": { "table_name": "<subquery2>", "access_type": "eq_ref", "key": "<auto_key>", "key_length": "4", "ref": [ "employees.departments.dept_no" ], "rows_examined_per_scan": 1, "materialized_from_subquery": { "using_temporary_table": true, "query_block": { "table": { "table_name": "dept_manager", "access_type": "ALL", "possible_keys": [ "dept_no" ], "rows_examined_per_scan": 24, "rows_produced_per_join": 21, "filtered": "90.00", "cost_info": { "read_cost": "1.48", "eval_cost": "4.32", "prefix_cost": "5.80", "data_read_per_join": "345" }, "used_columns": [ "dept_no", "to_date" ], "attached_condition": "(`employees`.`dept_manager`.`to_date` is not null)" } } } } } ] } } 1 row in set, 1 warning (0.00 sec) Note (Code 1003): /* select#1 */ select `employees`.`departments`.`dept_name` AS `dept_name` from `employees`.`departments` semi join (`employees`.`dept_manager`) where ((`<subquery2>`.`dept_no` = `employees`.`departments`.`dept_no`) and (`employees`.`dept_manager`.`to_date` is not null))

We can see here that a new temporary table ("using_temporary_table": true) with the internal name <subquery2> was materialized from the subquery (materialized_from_subquery). We can also see how this new temporary table was accessed.

We can also see that the member containing information about the materialized subquery is part of the nested_loop array. Nested loop optimization applies to regular JOINs; in this case, it means that the MySQL server joins the temporary table with the outer query.
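
You can approximate this transformation by hand with a derived table. The rewrite below is a hedged sketch rather than the exact plan the optimizer produces: because the derived table uses DISTINCT it cannot be merged, so you would expect a similar nested_loop plan with the derived table reported under materialized_from_subquery:

mysql> explain format=json select dept_name from departments join (select distinct dept_no from dept_manager where to_date is not null) dm using (dept_no)\G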

Another interesting case is optimized_away_subqueries: this member shows subqueries that were executed only once and replaced by their result. A regular EXPLAIN command provides no information about such an optimization:

mysql> explain select emp_no, salary from salaries order by (select max(salary) from salaries)G *************************** 1. row *************************** id: 1 select_type: PRIMARY table: salaries partitions: NULL type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 2838525 filtered: 100.00 Extra: NULL *************************** 2. row *************************** id: 2 select_type: SUBQUERY table: salaries partitions: NULL type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 2838525 filtered: 100.00 Extra: NULL 2 rows in set, 1 warning (0.00 sec) Note (Code 1003): /* select#1 */ select `employees`.`salaries`.`emp_no` AS `emp_no`,`employees`.`salaries`.`salary` AS `salary` from `employees`.`salaries` order by (/* select#2 */ select max(`employees`.`salaries`.`salary`) from `employees`.`salaries`)

All we can see is select_type: SUBQUERY, and the output of SHOW WARNINGS doesn’t give us much insight either.

However, EXPLAIN FORMAT=JSON tells us how this subquery was optimized:

mysql> explain format=json select emp_no, salary from salaries order by (select max(salary) from salaries)G *************************** 1. row *************************** EXPLAIN: { "query_block": { "select_id": 1, "cost_info": { "query_cost": "573505.00" }, "ordering_operation": { "using_filesort": false, "table": { "table_name": "salaries", <skipped> "optimized_away_subqueries": [ { "dependent": false, "cacheable": true, "query_block": { "select_id": 2, "cost_info": { "query_cost": "573505.00" }, "table": { "table_name": "salaries", "access_type": "ALL", "rows_examined_per_scan": 2838525, "rows_produced_per_join": 2838525, "filtered": "100.00", "cost_info": { "read_cost": "5800.00", "eval_cost": "567705.00", "prefix_cost": "573505.00", "data_read_per_join": "43M" }, "used_columns": [ "salary" ] } } } ] } } } 1 row in set, 1 warning (0.00 sec)

After seeing optimized_away_subqueries and "cacheable": true, we know that the subquery was executed only once, and the result was cached and used to resolve the outer query.

Another subquery type that I want to mention in this post is attached_subqueries, which covers subqueries that are not converted to a JOIN.

mysql> explain format=json select emp_no from salaries where salary > (select avg(salary) from salaries)G *************************** 1. row *************************** EXPLAIN: { "query_block": { "select_id": 1, "cost_info": { "query_cost": "250601.60" }, "table": { "table_name": "salaries", "access_type": "ALL", "rows_examined_per_scan": 1240668, "rows_produced_per_join": 413514, "filtered": "33.33", "cost_info": { "read_cost": "167898.67", "eval_cost": "82702.92", "prefix_cost": "250601.60", "data_read_per_join": "6M" }, "used_columns": [ "emp_no", "salary" ], "attached_condition": "(`employees`.`salaries`.`salary` > (/* select#2 */ select avg(`employees`.`salaries`.`salary`) from `employees`.`salaries`))", "attached_subqueries": [ { "dependent": false, "cacheable": true, "query_block": { "select_id": 2, "cost_info": { "query_cost": "250601.80" }, "table": { "table_name": "salaries", "access_type": "ALL", "rows_examined_per_scan": 1240669, "rows_produced_per_join": 1240669, "filtered": "100.00", "cost_info": { "read_cost": "2468.00", "eval_cost": "248133.80", "prefix_cost": "250601.80", "data_read_per_join": "18M" }, "used_columns": [ "salary" ] } } } ] } } } 1 row in set, 1 warning (0.00 sec) Note (Code 1003): /* select#1 */ select `employees`.`salaries`.`emp_no` AS `emp_no` from `employees`.`salaries` where (`employees`.`salaries`.`salary` > (/* select#2 */ select avg(`employees`.`salaries`.`salary`) from `employees`.`salaries`))

For these subqueries, EXPLAIN FORMAT=JSON shows whether they are dependent and/or cacheable. There is also a query_block member inside the attached_subqueries object, which provides all the details on how the subquery was optimized.
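
For comparison, a correlated variation of the same query should no longer be cacheable. This is a hedged sketch: since the inner query now refers to the outer row, you would expect the attached_subqueries member to show "dependent": true and "cacheable": false:

mysql> explain format=json select emp_no, salary from salaries s where salary > (select avg(salary) from salaries s2 where s2.emp_no = s.emp_no)\G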

Conclusion: EXPLAIN FORMAT=JSON provides some great details on subquery optimization.

The post EXPLAIN FORMAT=JSON: everything about attached_subqueries, optimized_away_subqueries, materialized_from_subquery appeared first on MySQL Performance Blog.


