
Percona Server for MongoDB 3.2.12-3.2 is now available

March 9, 2017 - 10:52am

Percona announces the release of Percona Server for MongoDB 3.2.12-3.2 on March 9, 2017. Download the latest version from the Percona web site or the Percona Software Repositories.

Percona Server for MongoDB 3.2.12-3.2 is an enhanced, open-source, fully compatible, highly scalable, zero-maintenance downtime database supporting the MongoDB v3.2 protocol and drivers. It extends MongoDB with MongoRocks, Percona Memory Engine, and PerconaFT storage engine, as well as enterprise-grade features like External Authentication, Audit Logging, Profiling Rate Limiting, and Hot Backup at no extra cost. Percona Server for MongoDB requires no changes to MongoDB applications or code.

NOTE: We deprecated the PerconaFT storage engine. It will not be available in future releases.

This release is based on MongoDB 3.2.12 and includes the following additional changes:

  • PSMDB-17: Changed welcome message in the shell to mention Percona Server for MongoDB instead of MongoDB
  • PSMDB-90: Added error message for storage engines that do not support Hot Backup
  • PSMDB-91: Deprecated audit configuration section and added auditLog instead
  • PSMDB-95: Fixed version dependencies for sub packages so that all corresponding packages get updated accordingly
  • PSMDB-96: Excluded diagnostic.data directory when using TokuBackup with PerconaFT
  • PSMDB-98: Improved Hot Backup to create destination folder if it does not exist
  • PSMDB-101: Implemented the auditAuthorizationSuccess parameter to enable auditing of authorization success
  • PSMDB-104: Updated links in client shell output to point to Percona’s documentation and forum
  • PSMDB-107: Fixed behavior when creating the audit log file
  • PSMDB-111: Refactored external_auth tests
  • PSMDB-123: Fixed the creation of proper subdirectories inside the backup destination directory
  • PSMDB-126: Added index and collection name to duplicate key error message
  • Fixed startup scripts for Ubuntu 14.04.5 LTS (Trusty Tahr)
  • Fixed a number of other small issues and bugs

Percona Server for MongoDB 3.2.12-3.2 release notes are available in the official documentation.

Services Monitoring with Probabilistic Fault Detection

March 9, 2017 - 10:19am

In this blog post, we’ll discuss services monitoring using probabilistic fault detection.

Let's admit it, the task of monitoring services is one of the most difficult. It is time-consuming, error-prone and difficult to automate. The usual monitoring approach has been pretty straightforward in the last few years: set up a service like Nagios, or pay money to get a cloud-based monitoring tool. Then choose the metrics you are interested in and set the thresholds. This is a manual process that works when you have a small number of services and servers, and you know exactly how they behave and what you should monitor. These days, we have hundreds of servers with thousands of services sending us millions of metrics. That is the first problem: the manual approach to configuration doesn't work.

That is not the only problem. We know that no two servers perform the same because no two servers have exactly the same workload. The thresholds that you set up for one server might not be correct for the other thousand. Some approaches make the problem even worse, like taking averages and setting the thresholds based on those, hoping it will work. Let me tell you a secret: it won't work. Here we have a second problem: instances of the same type can demonstrate very different behaviors.

The last problem is that shiny new services your company may want to use are announced every week. Because of time constraints, it is impossible to know all of those services well enough to create a perfect monitoring template. In other words: sometimes we are asked to monitor software we don't completely understand.

In summary, you have thousands of services, some of which you don't even fully understand, sending you millions of metrics that mean nothing to you. Now, set the thresholds and enable the pager alert. The nightmare has started. Is there a different approach?

Machine Learning

We have to stop thinking that monitoring is a bunch of config files with thresholds that we copy from one server to another. There are no magic templates that will work. We need a different technique that removes us from the process. That technique is machine learning. As Wikipedia states, it is a subfield of computer science that gives computers the ability to learn without being explicitly programmed. In its most basic form, it can be used to solve classification problems: for example, looking at pet photos and identifying whether each one shows a cat or a dog. This is a classification problem that both humans and computers can solve, but we are much, much slower. The computer has to take the time to learn the patterns, but at some point it will do the classification in no time.

I hope you are starting to see a pattern here. Why do we need to care about monitoring and its configuration if we have computers that can learn patterns and classify things for us?

There are two main ways of doing probabilistic fault detection: Novelty Detection and Outlier Detection.

Novelty Detection

Novelty Detection is easy to visualize and understand. It takes a series of inputs and tries to find anomalies, something that hasn’t been seen before. For example, our credit card company has a function that takes “category, expense, date, hour, country” as arguments and returns an integer so that they can classify and identify all the purchases. Your monthly use of the credit card looks like this:

[0,4,4,5,5,5,4,3]

That is the normal model that defines your use of the credit card. Now, it can be used to detect anomalies.

  • [0] – OK
  • [4] – OK
  • [4] – OK
  • [1] – Anomaly! Operation canceled.

It is simple and very useful in a lot of areas for generating alerts when something anomalous happens. One of the machine learning models that can do this is the One-Class Support Vector Machine, but since this is not the kind of fault detection we are looking for, I won't go into details; a minimal sketch follows for reference.
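
For completeness, here is a minimal scikit-learn sketch of One-Class SVM novelty detection (the data and parameters are made up for illustration, and this is not the approach used later in this post):

import numpy as np
from sklearn.svm import OneClassSVM

# "Normal" credit card usage codes observed for this customer (made-up data)
normal_usage = np.array([0, 4, 4, 5, 5, 5, 4, 3]).reshape(-1, 1)

# Train a one-class model on what "normal" looks like
model = OneClassSVM(nu=0.1, kernel="rbf", gamma=0.5)
model.fit(normal_usage)

# Score new operations: values close to the training data should come back
# as 1 (inlier), values far away from it as -1 (novelty/anomaly)
new_operations = np.array([0, 4, 4, 20]).reshape(-1, 1)
print(model.predict(new_operations))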

Outlier Detection

Let’s say we have this data:

[0, 3, 5, 1, -2, 19, 2, 10, -9, 2, 1, 8, 3, 21, -1, 3]

Now we know how to find anomalies, but how do we find outliers? Looking at the numbers above, it seems 21, 19 and -9 could be outliers. But we need a more exact definition than intuition. The simplest and most common way of doing it is the following:

We divide our data with two cuts: one at the 25% mark and one at the 75% mark. The value at the 25% cut is called the First Quartile, and the value at the 75% cut is called the Third Quartile. The IQR, or Interquartile Range, is the Third Quartile minus the First Quartile. An outlier is then any number that falls into one of these two categories:

  • If the value is below (First Quartile) − (1.5 × IQR)
  • If the value is above (Third Quartile) + (1.5 × IQR)

Using Python:

import numpy as np

inputs = [0, 3, 5, 1, -2, 19, 2, 10, -9, 2, 1, 8, 3, 21, -1, 3]
Q1 = np.percentile(inputs, 25)   # 0.75
Q3 = np.percentile(inputs, 75)   # 5.75
step = (Q3 - Q1) * 1.5           # 7.5
outliers = [x for x in inputs if x < Q1 - step or x > Q3 + step]
print(outliers)                  # [19, -9, 21]

This looks more like what we need. If we are monitoring a metric, and outliers are detected, then something is happening there that requires investigation. Some of the most used outlier detection models in scikit-learn are:

  • Elliptic Envelope: a robust covariance estimate that assumes our data is Gaussian distributed. It defines the shape of the data we have, creating a frontier that delimits the contour. As you probably guessed, that frontier is elliptical in shape. Don't worry about the assumption of Gaussian distribution: data can be standardized. More about this later on.

  • Isolation Forest: this is the well-known “forest of random trees,” but applied to outlier detection. This is more suitable when we have many different input metrics. In the example I use below, I just use a single metric, so this model would not work that well.

Therefore, Elliptic Envelope looks like the best option for our proof-of-concept.

For visual reference, this is how the three models look when they try to shape two input metrics:

Source: scikit-learn.org

Proof-of-Concept

I haven’t explained the model in detail, but a high level explanation should be enough to understand the problem and the possible solution. Let’s start building a proof-of-concept.

For this test, I got data from our Prometheus setup, where all the time-series monitoring data from our customers is stored. In this particular example, I got numbers from the "Threads Running" metric. Those will be used to train our Elliptic Envelope. It is important to take the following into account:

    • We need to collect enough data so that it captures the correct shape of our baseline performance. For example, usually nighttime hours have less of a workload than during the day (same with weekend days, in some cases).
    • As explained before, it assumes a Gaussian distribution, which means that the data needs to be scaled. I am going to standardize the data so that it has 0 mean and 1 variance. The same standardization needs to be applied to the data we test after the training process, when the monitoring is already in place. That standardization also needs to be applied to each metric individually. This is the formula:
    z = (x − μ) / σ

With μ as the mean and σ as the standard deviation.
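
As a quick illustration of that formula (not from the original post), a hand-rolled standardization matches what scikit-learn's StandardScaler produces:

import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([1.0, 2.0, 2.0, 3.0, 7.0]).reshape(-1, 1)

# Manual standardization: z = (x - mean) / standard deviation
z_manual = (x - x.mean()) / x.std()

# StandardScaler applies the same transformation
z_scaler = StandardScaler().fit_transform(x)

print(np.allclose(z_manual, z_scaler))  # True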

This is the summary of what our proof-of-concept will do:

  • Read Prometheus JSON dump.
  • Separate some data for training, standardizing it first.
  • Separate some data for testing, standardizing it first as well.
  • Make predictions on test data.
  • For those rows identified as outliers, get the original non-standardized data to see the number of threads running.

So, let’s start:

Import the Libraries

import pandas as pd
import numpy as np
import json
from datetime import datetime
from sklearn.preprocessing import StandardScaler
from sklearn.covariance import EllipticEnvelope

Load the Data

All the information is in a JSON output from Prometheus that has the “threads_running” of a particular server. It has one second granularity for the last four weeks. I also converted “timestamps” to a normal “datetime” object so that it is easier to read:

with open('query.json') as data_file:
    data = json.load(data_file)

data = pd.DataFrame(data["data"]["result"][0]["values"])
data[0] = data[0].astype(int)
data[0] = data[0].apply(lambda x: datetime.fromtimestamp(x))

The data looks like this:

DateTime              Threads Running
2017-01-19 20:32:44   1
2017-01-19 20:32:45   1
2017-01-19 20:32:46   2
…                     …

Create the Training and Testing Dataset

First, separate some of the data for use as training:

train_data = data[(data[0] >= "2017-01-22") & (data[0] <= "2017-01-28" )]

Ignore the date column, and just store the metrics:

train_data = train_data.iloc[:,[1]]

Standardize it:

escaler = StandardScaler()
train_data = escaler.fit_transform(train_data)

Now the data looks like this:

Standardized Threads Running
-0.4072634
-0.4072634
 0.47153585
…

To create the test dataset we need to follow the exact same procedure, only selecting a different timeframe:

test_original_data = data[(data[0] >= "2017-02-2") & (data[0] <= "2017-02-17" )]
test_data = test_original_data.iloc[:,[1]]
test_data = escaler.transform(test_data)

Train the Model

Let’s create our model with the training data! I am using two parameters here:

  • assume_centered: to specify that our data is already centered around zero (which is true after standardization), so the model does not re-estimate the mean.
  • contamination: to specify the ratio of outliers our training data has.

clf = EllipticEnvelope(assume_centered=True, contamination=0)
clf.fit(train_data)

Search for Outliers

Now that we’ve trained the model and we have our test data, we can ask the model if it finds any outliers. It will return 1 or -1 for each row. “1” means that the value of threads running is normal and within the boundaries, while “-1” means that the value is an outlier:

predictions = clf.predict(test_data)
outliers = np.where(predictions==-1)

The array “outliers” stores the row numbers where -1 was predicted.

At this point we have three important variables:

  • test_data: standardized testing data.
  • test_original_data: the original test data without modification.
  • outliers: the row numbers where an outlier was detected (-1).

Investigate the Outliers

Since we have the row number where an outlier was detected, now we can just query test_original_data and search for those rows. In this example, I show some random ones:

for indice in outliers[0]:
    if np.random.randn() > 2.5:
        print("{} - {} threads running".format(test_original_data.iloc[indice][0], test_original_data.iloc[indice][1]))

2017-02-03 11:26:03 - 41 threads running
2017-02-03 11:26:40 - 43 threads running
2017-02-03 11:27:50 - 48 threads running
2017-02-03 11:32:07 - 78 threads running
2017-02-03 11:33:25 - 90 threads running
2017-02-12 10:06:58 - 36 threads running
2017-02-12 10:12:11 - 60 threads running
2017-02-12 10:12:30 - 64 threads running

And there we have it! Dates and hours when something really out of the ordinary happened. No need to create a config file for each service, guess thresholds, adjust them … nothing. Just let the model learn, and you get alerts when something unexpected happens. Push all the metrics from your services to these models, and let them do the hard work.

Summary

Most companies have similar situations. Companies add new services on hundreds of servers, and monitoring is an essential part of the infrastructure. The old method of monolithic config files with some thresholds doesn't scale, because it needs a lot of manual work and a trial-and-error approach. The techniques explained in this blog post can help us deploy monitoring on hundreds of servers without really caring about the different nuances of each service or workload. It is even possible to start monitoring a service without knowing anything about it: just let the probabilistic model take care of it.

It is important to clarify that, in my opinion, these fault detection models are not going to be a substitute for software like Nagios. In those areas where a binary test is needed (service is up/down for example), Nagios and other similar services do a good job. Actually, a Nagios check can use the procedure explained here. When there are many metrics to analyze, probabilistic methods can save us from a nightmare.

Migrating MongoDB Away from MMAPv1

March 8, 2017 - 3:16pm

This is another post in the series of blogs on the Percona Server for MongoDB 3.4 bundle release. In this blog post, we’ll discuss moving away from the MMAPv1 storage engine.

Introduction

With the MongoDB v3.0 release in February of 2015, the long-awaited ability to choose storage engines became a reality. As of version 3.0, you could choose two engines in MongoDB Community Server and, if you use Percona Server for MongoDB, you could choose from four. Here's a table for ease of consumption:

Storage Engine   Percona Server for MongoDB   MongoDB Community Server   MongoDB Enterprise Server (licensed)
MMAPv1           Yes                          Yes                        Yes
WiredTiger       Yes                          Yes                        Yes
MongoRocks       Yes                          No                         No
In-memory        Yes                          No                         Yes
Encrypted        No                           No                         Yes

Why change engines?

With increased possibilities comes increased difficulty in decision making (a concept that gets reinforced every time I take my mother out to a restaurant with a large menu: ordering is never quick). In all seriousness, the introduction of the storage engine API to MongoDB is possibly the single greatest feature MongoDB, Inc has released to date.

One of the biggest gripes from the pre-v3.0 days was MongoDB's lack of scale. This was mostly due to the MMAPv1 storage engine, which suffered from a very primitive locking scheme. If you would like an illustration of the problem, think of the world's biggest supermarket with one checkout line: you might be able to fit in lots of shoppers, but they're not going to accomplish their goal quickly. So, the ability to increase performance and concurrency with a simple switch is huge! Additionally, modern storage engines support compression. This should reduce your space utilization by at least 50% when switching.

All the way up to MongoDB v3.2, the default storage engine was MMAPv1. If you didn’t make a conscious decision about what storage engine to choose when you started using MongoDB, there is a good chance that MMAPv1 is what you’re on. If you’d like to find out for sure what engine you’re using, simply run the command below. The output will be the name of the storage engine. As you can see, I was running the MMAPv1 storage engine on this machine. Now that we understand where we’re at, let’s get into where we can be in the future.

db.serverStatus().storageEngine.name mmapv1

Public Service Announcement

Before we get into what storage engine(s) to evaluate, we need to talk about testing. In my experience, a majority of the MySQL and MongoDB community members are rolling out changes to production without planning or testing. If you’re in the same boat, you’re in very good company (or at least in a great deal of company). However, you should stop this practice. It’s basic “sample size” in statistics – when engaged in risk-laden behavior, the optimal time to stop increasing the sample size is prior to the probability of failure reaching “1”. In other words, start your testing and planning process today!

At Percona, we recommend that you thoroughly test any database changes in a testing or development environment before you decide to roll them into production. Additionally, prior to rolling the changes into production (with a well thought out plan, of course), you’ll need to have a roll-back plan in case of unintended consequences. Luckily, with MongoDB’s built-in replication and election protocols, both are fairly easy. The key here is to plan. This is doubly true if you are undertaking a major version upgrade, or are jumping over major versions. With major version upgrades (or version jumps) comes the increased likelihood of a change in database behavior as it relates to your application’s response time (or even stability).

What should I think about?

In the table above, we listed the pre-packaged storage engine options that are available to us and other distributions. We also took a look at why you should consider moving off of MMAPv1 in the preceding section. To be clear, in my opinion a vast majority of MongoDB users that are on MMAPv1 can benefit from a switch. Which engine to switch to is the pressing question. Your first decision should be to evaluate whether or not your workload fits into the sweet spot for MMAPv1 by reading the section below. If that section doesn’t describe your application, then the additional sections should help you narrow down your choices.

Now, let’s take a look at what workloads match up with what storage engines.

MMAPv1

Believe it or not, there are some use cases where MMAPv1 is likely to give you as good (or better) performance as any other engine. If you’re not worried about the size of your database on disk, then you may not want to bother changing engines. Users that are likely to see no benefit from changing have read-heavy (or 100%) read applications. Also, certain update-heavy use cases, where you’re updating small amounts of data or performing $set operations, are likely to be faster on MMAPv1.

WiredTiger

WiredTiger is the new default storage engine for MongoDB. It is a good option for general workloads that are currently running on MMAPv1. WiredTiger will give you good performance for most workloads and will reduce your storage utilization with compression that’s enabled by default. If you have a write-heavy workload, or are approaching high I/O utilization (>55%) with plans for it to rise, then you might benefit from a migration to WiredTiger.

MongoRocks (RocksDB from Facebook)

This is Facebook’s baby, which was forged in the fires of the former Parse business unit. MongoRocks, which uses LSM indexing, is advertised as “designed to work with fast storage.” Don’t let this claim fool you. For workloads that are heavy on writes, highly concurrent or approaching disk bound, MongoRocks could give you great benefits. In terms of compression, MongoRocks has the ability to efficiently handle deeper compression algorithms, which should further decrease your storage requirements.

In-Memory

The in-memory engine, whether we’re speaking about the MongoDB or Percona implementation, should be used for workloads where extreme low latency is the most important requirement. The types of applications that I’m talking about are usually low-latency, “real-time” apps — like decision making or user session tracking. The in-memory engine is not persistent, so it operates strictly out of the cache allocated to MongoDB. Consequently, the data may (and likely will) be lost if the server crashes.

Encrypted

This is for applications in highly secure environments where on-disk encryption is necessary for compliance. This engine will protect the MongoDB data in the case that a disk or server is stolen. On the flip side, this engine will not protect you from a hacker that has access to the server (MongoDB shell), or can intercept your application traffic. Another way to achieve the same level of encryption for compliance is using volume level encryption like LUKS. An additional benefit of volume level encryption, since it works outside the database, is re-use on all compliant servers (not just MongoDB).

Getting to your new engine

Switching to the new engine is actually pretty easy, especially if you’re running a replica set. One important caveat is that unlike MySQL, the storage engine can only be defined per mongod process (not per database or collection). This means that it’s an all or nothing operation on a single MongoDB process. You’ll need to reload your data on that server. This is necessary because the data files from one engine are not compatible with another engine. Thus reloading the data to transform from one engine format to another is necessary. Here are the high-level steps (assuming you’re running a replica set):

  1. Make sure you’re not in your production environment
  2. Backup your data (it can’t hurt)
  3. Remove a replica set member
  4. Rename (or delete) the old data directory. The member will re-sync with the replica set
    • Make sure you have enough disk space if you’re going to keep a copy of the old data directory
  5. Update the mongo.conf file to use a new storage engine. Here’s an example for RocksDB from our documentation:
    storage:
      engine: rocksdb
      rocksdb:
        cacheSizeGB: 4
        compression: snappy
  6. Start the MongoDB process again
  7. Join the member to the replica set (initial sync will happen; see the shell sketch after this list)
  8. When the updated member is all caught up, pick another member and repeat the process.
  9. Continue until the primary is the only server left. At this point, you should step down the primary, but hold off switching storage engines until you are certain that the new storage engine meets your needs.
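
Steps 3 and 7 are performed from the mongo shell connected to the primary. A minimal sketch, assuming a member at db2.example.com:27017 (the hostname is illustrative only):

// On the PRIMARY: remove the member you are about to convert (step 3)
rs.remove("db2.example.com:27017")

// ...rename the data directory, update mongod.conf with the new engine, restart mongod...

// Back on the PRIMARY: add the member again, which triggers the initial sync (step 7)
rs.add("db2.example.com:27017")

// Watch the member move from STARTUP2/RECOVERING to SECONDARY
rs.status().members.forEach(function (m) { print(m.name + "  " + m.stateStr) })
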
The Wrap Up

At this point I’ve explained how you can understand your options, where you can gain additional performance and what engines to evaluate. Please don’t forget to test your application with the new setup before launching into production. Please drop a comment below if you found this helpful or, on the other hand, if there’s something that would make it more useful to you. Chances are, if you’d find something helpful, the rest of the community will as well.

Webinar Thursday, March 9, 2017: Troubleshooting Issues with MySQL Character Sets

March 6, 2017 - 3:52pm

Please join Percona's Principal Technical Services Engineer, Sveta Smirnova, as she presents "Troubleshooting Issues with MySQL Character Sets" on March 9, 2017, at 11:00 am PST / 2:00 pm EST (UTC-8).

Register Now

Many MySQL novices find MySQL character set support puzzling. But after you understand how it is designed, you will find it much more powerful than many competing database solutions.

MySQL lets you specify a character set for every object, and change it online. For years this has helped developers create fast applications that can work with audiences all around the world. However, it also requires any DBA troubleshooting character set issues to have a deep understanding of how they work. Different sort rules and collations can complicate the process.
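
As a small illustration of that flexibility (the table and column names below are made up), a character set and collation can be declared per table or per column, and converted later:

-- Character set and collation can be set per table and even per column
CREATE TABLE greetings (
    id INT PRIMARY KEY,
    msg_latin VARCHAR(100) CHARACTER SET latin1,
    msg_utf8  VARCHAR(100) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci
) DEFAULT CHARACTER SET utf8mb4;

-- Convert an existing table (and its text columns) to another character set
ALTER TABLE greetings CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;

-- Inspect what is currently in effect
SHOW CREATE TABLE greetings\G
SELECT @@character_set_server, @@collation_server;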

In the webinar we will discuss:

  • Which character sets and collations MySQL supports
  • How they can be set
  • How to understand error messages
  • How to solve character sets/collations compatibility issues
  • What server, application, command line and graphical tool options are available
  • What to check first and how to continue troubleshooting
  • What the various compatibility issues are
  • How to convert data created in earlier versions
  • What the best practices are

Register for the webinar here.

Sveta Smirnova, Principal Technical Services Engineer

Sveta joined Percona in 2015. Her main professional interests are problem-solving, working with tricky issues and bugs, finding patterns that can solve typical issues quicker, and teaching others how to deal with MySQL issues, bugs and gotchas effectively. Before joining Percona, Sveta worked as a Support Engineer in the MySQL Bugs Analysis Support Group at MySQL AB (later Sun and Oracle). She is the author of the book "MySQL Troubleshooting" and of the JSON UDF functions for MySQL.

MySQL, --i-am-a-dummy!

March 6, 2017 - 11:54am

In this blog post, we’ll look at how “operator error” can cause serious problems (like the one we saw last week with AWS), and how to avoid them in MySQL using --i-am-a-dummy.

Recently, AWS had some serious downtime in their East region, which they explained as the consequence of a bad deployment. It seems like most of the Internet was affected in one way or another. Some on Twitter dubbed it “S3 Dependency Awareness Day.”

Since the outage, many companies (especially Amazon!) are reviewing their production access and deployment procedures. It would be a lie if I claimed I’ve never made a mistake in production. In fact, I would be afraid of working with someone who claims to have never made a mistake in a production environment.

Making a mistake or two is how you learn to have a full sense of fear when you start typing:

UPDATE t1 SET c1='x' ...

I think many of us have experienced forehead sweats and hand shaking in these cases – they save us from major mistakes!

The good news is that MySQL can help you with this. All you have to do is admit that you are human, and use the following command (you can also set this in your user directory .my.cnf):

mysql --i-am-a-dummy

Using this command (also known as safe-updates) sets the following SQL mode when logging into the server:

SET sql_safe_updates=1, sql_select_limit=1000, max_join_size=1000000;

The safe-updates and i-am-a-dummy flags were introduced together in MySQL 3.23.11, and according to some sites from around the time of release, they are "for users that once may have done a DELETE FROM table_name but forgot the WHERE clause."

What this does is ensure you can’t perform an UPDATE or DELETE without a WHERE clause. This is great because it forces you to think through what you are doing. If you still want to update the whole table, you need to do something like WHERE ID > 0. Interestingly, safe-updates also blocks the use of WHERE 1, which means “where true” (or basically everything).
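
With safe-updates enabled, a WHERE-less statement is rejected before it runs. A sketch of what that looks like (the exact error text can vary between MySQL versions):

mysql> UPDATE t1 SET c1 = 'x';
ERROR 1175 (HY000): You are using safe update mode and you tried to update a table without a WHERE that uses a KEY column.

-- Forcing the full-table update still works, but now it is a conscious decision:
mysql> UPDATE t1 SET c1 = 'x' WHERE id > 0;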

The other safety you get with this option is that SELECT is automatically limited to 1000 rows, and JOIN is limited to examining 1 million rows. You can override these latter limits with extra flags, such as:

--select_limit=500 --max_join_size=10000

I have added this to the .my.cnf on my own servers, and definitely use this with my clients.

MongoDB Audit Log: Why and How

March 3, 2017 - 3:24pm

This blog post is another in the series on the MongoDB 3.4 bundle release. In this blog post, we’ll talk about the MongoDB audit log.

Percona's development team has always made investing in the open-source community a priority, especially for MongoDB. As part of this commitment, Percona continues to build MongoDB Enterprise Server features into our free, alternative, open-source Percona Server for MongoDB. One of the key features that we have added to Percona Server for MongoDB is audit logging. Auditing your MongoDB environment strengthens your security and helps you keep track of who did what in your database.

In this blog post, we will show how to enable this functionality, what general actions can be logged, and how you can filter only the information that is important for your use-case.

Enable Audit Log

Audit messages can be logged into syslog, console or file (JSON or BSON format). In most cases, it’s preferable to log to the file in BSON format (the performance impact is smaller than JSON). In the last section, you can find some simple examples of how to further query this type of file.

Enable the audit log in the command line or the config file with:

mongod --dbpath /var/lib/mongodb --auditDestination file --auditFormat BSON --auditPath /var/lib/mongodb/auditLog.bson

auditLog:
  destination: file
  format: BSON
  path: /var/lib/mongodb/auditLog.bson

Just note that until this bug is fixed and released, if you're using Percona Server for MongoDB and the --fork option while starting the mongod instance, you'll have to provide an absolute path for the audit log file instead of a relative path.

Actions logged

Generally speaking, the following actions can be logged:

  • Authentication and authorization
  • Cluster operations
  • Read and write operations (logged under authCheck event and require auditAuthorizationSuccess parameter to be enabled)
  • Schema operations
  • Custom application messages (logged under the applicationMessage event when the client/app issues a logApplicationMessage command; the user needs the clusterAdmin role, or one that inherits from it, to issue this command)

You can see the whole list of actions logged here.

By default, MongoDB doesn't log all the read and write operations. So if you want to track those, you'll have to enable the auditAuthorizationSuccess parameter. They will then be logged under the authCheck event. Note that this can have a serious performance impact.

Also, this parameter can be enabled dynamically on an already running instance with the audit log set up, while some other settings can't be changed once configured.

Enable logging of CRUD operations in the command line or config file:

mongod --dbpath /var/lib/mongodb --setParameter auditAuthorizationSuccess=true --auditDestination file --auditFormat BSON --auditPath /var/lib/mongodb/auditLog.bson

auditLog:
  destination: file
  format: BSON
  path: /var/lib/mongodb/auditLog.bson
setParameter: { auditAuthorizationSuccess: true }

Or to enable it on the running instance, issue this command in the client:

db.adminCommand( { setParameter: 1, auditAuthorizationSuccess: true } )

Filtering

If you don’t want to track all the events MongoDB is logging by default, you can specify filters in the command line or the config file. Filters need to be valid JSON queries on the audit log message (format available here). In the filters, you can use standard query selectors ($eq, $in, $gt, $lt, $ne, …) as well as regex. Note that you can’t change the filters dynamically after the start.

Also, Percona Server for MongoDB 3.2 and 3.4 have slightly different message formats. 3.2 uses a “params” field, and 3.4 uses “param” just like MongoDB. When filtering on those fields, you might want to check for the difference.

Filter only events from one user:

mongod --dbpath /var/lib/mongodb --auditDestination file --auditFormat BSON --auditPath /var/lib/mongodb/auditLog.bson --auditFilter '{ "users.user": "prod_app" }'

auditLog:
  destination: file
  format: BSON
  path: /var/lib/mongodb/auditLog.bson
  filter: '{ "users.user": "prod_app" }'

Filter events from several users based on username prefix (using regex):

mongod --dbpath /var/lib/mongodb --auditDestination file --auditFormat BSON --auditPath /var/lib/mongodb/auditLog.bson --auditFilter '{ "users.user": /^prod_app/ }'

auditLog:
  destination: file
  format: BSON
  path: /var/lib/mongodb/auditLog.bson
  filter: '{ "users.user": /^prod_app/ }'

Filtering multiple event types by using standard query selectors:

mongod --dbpath /var/lib/mongodb --auditDestination file --auditFormat BSON --auditPath /var/lib/mongodb/auditLog.bson --auditFilter '{ atype: { $in: [ "dropCollection", "dropDatabase" ] } }'

auditLog:
  destination: file
  format: BSON
  path: /var/lib/mongodb/auditLog.bson
  filter: '{ atype: { $in: [ "dropCollection", "dropDatabase" ] } }'

Filter read and write operations on all the collections in the test database (notice the double escape of dot in regex):

mongod --dbpath /var/lib/mongodb --auditDestination file --auditFormat BSON --auditPath /var/lib/mongodb/auditLog.bson --setParameter auditAuthorizationSuccess=true --auditFilter '{ atype: "authCheck", "param.command": { $in: [ "find", "insert", "delete", "update", "findandmodify" ] }, "param.ns": /^test\\./ }'

auditLog:
  destination: file
  format: BSON
  path: /var/lib/mongodb/auditLog.bson
  filter: '{ atype: "authCheck", "param.command": { $in: [ "find", "insert", "delete", "update", "findandmodify" ] }, "param.ns": /^test\\./ }'
setParameter: { auditAuthorizationSuccess: true }

Example messages

Here are two example messages from an audit log file. The first one is from a failed client authentication, and the second one is from a user who tried to insert a document into a collection for which they have no write authorization.

> bsondump auditLog.bson {"atype":"authenticate","ts":{"$date":"2017-02-14T14:11:29.975+0100"},"local":{"ip":"127.0.1.1","port":27017},"remote":{"ip":"127.0.0.1","port":42634},"users":[],"roles":[],"param":{"user":"root","db":"admin","mechanism":"SCRAM-SHA-1"},"result":18}

> bsondump auditLog.bson {"atype":"authCheck","ts":{"$date":"2017-02-14T14:15:49.161+0100"},"local":{"ip":"127.0.1.1","port":27017},"remote":{"ip":"127.0.0.1","port":42636},"users":[{"user":"antun","db":"admin"}],"roles":[{"role":"read","db":"admin"}],"param":{"command":"insert","ns":"test.orders","args":{"insert":"orders","documents":[{"_id":{"$oid":"58a3030507bd5e3486b1220d"},"id":1.0,"item":"paper clips"}],"ordered":true}},"result":13}

Querying audit log for specific event

The audit log feature is now working, and we have some data in the BSON binary file. How do I query it to find some specific event that interests me? Obviously there are many simple or more complex ways to do that using different tools (Apache Drill or Elasticsearch come to mind), but for the purpose of this blog post, we’ll show two simple ways to do that.

The first way without exporting data anywhere is using the bsondump tool to convert BSON to JSON and pipe it into the jq tool (command-line JSON processor) to query JSON data. Install the jq tool in Ubuntu/Debian with:

sudo apt-get install jq

Or in CentOS with:

sudo yum install epel-release
sudo yum install jq

Then, if we want to know who created a database with the name “prod” for example, we can use something like this (I’m sure you’ll find better ways to use the jq tool for querying this kind of data):

> bsondump auditLog.bson | jq -c 'select(.atype == "createDatabase") | select(.param.ns == "prod")'
{"atype":"createDatabase","ts":{"$date":"2017-02-17T12:13:48.142+0100"},"local":{"ip":"127.0.1.1","port":27017},"remote":{"ip":"127.0.0.1","port":47896},"users":[{"user":"prod_app","db":"admin"}],"roles":[{"role":"root","db":"admin"}],"param":{"ns":"prod"},"result":0}

In the second example, we’ll use the mongorestore tool to import data into another instance of mongod, and then just query it like a normal collection:

> mongorestore -d auditdb -c auditcol auditLog.bson
2017-02-17T12:28:56.756+0100    checking for collection data in auditLog.bson
2017-02-17T12:28:56.797+0100    restoring auditdb.auditcol from auditLog.bson
2017-02-17T12:28:56.858+0100    no indexes to restore
2017-02-17T12:28:56.858+0100    finished restoring auditdb.auditcol (142 documents)
2017-02-17T12:28:56.858+0100    done

The import is done, and now we can query the collection for the same data from the MongoDB client:

> use auditdb
switched to db auditdb
> db.auditcol.find({atype: "createDatabase", param: {ns: "prod"}})
{ "_id" : ObjectId("58a6de78bdf080b8e8982a4f"), "atype" : "createDatabase", "ts" : { "$date" : "2017-02-17T12:13:48.142+0100" }, "local" : { "ip" : "127.0.1.1", "port" : 27017 }, "remote" : { "ip" : "127.0.0.1", "port" : 47896 }, "users" : [ { "user" : "prod_app", "db" : "admin" } ], "roles" : [ { "role" : "root", "db" : "admin" } ], "param" : { "ns" : "prod" }, "result" : 0 }

It looks like the audit log in MongoDB/Percona Server for MongoDB is a solid feature. Setting up tracking for information that is valuable to you only depends on your use case.

A Look at MariaDB Subquery Cache

March 2, 2017 - 6:11pm

The MariaDB subquery cache feature added in MariaDB 5.3 is not widely known. Let’s see what it is and how it works.

What is a subquery cache?

The MariaDB subquery cache optimizes the execution of correlated subqueries. Correlated subqueries refer to a value from the parent query. For example:

SELECT id FROM product WHERE price NOT IN (SELECT MAX(price) FROM product GROUP BY category);

MariaDB only uses this optimization if the parent query is a SELECT, not an UPDATE or a DELETE. The subquery results get cached only for the duration of the parent query.

MariaDB added the subquery cache in v5.3. It is controlled by optimizer_switch, a dynamic variable that contains many flags that enable or disable several optimizations. To disable the subquery cache, run:

SET GLOBAL optimizer_switch='subquery_cache=OFF';

You can also do this at the session level.

How does subquery cache work?

Let’s see how it works. To make things clearer, we will use an example. Consider these tables:

CREATE TABLE t1 (a INT, b INT);
INSERT INTO t1 VALUES (1,2), (3,4), (1,2), (3,4), (3,4), (3,5), (3,5), (5,1), (5,2), (3,6), (1,5);
CREATE TABLE t2 (c INT, d INT);
INSERT INTO t2 VALUES (1,10), (2,20), (3,30), (4,40);

Now, we issue this query:

SELECT b, (SELECT d FROM t2 WHERE a = c) FROM t1;

The server decides to read t1 first (the bigger table, as expected), and then access t2 using the subquery cache. It creates a MEMORY temporary table to store the results of the subquery, with an index on c (used to match the rows). Then it reads the first row from t1 and checks whether the search is cached. It is not, so it reads t2 looking for rows with c=1 and copies the results into the cache. The next time it finds the value 1, it will not need to access t2 because the matches are already cached. If you look at the data, you may notice that the value "5" appears twice in t1 (and is absent from t2). But the search is cached anyway, so the server searches for 5 in t2 only once.

I hope that you aren’t blindly accepting what I wrote until now: good DBAs need facts and metrics. Let’s be scientific: we’ll make a prediction, conduct an experiment and check the status variables to verify the prediction. The prediction is the list of rows that will be read from t1, with the running total of hits and misses:

(1,2) -- Miss: 1
(3,4) -- Miss: 2
(1,2) -- Hit: 1
(3,4) -- Hit: 2
(3,4) -- Hit: 3
(3,5) -- Hit: 4
(3,5) -- Hit: 5
(5,1) -- Miss: 3
(5,2) -- Hit: 6
(3,6) -- Hit: 7
(1,5) -- Hit: 8

MariaDB [test]> SHOW STATUS LIKE "subquery_cache%";
+---------------------+-------+
| Variable_name       | Value |
+---------------------+-------+
| Subquery_cache_hit  | 8     |
| Subquery_cache_miss | 3     |
+---------------------+-------+
2 rows in set (0.00 sec)

MariaDB [test]> SHOW STATUS LIKE "handler_tmp_write";
+-------------------+-------+
| Variable_name     | Value |
+-------------------+-------+
| Handler_tmp_write | 3     |
+-------------------+-------+
1 row in set (0.00 sec)

The totals match, and the number of writes to the cache is equal to the misses (after a miss, a table access is done and cached).

The maximum size of an individual table is the minimum of tmp_table_size and max_heap_table_size. If the table size grows over this limit, the table is written to disk. If the MEMORY table creation fails (perhaps because MEMORY does not support BLOB), the subquery is not cached.
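
To see the limit in effect on your server, a quick check (the values shown are just an example):

MariaDB [test]> SHOW VARIABLES WHERE Variable_name IN ('tmp_table_size', 'max_heap_table_size');
+---------------------+----------+
| Variable_name       | Value    |
+---------------------+----------+
| max_heap_table_size | 16777216 |
| tmp_table_size      | 16777216 |
+---------------------+----------+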

The total of hits and misses can be seen by reading two status variables: subquery_cache_hit and subquery_cache_miss. After 200 misses, the server checks the hit ratio for that particular subquery. If it is < 20%, it disables the cache for that subquery. If the hit rate is < 70% the table cannot be written to disk in case it exceeds the size limit. These numbers (200, 0.2, 0.7) are hardcoded and cannot be changed. But if you really want to test how MariaDB behaves with different parameters, you can change these constants in sql/sql_expression_cache.cc and recompile the server.

Isn’t this subquery materialization?

Subquery materialization is another strategy that the optimizer can choose to execute a query. It might look similar, because some data from a subquery are written to a MEMORY table – but this is the only similarity. The purpose and implementation of subquery materialization is different.

Let’s try to explain this with pseudocode.

Subquery materialization is for uncorrelated IN subqueries. Therefore the subquery is executed and materialized altogether:

# Query to optimize: SELECT ... WHERE col1 IN (subquery)
materialize subquery into a MEMORY table with UNIQUE keys;
foreach (row in outer query) {
    check if col1 current value exists in materialized table
}

The subquery cache is for correlated subqueries. Thus the subquery gets executed only for non-cached values:

# Query to optimize: SELECT col1, (SELECT ... WHERE ... = col1) ... FROM ...
foreach (outer query row) {
    if (col1 current value is cached) {
        read from cache
    } else {
        read from subquery
        cache col1 current value
    }
}

Some considerations

Despite the similarity in names, the MariaDB subquery cache is not a query cache for subqueries. These features are different, implemented for different purposes. Obviously, the subquery cache doesn’t have the scalability and performance problems of the query cache (global mutex, table invalidation). As mentioned, a subquery cache table only survives for the duration of a statement, so it should be considered an optimizer strategy. For example, in some cases you might use the subquery cache for a WHERE … NOT IN subquery, but not for the WHERE … IN version, because the optimizer prefers to rewrite it as a JOIN.

Of course, not all correlated subqueries automatically benefit from this feature. The example above is built to show that the subquery cache is useful, but we can easily build an example to show that it can have a negative impact on performance: add rows to t1, and delete all duplicate values of a. There will be no hits, a temporary table is created, 200 reads and writes are performed, and none of it helps. After 200 misses the cache is disabled, yes, but what if this happens for each subquery? The damage may not be huge in a realistic case, but it is still damage. That's why you can disable the MariaDB subquery cache.

Using Percona Toolkit pt-mongodb-summary

March 2, 2017 - 3:08pm

In this blog post, we’ll look at the pt-mongodb-summary tool in Percona Toolkit.

The pt-mongodb-summary tool from Percona Toolkit provides a quick at-a-glance overview of MongoDB and Percona Server for MongoDB instances. It is equivalent to pt-mysql-summary for MySQL. 

pt-mongodb-summary also collects information about a MongoDB cluster. It collects information from several sources to provide an overview of the cluster.

How It Works

The usage for the command is as follows:

pt-mongodb-summary [OPTIONS] [HOST[:PORT]]

Options:

  • -a, --auth-db: Specifies the database used to establish credentials and privileges with a MongoDB server. By default, the admin database is used.
  • -p, --password: Specifies the password to use when connecting to a server with authentication enabled. Do not add a space between the option and its value: -p<password>.
    If you specify the option without any value, pt-mongodb-summary will ask for the password interactively.
  • -u, --user: Specifies the user name for connecting to a server with authentication enabled.

By default, if you run pt-mongodb-summary without parameters, it tries to connect to the localhost on port 27017. It collects information about the MongoDB instances by running administration commands, and formatting the output.

Sections

Instances

The first thing the tool does is get the list of hosts connected to the specified MongoDB instance by running the listShards command. It also runs replSetGetStatus on every instance to collect the ID, type, and replica set for each instance.
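
Roughly the same raw data can be pulled by hand from the mongo shell; a minimal sketch (run listShards against a mongos and replSetGetStatus against a mongod):

// Against mongos: which shards make up the cluster
db.adminCommand({ listShards: 1 })

// Against each mongod: member id, state and replica set name
db.adminCommand({ replSetGetStatus: 1 })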

This host

Next, it gathers information about the host it is connected to by grouping information collected from hostInfo, getCmdLineOpts, serverStatus and the OS process (by process ID). The result provides an overview of the running instance and the underlying OS.

Running ops

This section collects statistics by running the serverStatus command five times at regular intervals (every one second), and provides the minimum, maximum and average operation counters for insert, query, update, delete, getMore and command operations.

Security

This section collects information about security configurations by parsing the getCmdLineOpts command output and querying the admin.system.users and admin.system.roles collections.

Oplog

From the MongoDB website:

The oplog (operations log) is a special capped collection that keeps a rolling record of all operations that modify the data stored in your databases. MongoDB applies database operations on the primary and then records the operations on the primary’s oplog. The secondary members then copy and apply these operations in an asynchronous process. All replica set members contain a copy of the oplog, in the local.oplog.rs collection, which allows them to maintain the current state of the database.

How do we get the oplog info? The program collects oplog statistics from every host in the cluster, and returns the statistics with the smallest TimeDiffHours value.
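
For reference, similar oplog information is available interactively from the mongo shell on any replica set member:

// Oplog size, used space and the time window it currently covers
rs.printReplicationInfo()

// How far behind the primary each secondary currently is
rs.printSlaveReplicationInfo()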

Cluster-wide

This section provides information about the number of sharded/unsharded databases, collections and their sizes. The information is collected by running the listDatabases command, and then running collStats for every collection in every database.
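
A minimal sketch of those underlying commands, runnable from the mongo shell (the database and collection names are illustrative):

// Databases and their sizes
db.adminCommand({ listDatabases: 1 })

// Per-collection statistics (size, document count, whether it is sharded)
db.getSiblingDB("samples").runCommand({ collStats: "col1" })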

Conditional Sections

You may notice not all sections appear all the time. This is because there are three main patterns:

Sharded Connection to Mongos
  • Instances
  • This host
  • Running ops
  • Security
  • Cluster-wide

ReplicaSet Connection
  • Instances (limited to the current Replica Set)
  • This host
  • Running ops
  • Security
  • Oplog

Standalone Connection
  • Instances (limited to this host)
  • This host
  • Running ops
  • Security

Output Example

The following is an example of the output for pt-mongodb-summary:

./pt-mongodb-summary

# Instances ##############################################################################################
  PID    Host               Type                  ReplSet   Engine
11037    localhost:17001    SHARDSVR/PRIMARY      r1        wiredTiger
11065    localhost:17002    SHARDSVR/SECONDARY    r1        wiredTiger
11136    localhost:17003    SHARDSVR/SECONDARY    r1        wiredTiger
11256    localhost:17004    SHARDSVR/ARBITER      r1        wiredTiger
11291    localhost:18001    SHARDSVR/PRIMARY      r2        wiredTiger
11362    localhost:18002    SHARDSVR/SECONDARY    r2        wiredTiger
11435    localhost:18003    SHARDSVR/SECONDARY    r2        wiredTiger
11513    localhost:18004    SHARDSVR/ARBITER      r2        wiredTiger
11548    localhost:19001    CONFIGSVR             -         wiredTiger
11571    localhost:19002    CONFIGSVR             -         wiredTiger
11592    localhost:19003    CONFIGSVR             -         wiredTiger

# This host
# Mongo Executable #######################################################################################
       Path to executable | /home/karl/tmp/MongoDB32Labs/3.2/bin/mongos
# Report On karl-HP-ENVY ########################################
                     User | karl
                PID Owner | mongos
                 Hostname | karl-HP-ENVY
                  Version | 3.2.4
                 Built On | Linux x86_64
                  Started | 2017-02-22 11:39:20 -0300 ART
                Processes | 12
             Process Type | mongos

# Running Ops ############################################################################################
Type       Min   Max   Avg
Insert       0     0   0/5s
Query        0     0   0/5s
Update       0     0   0/5s
Delete       0     0   0/5s
GetMore      0     0   0/5s
Command      1     1   5/5s

# Security ###############################################################################################
Users : 0
Roles : 0
Auth  : disabled
SSL   : disabled
Port  : 0
Bind IP:

# Cluster wide ###########################################################################################
            Databases: 4
          Collections: 21
  Sharded Collections: 5
Unsharded Collections: 16
    Sharded Data Size: 134.87 MB
  Unsharded Data Size: 1.44 GB
### Chunks:
   5 : samples.col2
 132 : carlos.sample4
 400 : carlos.sample3
  50 : carlos.sample2
 100 : carlos.sample1
# Balancer (per day)
 Success: 18
  Failed: 0
  Splits: 682
   Drops: 0

 The following is an output example when connected to a secondary in the replica set.

./pt-mongodb-summary localhost:17002

# Instances ##############################################################################################
  PID    Host               Type                  ReplSet   Engine
 9247    localhost:17001    SHARDSVR/PRIMARY      r1        wiredTiger
 9318    localhost:17002    SHARDSVR/SECONDARY    r1        wiredTiger
 9391    localhost:17003    SHARDSVR/SECONDARY    r1        wiredTiger
 9466    localhost:17004    SHARDSVR/ARBITER      r1        wiredTiger

# This host
# Mongo Executable #######################################################################################
       Path to executable | /home/karl/tmp/MongoDB32Labs/3.2/bin/mongod
# Report On karl-HP-ENVY:17002 ########################################
                     User | karl
                PID Owner | mongod
                 Hostname | karl-HP-ENVY:17002
                  Version | 3.2.4
                 Built On | Linux x86_64
                  Started | 2017-02-23 10:26:27 -0300 ART
                  Datadir | labs/r1-2
                Processes | 12
             Process Type | replset

# Running Ops ############################################################################################
Type       Min   Max   Avg
Insert       0     0   0/5s
Query        0     0   0/5s
Update       0     0   0/5s
Delete       0     0   0/5s
GetMore      0     1   1/5s
Command      1     3  13/5s

# Security ###############################################################################################
Users : 0
Roles : 0
Auth  : disabled
SSL   : disabled
Port  : 17002
Bind IP:

This next example shows the output when connected to a standalone instance:

./pt-mongodb-summary localhost:27018

# Instances ##############################################################################################
  PID    Host               Type   ReplSet   Engine
    1    localhost:27018    -                wiredTiger

# This host
# Report On 2f8862dce6c4 ########################################
                PID Owner | mongod
                 Hostname | 2f8862dce6c4
                  Version | 3.2.10
                 Built On | Linux x86_64
                  Started | 2017-02-23 08:57:36 -0300 ART
                Processes | 1
             Process Type | mongod

# Running Ops ############################################################################################
Type       Min   Max   Avg
Insert       0     0   0/5s
Query        0     0   0/5s
Update       0     0   0/5s
Delete       0     0   0/5s
GetMore      0     0   0/5s
Command      1     1   5/5s

# Security ###############################################################################################
Users : 0
Roles : 0
Auth  : disabled
SSL   : disabled
Port  : 0
Bind IP:

Conclusion

The pt-mongodb-summary tool is new in Percona Toolkit. In the future, we hope we can make it grow to the size of its MySQL big brother!

Using Percona Toolkit pt-mongodb-query-digest

March 1, 2017 - 3:21pm

In this blog post, we’ll look at how to use the pt-mongodb-query-digest tool in Percona Toolkit 3.0.

Percona’s pt-query-digest is one of our most popular Percona Toolkit MySQL tools. It is used on a daily basis by DBAs and developers to help identify the queries consuming the most resources. It helps in finding bottlenecks and optimizing database usage. The pt-mongodb-query-digest is a similar tool for MongoDB.

About the Profiler

Before we start, remember that the MongoDB database profiler is disabled by default, and should be enabled. It can be enabled server-wide, but the full mode that logs all queries is not recommended in production unless you are using Percona Server for MongoDB 3.2 or higher. There, we added a feature that lets you set a sample rate for non-slow queries (as in MySQL) to limit the overhead this causes.

Additionally, by default the profiler collection is only 1MB per database. You may want to drop and re-create it with a size large enough for the results to be useful. To do this, use:

org_prof_level = db.getProfilingLevel();
// Disable Profiler
db.setProfilingLevel(0);
db.system.profile.drop();
// Setup a 100M profile; 1*Math.pow(1024,2) == 1M
profiler_size = 100 * Math.pow(1024,2);
db.runCommand( { create: "system.profile", capped: true, size: profiler_size } );
db.setProfilingLevel(org_prof_level);

According to the documentation, to check if the profiler is enabled for the samples database, run:

`echo "db.getProfilingStatus();" | mongo localhost:17001/samples`

Remember, you need to connect to a MongoDB instance, not a mongos. The output will be something like this:

MongoDB shell version: 3.2.12
connecting to: localhost:17001/samples
{ "was" : 0, "slowms" : 100 }
bye

The value for the field “was” is 0, which means profiling is disabled. Let’s enable the profiler for the samples database.

You must enable the profiler on all MongoDB instances that could be related to a shard of our database. To check on which instances we should enable the profiler, I am going to use the pt-mongodb-summary tool. It shows us the information we need about our cluster:

./pt-mongodb-summary

# Instances ##############################################################################################
  PID    Host               Type                  ReplSet   Engine
11037    localhost:17001    SHARDSVR/PRIMARY      r1        wiredTiger
11065    localhost:17002    SHARDSVR/SECONDARY    r1        wiredTiger
11136    localhost:17003    SHARDSVR/SECONDARY    r1        wiredTiger
11256    localhost:17004    SHARDSVR/ARBITER      r1        wiredTiger
11291    localhost:18001    SHARDSVR/PRIMARY      r2        wiredTiger
11362    localhost:18002    SHARDSVR/SECONDARY    r2        wiredTiger
11435    localhost:18003    SHARDSVR/SECONDARY    r2        wiredTiger
11513    localhost:18004    SHARDSVR/ARBITER      r2        wiredTiger
11548    localhost:19001    CONFIGSVR             -         wiredTiger
11571    localhost:19002    CONFIGSVR             -         wiredTiger
11592    localhost:19003    CONFIGSVR             -         wiredTiger

We have mongod service running on the localhost on ports 17001~17003 and 18001~18003.

Now, let’s enable the profiler for the samples database on those instances. For this example, I am going to set the profile level to “2”, to collect information about all queries.

for port in 17001 17002 17003 18001 18002 18003; do
    echo "db.setProfilingLevel(2);" | mongo localhost:${port}/samples;
done

Running pt-mongodb-query-digest

Now we are ready to get statistics about our queries. To run pt-mongodb-query-digest, we need to specify at least “host: port/database”, like:

./pt-mongodb-query-digest localhost:27017/samples

The output will be something like this (I am showing a section for only one query):

# Query 0:  0.27 QPS, ID 2c0e2f94937d6660f510adeea98618f3
# Ratio    1.00  (docs scanned/returned)
# Time range: 2017-02-22 12:27:21.004 -0300 ART to 2017-02-22 12:28:00.867 -0300 ART
# Attribute            pct     total      min       max       avg       95%     stddev    median
# ==================   ===  ========  ========  ========  ========  ========  =========  ========
# Count (docs)                   845
# Exec Time ms          99      1206         0       697         1         0         29         0
# Docs Scanned           7    594.00      0.00     75.00      0.70      0.00       7.19      0.00
# Docs Returned          7    594.00      0.00     75.00      0.70      0.00       7.19      0.00
# Bytes recv             0     8.60M    215.00     1.06M    10.17K    215.00    101.86K    215.00
# String:
# Namespaces           samples.col1
# Operation            query
# Fingerprint          user_id
# Query                {"user_id":{"$gte":3506196834,"$lt":3206379780}}

From the output, we can see that this query was seen 845 times, and the tool provides statistics for the number of documents scanned/returned by the server, the execution time and the size of the results. It also provides information regarding the operation type, the fingerprint and a query example to help identify the source.

By default, the results are sorted by query count. It can be changed by setting the --order-by parameter to: count, ratio, query-time, docs-scanned or docs-returned.

A “-” in front of the field name denotes the reverse order. Example:

--order-by=-ratio

When considering what ordering to use, decide whether you are looking for the most common queries (-count), the most cache-abusive queries (-docs-scanned), or the worst ratio of scanned to returned documents (-ratio). You may be tempted to use -query-time, but you will find that it mostly surfaces queries that are affected by issues rather than the ones causing them.

Conclusion

This is a new tool in the Percona Toolkit. We hope in the future we can make it grow like its big brother for MySQL (pt-query-digest). This tool helps DBAs and developers identify and solve bottlenecks, and keep servers running at top performance.

Open Source Databases on Big Machines: Disk Speed and innodb_io_capacity

March 1, 2017 - 3:00pm

In this blog post, I’ll look for the bottleneck that prevented the performance in my previous post from achieving better results.

The powerful machine I used in the tests in my previous post has a comparatively slow disk, and therefore I expected my tests would hit a point when I couldn’t increase performance further due to the disk speed.

Hardware configuration:

Processors: physical = 4, cores = 72, virtual = 144, hyperthreading = yes
Memory: 3.0T
Disk speed: about 3K IOPS
OS: CentOS 7.1.1503
File system: XFS

Versions tested and configuration: same as in the first post of this series (check the post for specifics).

Even though I expected my tests to stop scaling due to the disk speed, I did not observe high IO rates in the iostat output. I had already tested with a data set that fits entirely in memory; in that case, write performance only affects data flushes and log writes, but we should still see a visible decrease in speed. So I decided to run the RW tests entirely in memory: I created a ramdisk and put the MySQL datadir on it. Surprisingly, the results on the SSD and on the ramdisk did not differ.
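For reference, a minimal sketch of a ramdisk setup of this kind, with illustrative sizes and paths (not the exact commands from the test run):

# mount a tmpfs ramdisk and copy the existing datadir onto it
mkdir -p /mnt/ramdisk
mount -t tmpfs -o size=200G tmpfs /mnt/ramdisk
rsync -a /var/lib/mysql/ /mnt/ramdisk/mysql/
chown -R mysql:mysql /mnt/ramdisk/mysql
# then set datadir=/mnt/ramdisk/mysql in my.cnf and restart mysqld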

I asked my colleagues from “Postgres Professional” to test PostgreSQL with the ramdisk. They got similar results:

It’s interesting that the value of innodb_io_capacity has no effect in this situation. The data for the graph below was taken while I ran the tests on the ramdisk. I wanted to see if I could use this variable to control the IO activity of a disk that is already extremely fast.
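For reference, the variable is dynamic and can be changed at runtime while a test is running; the values below are purely illustrative:

mysql -e "SET GLOBAL innodb_io_capacity = 4000;"
mysql -e "SET GLOBAL innodb_io_capacity_max = 8000;"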

This totally contradicts all my past experience with smaller machines. Percona had re-purposed the machine with the faster disk that I used before (described in this post), so I ran the tests on a similar machine with slower disk speed.

Hardware configuration:

Processors: physical = 2, cores = 12, virtual = 24, hyperthreading = yes
Memory: 47.2G
Disk speed: about 3K IOPS
OS: Ubuntu 14.04.5 LTS (trusty)
File system: ext4

On this machine, with its smaller number of CPU cores, the innodb_io_capacity benchmarks again showed more predictable results.

Conclusion:

Both MySQL and PostgreSQL on a machine with a large number of CPU cores hit CPU resources limits before disk speed can start affecting performance. We only tested one scenario, however. With other scenarios, the results might be different.

Percona Monitoring and Management (PMM) Graphs Explained: MongoDB MMAPv1

February 28, 2017 - 3:28pm

This post is part of the series of Percona’s MongoDB 3.4 bundle release blogs. In this blog post, I hope to cover some areas to watch with Percona Monitoring and Management (PMM) when running MMAPv1. The graph examples from this article are from the MMAPv1 dashboard that will be released for the first time in PMM 1.1.2.

The MMAPv1 storage engine has existed since the very beginning of MongoDB; until MongoDB 3.0 added a pluggable storage engine API, it was the only option. While MMAPv1 often offers good read performance, it has become famous for its poor write performance and fragmentation at scale. This means there are many areas to watch regarding performance and monitoring.

Percona Monitoring and Management (PMM)

Percona Monitoring and Management (PMM) is an open-source platform for managing and monitoring MySQL and MongoDB. It was developed by Percona on top of open-source technology. Behind the scenes, the graphing features this article covers use Prometheus (a popular time-series data store), Grafana (a popular visualization tool), mongodb_exporter (our MongoDB database metric exporter) plus other technologies to provide database and operating system metric graphs for your database instances.

(Beware of) MMAPv1

mmap() is a system-level call that causes the operating system kernel to map on-disk files into memory while they are being read and written by a program.

As mmap() is a core feature of the Unix/Linux operating system kernel (and not the MongoDB code base), I’ve always felt that calling MMAPv1 a “storage engine” is quite misleading, although it does allow for a simpler explanation. The distinction and drawbacks of the storage logic being in the operating system kernel vs. the actual database code (like most database storage engines) becomes very important when monitoring MMAPv1.

As Unix/Linux are general-purpose operating systems that can have many processes, users and use cases, they offer limited OS-level metrics for the activity, latency and performance of mmap(). Those metrics cover the entire operating system, not just the MongoDB processes.

mmap() uses memory from available OS-level buffers/caches for mapping the MMAPv1 data to RAM — memory that can be “stolen” away by any other operating system process that asks for it. As many deployments “micro-shard” MMAPv1 to reduce write locks, this statement can become exponentially more important. If 3 x MongoDB instances run on a single host, the kernel fights to cache and evict memory pages created by 3 x different instances with no priority or queuing, essentially at random, while creating contention. This causes inefficiencies and less-meaningful monitoring values.

When monitoring MMAPv1, you should consider MongoDB AND the operating system as one “component” more than most engines. Due to this, it is critical that a database host runs a single MongoDB instance with no other processes except database monitoring tools such as PMM’s client. This allows MongoDB to be the only user of the operating system filesystem cache that MMAPv1 relies on. This also makes OS-level memory metrics more accurate because MongoDB is the only user of memory. If you need to “micro-shard” instances, I recommend using containers (Docker or plain cgroups) or virtualization to separate your memory for each MongoDB instance, with just one MongoDB instance per container.

Locking

MMAPv1 has locks for both reads and writes. In the early days the lock was global only; locking became per-database in v2.2 and per-collection in v3.0.

Locking is the leading cause of the performance issues we see on MMAPv1 systems, particularly write locking. To measure how much locking an MMAPv1 instance is waiting on, first we look at the “MMAPv1 Lock Ratio”:

Another important metric to watch is “MongoDB Lock Wait Time”, which breaks down the amount of time operations spend waiting on locks:

Three factors in combination influence locking:

  1. Data hotspots — if every query hits the same collection or database, locking increases
  2. Query performance — a lock is held for the duration of an operation; if that operation is slow, lock time increases
  3. Volume of queries — self-explanatory
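The raw counters behind these lock graphs come from serverStatus(). A minimal sketch for checking them outside of PMM, assuming a mongod listening on the default port:

echo 'printjson(db.serverStatus().globalLock); printjson(db.serverStatus().locks)' | mongo localhost:27017/admin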

Page Faults

Page faults happen when MMAPv1 data is not available in the cache and needs to be fetched from disk. On systems whose data is smaller than memory, page faults usually only occur on reboot, or if the file system cache is dumped. On systems where data exceeds memory, they happen more frequently, because MongoDB is asked for data that is not in memory.

How often this happens depends on how your application accesses your data. If it accesses new or frequently-queried data, it is more likely to be in memory. If it accesses old or infrequent data, more page faults occur.

If page faults suddenly start occurring, check to see if your data set has grown beyond the size of memory. You may be able to reduce your data set by removing fragmentation (explained later).
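The page fault counter graphed by PMM is also exposed directly by serverStatus(). A minimal check, assuming a mongod on the default port:

echo 'printjson(db.serverStatus().extra_info)' | mongo localhost:27017/admin   # page_faults is reported on Linux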

Journaling

As MMAPv1 eventually flushes changes to disk in batches, journaling is essential for running MongoDB with any real data integrity guarantees. As well as being included in the lock statistic graphs mentioned above, there are some good metrics for journaling (which is a heavy consumer of disk writes).

Here we have “MMAPv1 Journal Write Activity”, showing the data rates of journaling (max 19MB/sec):

“MMAPv1 Journal Commit Activity” measures the commits to the journal ops/second:

A very useful metric for write query performance is “MMAPv1 Journaling Time” (there is another graph with 99th percentile times):

This is important to watch, as write operations need to wait for a journal commit. In the above example, “write_to_journal” and “write_to_data_files” are the main metrics I tend to look at. “write_to_journal” is the rate of changes being written to the journal, and “write_to_data_files” is the rate that changes are written to on-disk data.

If you see very high journal write times, you may need faster disks or, in sharding scenarios, more shards: adding shards spreads out the disk write load.
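The counters behind these journaling graphs live in the “dur” section of serverStatus() (only present when MMAPv1 journaling is enabled). A minimal check, assuming a mongod on the default port:

echo 'printjson(db.serverStatus().dur)' | mongo localhost:27017/admin   # includes timeMs.writeToJournal and timeMs.writeToDataFiles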

Background Flushing

“MMAPv1 Background Flushing Time” graphs the background operation that calls flushes to disk:

This process does not block the database, but does cause more disk activity.

Fragmentation

Due to the way MMAPv1 writes to disk, it creates a high rate of fragmentation (or holes) in its data files. Fragmentation slows down scan operations, wastes some filesystem cache memory and can use much more disk space than there is actual data. On many systems I’ve seen, the size of MMAPv1 data files on disk take over twice the true data size.

Currently, our Percona Monitoring and Management MMAPv1 support does not track this, but we plan to add it in the future.

To track it manually, look at the output of the “.stats()” command for a given collection (replace “sbtest1” with your collection name):

> 1 - ( db.sbtest1.stats().size / db.sbtest1.stats().storageSize )
0.14085410557184752

Here we can see this collection is about 14% fragmented on disk. To fix fragmentation, the most common fix is dropping and recreating the collection using a backup. Many just remove a replication member, clear the data and let it do a new replication initial sync.
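To check every collection in a database at once, here is a small sketch building on the same .stats() call (the database name “test” is a placeholder):

echo '
db.getCollectionNames().forEach(function(name) {
    var s = db.getCollection(name).stats();
    if (s.storageSize > 0) {
        print(name + ": " + ((1 - s.size / s.storageSize) * 100).toFixed(1) + "% fragmented");
    }
});' | mongo localhost:27017/test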

Operating System Memory

In PMM we have graphed the operating system cached memory as it acts as the primary cache for MMAPv1:

For the most part, “Cached” shows the amount of cached MMAPv1 data (assuming the host is only running MongoDB).

We also graph the dirty memory pages:

It is important that dirty pages do not exceed the hard dirty page limit (which causes pauses). It is also important that dirty pages don’t accumulate (which wastes cache memory). The “soft” dirty page limit is the limit that starts dirty page cleanup without pausing.

On this host, you could probably lower the soft limit to clean up memory faster, assuming the increase in disk activity is acceptable. This topic is covered in this post: https://www.percona.com/blog/2016/08/12/tuning-linux-for-mongodb/.
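To see the kernel-side values behind these graphs (the same sources the exporters scrape), you can check /proc/meminfo and the dirty page sysctls directly:

grep -E '^(Dirty|Cached)' /proc/meminfo
sysctl vm.dirty_ratio vm.dirty_background_ratio   # the "hard" and "soft" limits discussed above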

What’s Missing?

As mentioned earlier, fragmentation rates are missing for MMAPv1 (this would be a useful addition). Due to the limited nature of the metrics offered for MMAPv1, PMM probably won’t provide the same level of graphs for MMAPv1 compared to what we provide for WiredTiger or RocksDB. There will likely be fewer additions to the graphing capabilities going forward.

If you are using a highly concurrent system, we highly recommend you upgrade to WiredTiger or RocksDB (both also covered in this monitoring series). These engines provide several solutions to MMAPv1 headaches: document-level locking, built-in compression, checkpointing that causes near-zero fragmentation on disk and much-improved visibility for monitoring. We just released Percona Server for MongoDB 3.4, and it provides many exciting features (including these engines).

Look out for more monitoring posts from this series!

Webinar Thursday March 2, 2017: MongoDB Query Patterns

February 27, 2017 - 5:06pm

Join Percona’s Senior Technical Services Engineer Adamo Tonete on Thursday, March 2, 2017, at 11:00 a.m. PST / 2:00 p.m. EST (UTC-8) as he reviews and discusses MongoDB® query patterns.

Register Now

MongoDB is a fast and simple-to-query schema-free database. It features a smart query optimizer that tries to use the easiest data retrieval method.

In this webinar, Adamo will discuss common query operators and how to use them effectively. The webinar will cover not only common query operations, but also the best practices for their usage.

Register for the webinar here.

Adamo Tonete, Senior Technical Services Engineer

Adamo joined Percona in 2015, after working as a MongoDB/MySQL Database Administrator for three years. As the main database member of a startup, he was responsible for suggesting the best architecture and data flows for a worldwide company in a 24/7 environment. Before that, he worked as a Microsoft SQL Server DBA for a large e-commerce company, mainly on performance tuning and automation. Adamo has almost eight years of experience working as a DBA, and in the past three years, he has moved to NoSQL technologies without giving up relational databases.

MySQL Ransomware: Open Source Database Security Part 3

February 27, 2017 - 2:28pm

This blog post examines the recent MySQL® ransomware attacks, and what open source database security best practices could have prevented them.

Unless you’ve been living under a rock, you know that there has been an uptick in ransomware for MongoDB and Elasticsearch deployments. Recently, we’re seeing the same for MySQL.

Let’s look and see if this is MySQL’s fault.

Other Ransomware Targets

Let’s briefly touch on how Elasticsearch and MongoDB became easy targets…

Elasticsearch

Elasticsearch® does not implement any access control: neither authentication nor authorization. For this, you need to deploy Elastic’s Shield offering. As such, if you have an Elasticsearch deployment that is addressable from the Internet, you’re asking for trouble. We see many deployments that have some authentication around their access, such as HTTP Basic Auth – though sadly, some don’t employ authentication or network isolation. We already wrote a blog about this here.

MongoDB

MongoDB (< 2.6.0) does allow for access control through account creation, but it binds to 0.0.0.0 by default (allowing access from anywhere). This default was changed in /etc/mongod.conf in versions >= 2.6.0. Often administrators don’t realize this or don’t know to look for it. (Using MongoDB? My colleague David Murphy wrote a post on this issue here).
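For MongoDB >= 2.6 with the YAML configuration format, a minimal sketch of the relevant /etc/mongod.conf settings (merge these keys into your existing file rather than appending them blindly):

# /etc/mongod.conf (YAML format) -- bind only to localhost and require authentication
net:
  bindIp: 127.0.0.1
security:
  authorization: enabled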

We began to see incidents where both Elasticsearch and MongoDB had their datasets removed and replaced with a README/note instructing the user to pay a ransom of 0.2BTC (Bitcoin) to the specified wallet address (if they wanted their data back).

MySQL

So is this latest (and similar) attack on MySQL the fault of MySQL itself? We don’t think so. By default, MySQL and Percona Server® for MySQL do not accept connections from everywhere without a password for the root user.

Let’s go over the various security options MySQL has, and describe some other best practices in order to protect your environment.

Default bind_address=127.0.0.1 in Percona Server for MySQL

MySQL currently still binds to 0.0.0.0 (listening on all network interfaces) by default. However, Percona Server for MySQL and Percona XtraDB Cluster use a different default and only bind on 127.0.0.1:3306 in their default configuration (Github pull request).

Recall, if you will, CVE-2012-2122. This ALONE should be enough to ensure that you, as the administrator, use best practices and ONLY allow access to the MySQL service from known good sources. Do not set up root-level or equivalent access from any host (% indicates any host is allowed). Ideally, you should only allow root access from 127.0.0.1 – or, if you must, from a subset of a secured network (e.g., 10.10.0.% would only allow access from 10.10.0.0/24).
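As a hedged illustration (the account name, password and subnet below are placeholders, not recommended values), a privileged grant restricted to the secured subnet from the example above looks like this:

GRANT ALL PRIVILEGES ON *.* TO 'admin'@'10.10.0.%' IDENTIFIED BY 'use-a-strong-password-here' WITH GRANT OPTION;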

Prevent Access

Also, does the MySQL database really need a publicly accessible IP address? If you do have a valid reason for this, then you should firewall port 3306 and whitelist access only from hosts that need to access the database directly. You can easily use iptables for this.
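A minimal iptables sketch, with illustrative addresses: accept port 3306 only from the application subnet and drop everything else.

iptables -A INPUT -p tcp --dport 3306 -s 10.10.0.0/24 -j ACCEPT
iptables -A INPUT -p tcp --dport 3306 -j DROP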

Default Users

MySQL DOES NOT by default create accounts that can be exploited for access. Sadly, such accounts appear later through an administrator’s lack of understanding. More often than not, the grant will look something like the following.

GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY '123456' WITH GRANT OPTION;

You may scoff at the above (and rightly so). However, don’t discount this just yet: “123456” was the MOST USED password in 2016! So it’s reasonable to assume that somewhere out there this is a reality.

Max Connection Errors

You can set max_connect_errors to a suitably low value to help mitigate a direct attack. This will not prevent a distributed attack, where many thousands of hosts are used. Network isolation is the only way to ensure mitigation against that attack vector.
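The variable is dynamic, so it can be lowered at runtime (the value below is illustrative; hosts blocked after too many interrupted connection attempts can be cleared with FLUSH HOSTS):

mysql -u root -p -e "SET GLOBAL max_connect_errors = 10;"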

MySQL 5.7 Improvements on Security

Default Root Password

Since MySQL 5.7, a random password is generated for the only root user (root@localhost) when you install MySQL for the first time. That password is then written in the error log and has to be changed. Miguel Ángel blogged about this before.

Connection Control Plugin

MySQL 5.7.17 introduced a new open source plugin called Connection Control. When enabled, it delays authentication for accounts that have failed to log in more than three consecutive times (by default). It is also included in Percona Server for MySQL 5.7.17.
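A minimal sketch of enabling it, using the plugin and variable names documented for MySQL 5.7 (the threshold shown is the default of three):

mysql -u root -p <<'EOF'
INSTALL PLUGIN CONNECTION_CONTROL SONAME 'connection_control.so';
INSTALL PLUGIN CONNECTION_CONTROL_FAILED_LOGIN_ATTEMPTS SONAME 'connection_control.so';
SET GLOBAL connection_control_failed_connections_threshold = 3;
EOF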

Here’s an example where the 4th consecutive try caused a one-second delay (default settings were used):

$ time mysql -u bleh2 -pbleh
ERROR 1045 (28000): Access denied for user 'bleh2'@'localhost' (using password: YES)

real    0m0.009s

$ time mysql -u bleh2 -pbleh
ERROR 1045 (28000): Access denied for user 'bleh2'@'localhost' (using password: YES)

real    0m0.008s

$ time mysql -u bleh2 -pbleh
ERROR 1045 (28000): Access denied for user 'bleh2'@'localhost' (using password: YES)

real    0m0.008s

$ time mysql -u bleh2 -pbleh
ERROR 1045 (28000): Access denied for user 'bleh2'@'localhost' (using password: YES)

real    0m1.008s

mysql> SELECT * FROM INFORMATION_SCHEMA.CONNECTION_CONTROL_FAILED_LOGIN_ATTEMPTS;
+---------------------+-----------------+
| USERHOST            | FAILED_ATTEMPTS |
+---------------------+-----------------+
| 'bleh2'@'localhost' |               4 |
+---------------------+-----------------+
1 row in set (0.01 sec)

Password Validation Plugin

MySQL 5.6.6 and later versions also ship with a password validation plugin, which prevents the creation of users with unsafe passwords (such as 123456) by ensuring passwords meet certain criteria: https://dev.mysql.com/doc/refman/5.7/en/validate-password-plugin.html
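A minimal sketch of loading the plugin on 5.6/5.7 and inspecting its settings:

mysql -u root -p <<'EOF'
INSTALL PLUGIN validate_password SONAME 'validate_password.so';
SHOW VARIABLES LIKE 'validate_password%';
EOF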

Summary

In order to get stung, one must ignore the best practices mentioned above (which in today’s world, should take some effort). These best practices include:

  1. Don’t use a publicly accessible IP address with no firewall configured
  2. Don’t use a root@% account, or other equally privileged access account, with poor MySQL isolation
  3. Don’t configure those privileged users with a weak password, allowing for brute force attacks against the MySQL service

Hopefully, these are helpful security tips for MySQL users. Comment below!

Percona Monitoring and Management (PMM) Graphs Explained: WiredTiger and Percona Memory Engine

February 27, 2017 - 9:34am

This post is part of the MongoDB 3.4 bundled release series of blog posts. In this blog, we’ll go over some useful metrics WiredTiger outputs and how we visualize them in Percona Monitoring and Management (PMM).

WiredTiger is the default storage engine for MongoDB since version 3.2. The addition of this full-featured, comprehensive storage engine offered a lot of new, useful metrics that were not available before in MMAPv1.

Percona Monitoring and Management (PMM)

Percona Monitoring and Management (PMM) is an open-source platform for managing and monitoring MySQL and MongoDB, developed by Percona on top of open-source technology. Behind the scenes, the graphing features this article covers use Prometheus (a popular time-series data store), Grafana (a popular visualization tool), mongodb_exporter (our MongoDB database metric exporter) plus other technologies to provide database and operating system metric graphs for your database instances.

Please see a live demo of our PMM 1.1.1 release of the MongoDB WiredTiger graphs covered in this article: https://pmmdemo.percona.com/graph/dashboard/db/mongodb-wiredtiger.

You can see a sneak peek demo of the Percona Memory Engine graphs we’ll release in PMM 1.1.2 here: https://pmmdemo.percona.com/graph/dashboard/db/mongodb-inmemory.

WiredTiger and Percona Memory Engine

WiredTiger is a storage engine that was developed outside of MongoDB, and was acquired and integrated into MongoDB in version 3.0. WiredTiger offers document-level locking, inline compression and many other useful storage engine features. WiredTiger writes data to disk in “checkpoints” and internally uses Multi-Version Concurrency Control (MVCC) to create “transactions” or “snapshots” when accessing data in the engine. In WiredTiger’s metrics, you will see the term “transactions” used often. It is important to note, however, that MongoDB does not support transactions at this time (this only occurs within the storage engine).

WiredTiger has an in-heap cache for mostly uncompressed pages (50% RAM by default). Like many other engines, it relies on the performance of the Linux filesystem cache, which ends up caching hot, compressed WiredTiger disk blocks.

Besides supporting WiredTiger, Percona Server for MongoDB also ships with a free, open-source in-memory storage engine: Percona Memory Engine for MongoDB. Since we based the Memory Engine on WiredTiger, all graphs and troubleshooting techniques for in-memory are essentially the same (the database data is not stored on disk, of course).

Checkpointing Graphs

WiredTiger checkpoints data to disk every 60 seconds, or after writing 2GB of journaled data.

PMM graphs the current minimum and maximum checkpoint times for WiredTiger checkpoints in the “WiredTiger Checkpoint Time” graph:

Above I have selected “current”, and we can see an average checkpoint time of 176ms that stays flat over a long period, neither worsening nor “snowballing” from one checkpoint to the next (which would indicate a performance issue).

Checkpointing is important to watch because it requires WiredTiger to use system resources, and it can also affect query performance in a possibly unexpected way, through WiredTiger Cache dirty pages:

The WiredTiger Cache is an LRU cache of mostly uncompressed pages. Like most caches, it creates dirty pages that can take up useful memory until flushed. The WiredTiger Cache uses checkpointing as the point in which it clears dirty pages, making the relationship between dirty pages and checkpointing important to note. WiredTiger cleans dirty pages less often if checkpoint performance is slow. They then can slowly consume more and more of the available cache memory.

In the above graph, we can see on average about 8.8% of the cache is dirty pages with spikes up/down aligning with checkpointing. Systems with a very high rate of dirty pages benefit from more RAM to provide more room for “clean” pages. Another option could be improving storage performance, so checkpoints happen faster.

Concurrency Graph

Similar to InnoDB, WiredTiger uses a system of tickets to control concurrency. Where things differ from InnoDB is that “reads” and “writes” each have their own ticket pool with its own maximum-ticket limit. The default of “128” tickets each for read and write concurrency is generally enough for even medium-to-high usage systems. Some systems are capable of more than the default concurrency limit, however (usually systems with very fast storage). Higher concurrency can also sometimes reduce overhead on network-based storage.

If you notice higher ticket usage, it can sometimes be due to a lot of single-document locking in WiredTiger. This is something to check if you see high rates alongside storage performance and general query efficiency.

In Percona Monitoring and Management, we have the “WiredTiger Concurrent Transactions” graph to visualize the usage of the tickets. In most cases, tickets shouldn’t reach the limit and you shouldn’t need to tweak this tuneable. If you do require more concurrency, however, PMM’s graphing helps indicate when limits are being reached and whether a new limit will mitigate the problem.

Here we can see a max usage of 8/128 write tickets and 5/128 read tickets. This means this system isn’t having any concurrency issues.
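The counters behind this graph, and the parameters that raise the limits, can also be checked from the shell. A minimal sketch, assuming a mongod on the default port (the value 256 is purely illustrative; as noted above, most systems should not need to change it):

echo 'printjson(db.serverStatus().wiredTiger.concurrentTransactions)' | mongo localhost:27017/admin
echo 'db.adminCommand({ setParameter: 1, wiredTigerConcurrentReadTransactions: 256 })' | mongo localhost:27017/admin
# the matching write-side parameter is wiredTigerConcurrentWriteTransactions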

Throughput Graphs

There are several WiredTiger graphs to explain the rate of data moving through the engine. As storage is a common bottleneck, I generally look at “WiredTiger Block Activity” first when investigating storage resource usage. This graph shows the total rates written and read to/from storage by WiredTiger (disk for WiredTiger, memory for in-memory).

For correlation, there are also rates for the amount of data written from and read into the WiredTiger cache, from disk. The “read” metric shows the rate of data added to the cache due to query patterns (e.g.: scanning), while the “written” metric shows the rate of data written out to storage from the WiredTiger cache.

Also, there are rates to explain the IO caused by the WiredTiger Log. The metric “payload” is essentially the write rate of raw BSON pages, and “written” is the combined total of log bytes written (including overhead, likely the frames around the payload, etc.). You should watch changes to the average rate of “read” carefully, as they may indicate changes in query patterns or efficiency.

Detailed Cache Graphs

In addition to the Dirty Pages in the cache graph, “WiredTiger Cache Capacity” graphs the size and usage of the WiredTiger cache:

The rate of cache eviction is graphed in “WiredTiger Cache Eviction,” with a break down of modified vs. unmodified pages:

Very large spikes in eviction can indicate collection scanning or generally poorly performing queries, which push data out of the cache. You should avoid high rates of cache eviction, as it adds overhead to the overall engine.

When increasing the size of the WiredTiger cache it is useful to look at both of the above cache graphs. You should look for more “Used” memory in the “WiredTiger Cache Capacity” graph and less rate of eviction in the “WiredTiger Cache Eviction” graph. If you do not see changes to these metrics, you may see better performance leaving the cache size as-is.
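If you do decide to change it, the WiredTiger cache size is set at mongod startup (the default is roughly 50% of RAM, as noted earlier). A minimal sketch, where the 8GB value and dbpath are illustrative:

mongod --storageEngine wiredTiger --wiredTigerCacheSizeGB 8 --dbpath /data/db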

Transactions and Document Operations

The “WiredTiger Transactions” graph shows the overall operations happening inside the engine. All transactions start with a “begin,” and operations that changed data end with a “commit.” Read-only operations show a “rollback” at the time they returned data:

This graph above correlates nicely with the “Mongod – Document Activity” graph, which shows the rate of operations from the MongoDB-layer perspective instead of the storage engine level:

Detailed Log Graphs

The graph “WiredTiger Log Operations” explains activity inside the WiredTiger Log system:

Also, the rate of log record compression is graphed as “WiredTiger Log Records.” WiredTiger only compresses log operations that are greater than 128 bytes, which explains why some log records are not compressed:

In some cases, changes in the ratio of compressed vs. uncompressed pages may help explain changes in CPU% used.

What’s Missing?

As you’ll see in my other blog post “Percona Monitoring and Management (PMM) Graphs Explained: MongoDB with RocksDB” from this series, RocksDB includes read latency metrics and a hit ratio for the RocksDB block cache. These are two things I would like to see added to WiredTiger’s metric output, and thus PMM. I would also like to improve the user-experience of this dashboard. Some areas use linear-scaled graphs when a logarithmic-scaled graph could provide more value. “WiredTiger Concurrent Transactions” is one example of this.

A known mystery (so to speak) is why WiredTiger always reports the cache “percentage overhead” as 8% in “db.serverStatus().cache.” We added this metric to PMM as a graph named “WiredTiger Cache Overhead,” assuming it provided a variable overhead metric. However, I’ve seen that it returns 8% regardless of usage: it is 8% on a busy system or even on an empty system with no data or traffic. We’re aware of this, and plan to investigate, as a hit ratio for the cache is a very valuable metric:

Also, if you’ve ever seen the full output of the WiredTiger status metrics (‘db.serverStatus().wiredTiger’ in Mongo shell), you’ll know that there are a LOT more WiredTiger metrics than are currently graphed in Percona Monitoring and Management. In our initial release, we’ve aimed to only include high-value graphs to simplify monitoring WiredTiger. A major barrier in our development of monitoring features for WiredTiger has been the little-to-no documentation on the meaning of many status metrics. I hope this improves with time. As we understand more correlations and useful metrics to determine the health of WiredTiger, we plan to integrate those into Percona Monitoring and Management in the future. As always, we appreciate your suggestions.

Lastly, look out for an upcoming blog post from this series regarding creating custom dashboards, graphs and raw data queries with Percona Monitoring and Management!
