In this blog post, I’ll look for the bottleneck that prevented the performance in my previous post from achieving better results.
The powerful machine I used in the tests in my previous post has a comparatively slow disk, and therefore I expected my tests would hit a point when I couldn’t increase performance further due to the disk speed.
Processors: physical = 4, cores = 72, virtual = 144, hyperthreading = yes
Disk speed: about 3K IOPS
OS: CentOS 7.1.1503
File system: XFS
Versions tested and configuration: same as in the first post of this series (check the post for specifics).
Even though I expected my tests would stop increasing in performance due to the disk speed, I did not observe high IO rates in the iostat output. I already tested with a full data set that fits in memory. In this case, write performance only affected data flushes and log writes. But we should still see a visible decrease in speed. So I decided to try RW tests totally in memory. I created a ramdisk and put the MySQL datadir on it. Surprisingly, results on the SSD and ramdisk did not differ.
I asked my colleagues from “Postgres Professional” to test PostgreSQL with the ramdisk. They got similar results:
It’s interesting that the value of innodb_io_capacity does not have any effect on this situation. Data for the graph below was taken when I ran tests on ramdisk. I wanted to see if I could control the IO activity of a disk, which is extremely fast by default, using this variable.
This totally contradicts all my past experiences with smaller machines. Percona re-purposed the machine with a faster disk (which I used before, described in this post), so I used a similar one with slower disk speed.
Processors: physical = 2, cores = 12, virtual = 24, hyperthreading = yes
Disk speed: about 3K IOPS
OS: Ubuntu 14.04.5 LTS (trusty)
File system: ext4
Again, in this case innodb_io_capacity benchmarks with a smaller number of CPU cores showed more predictable results.
Both MySQL and PostgreSQL on a machine with a large number of CPU cores hit CPU resources limits before disk speed can start affecting performance. We only tested one scenario, however. With other scenarios, the results might be different.
This post is part of the series of Percona’s MongoDB 3.4 bundle release blogs. In this blog post, I hope to cover some areas to watch with Percona Monitoring and Management (PMM) when running MMAPv1. The graph examples from this article are from the MMAPv1 dashboard that will be released for the first time in PMM 1.1.2.
Since the very beginning of MongoDB, the MMAPv1 storage engine has existed. MongoDB 3.0 added a pluggable storage engine API. You could only use MMAPv1 with MongoDB before that. While MMAPv1 often offers good read performance, it has become famous for its poor write performance and fragmentation at scale. This means there are many areas to watch for regarding performance and monitoring.
Percona Monitoring and Management (PMM)
Percona Monitoring and Management (PMM) is an open-source platform for managing and monitoring MySQL and MongoDB. It was developed by Percona on top of open-source technology. Behind the scenes, the graphing features this article covers use Prometheus (a popular time-series data store), Grafana (a popular visualization tool), mongodb_exporter (our MongoDB database metric exporter) plus other technologies to provide database and operating system metric graphs for your database instances.
(Beware of) MMAPv1
mmap() is a system-level call that causes the operating system kernel to map on-disk files to memory while it is being read and written by a program.
As mmap() is a core feature of the Unix/Linux operating system kernel (and not the MongoDB code base), I’ve always felt that calling MMAPv1 a “storage engine” is quite misleading, although it does allow for a simpler explanation. The distinction and drawbacks of the storage logic being in the operating system kernel vs. the actual database code (like most database storage engines) becomes very important when monitoring MMAPv1.
As Unix/Linux are general-purpose operating systems that can have many processes, users and uses cases, they offer limited OS-level metrics in terms of activity, latency and performance of mmap(). Those metrics are for the entire operating system, not just for the MongoDB processes.
mmap() uses memory from available OS-level buffers/caches for mapping the MMAPv1 data to RAM — memory that can be “stolen” away by any other operating system process that asks for it. As many deployments “micro-shard” MMAPv1 to reduce write locks, this statement can become exponentially more important. If 3 x MongoDB instances run on a single host, the kernel fights to cache and evict memory pages created by 3 x different instances with no priority or queuing, essentially at random, while creating contention. This causes inefficiencies and less-meaningful monitoring values.
When monitoring MMAPv1, you should consider MongoDB AND the operating system as one “component” more than most engines. Due to this, it is critical that a database host runs a single MongoDB instance with no other processes except database monitoring tools such as PMM’s client. This allows MongoDB to be the only user of the operating system filesystem cache that MMAPv1 relies on. This also makes OS-level memory metrics more accurate because MongoDB is the only user of memory. If you need to “micro-shard” instances, I recommend using containers (Docker or plain cgroups) or virtualization to separate your memory for each MongoDB instance, with just one MongoDB instance per container.
MMAPv1’s has locks for both reads and writes. In the early days the lock was global only. Locking became per-database in v2.2 and per-collection in v3.0.
Locking is the leading cause of the performance issues we see on MMAPv1 systems, particularly write locking. To measure how much locking an MMAPv1 instance is waiting on, first we look at the “MMAPv1 Lock Ratio”:
Another important metric to watch is “MongoDB Lock Wait Time”, breaking down a number of time operations spend waiting on locks:
Three factors in combination influence locking:
Page faults happen when MMAPv1 data is not available in the cache and needs to be fetched from disk. On systems with data that is smaller than memory page faults usually only occur on reboot, or if the file system cache is dumped. On systems where data exceeds memory, this happens more frequently — MongoDB is asked for data not in memory.
How often this happens depends on how your application accesses your data. If it accesses new or frequently-queried data, it is more likely to be in memory. If it accesses old or infrequent data, more page faults occur.
If page faults suddenly start occurring, check to see if your data set has grown beyond the size of memory. You may be able to reduce your data set by removing fragmentation (explained later).
As MMAPv1 eventually flushes changes to disk in batches, journaling is essential for running MongoDB with any real data integrity guarantees. As well as being included in the lock statistic graphs mentioned above, there are some good metrics for journaling (which is a heavy consumer of disk writes).
Here we have “MMAPv1 Journal Write Activity”, showing the data rates of journaling (max 19MB/sec):
“MMAPv1 Journal Commit Activity” measures the commits to the journal ops/second:
A very useful metric for write query performance is “MMAPv1 Journaling Time” (there is another graph with 99th percentile times):
This is important to watch, as write operations need to wait for a journal commit. In the above example, “write_to_journal” and “write_to_data_files” are the main metrics I tend to look at. “write_to_journal” is the rate of changes being written to the journal, and “write_to_data_files” is the rate that changes are written to on-disk data.
If you see very high journal write times, you may need faster disks or in-sharding scenarios. Adding more shards spreads out the disk write load.
“MMAPv1 Background Flushing Time” graphs the background operation that calls flushes to disk:
This process does not block the database, but does cause more disk activity.
Due to the way MMAPv1 writes to disk, it creates a high rate of fragmentation (or holes) in its data files. Fragmentation slows down scan operations, wastes some filesystem cache memory and can use much more disk space than there is actual data. On many systems I’ve seen, the size of MMAPv1 data files on disk take over twice the true data size.
Currently, our Percona Monitoring and Management MMAPv1 support does not track this, but we plan to add it in the future.
To track it manually, look at the output of the “.stats()” command for a given collection (replace “sbtest1” with your collection name):> 1 - ( db.sbtest1.stats().size / db.sbtest1.stats().storageSize ) 0.14085410557184752
Here we can see this collection is about 14% fragmented on disk. To fix fragmentation, the most common fix is dropping and recreating the collection using a backup. Many just remove a replication member, clear the data and let it do a new replication initial sync.
Operating System Memory
In PMM we have graphed the operating system cached memory as it acts as the primary cache for MMAPv1:
For the most part, “Cached” is the value showing the amount of data that is cached MMAPv1 data (assuming the host is only running MongoDB).
We also graph the dirty memory pages:
It is important that dirty pages do not exceed the hard dirty page limit (which causes pauses). It is also important that dirty pages don’t accumulate (which wastes cache memory). The “soft” dirty page limit is the limit that starts dirty page cleanup without pausing.
On this host, you could probably lower the soft limit to clean up memory faster, assuming the increase in disk activity is acceptable. This topic is covered in this post: https://www.percona.com/blog/2016/08/12/tuning-linux-for-mongodb/.
As mentioned earlier, fragmentation rates are missing for MMAPv1 (this would be a useful addition). Due to the limited nature of the metrics offered for MMAPv1, PMM probably won’t provide the same level of graphs for MMAPv1 compared to what we provide for WiredTiger or RocksDB. There will likely be fewer additions to the graphing capabilities going forward.
If you are using a highly concurrent system, we highly recommend you upgrade to WiredTiger or RocksDB (both also covered in this monitoring series). These engines provide several solutions to MMAPv1 headaches: document-level locking, built-in compression, checkpointing that cause near-zero fragmentation on disk and much-improved visibility for monitoring. We just released Percona Server for MongoDB 3.4, and it provides many exciting features (including these engines).
Look out for more monitoring posts from this series!
MongoDB is a fast and simple-to-query schema-free database. It features a smart query optimizer that tries to use the easiest data retrieval method.
In this webinar, Adamo will discuss common query operators and how to use them effectively. The webinar will cover not only common query operations, but also the best practices for their usage.Adamo Tonete, Senior Technical Services Engineer
Adamo joined Percona in 2015, after working as a MongoDB/MySQL Database Administrator for three years. As the main database member of a startup, he was responsible for suggesting the best architecture and data flows for a worldwide company in a 24/7 environment. Before that, he worked as a Microsoft SQL Server DBA for a large e-commerce company, mainly on performance tuning and automation. Adamo has almost eight years of experience working as a DBA, and in the past three years, he has moved to NoSQL technologies without giving up relational databases.
Unless you’ve been living under a rock, you know that there has been an uptick in ransomware for MongoDB and Elasticsearch deployments. Recently, we’re seeing the same for MySQL.
Let’s look and see if this is MySQL’s fault.Other Ransomware Targets
Let’s briefly touch on how Elasticsearch and MongoDB became easy targets…Elasticsearch
Elasticsearch® does not implement any access control: neither authentication nor authorization. For this, you need to deploy the Elastic’s shield offering. As such, if you have an Elasticsearch deployment that is addressable from the Internet, you’re asking for trouble. We see many deployments have some authentication around their access, such as HTTP Basic Auth – though sadly, some don’t employ authentication or network isolation. We already wrote a blog about this here.MongoDB
MongoDB (< 2.6.0) does allow for access control through account creation. It binds to 0.0.0.0 by default (allowing access from anywhere). This is now changed in /etc/mongod.conf in versions >= 2.6.0. Often administrators don’t realize or don’t know to look for this. (Using MongoDB? My colleague David Murphy wrote a post on this issue here).
We began to see incidents where both Elasticsearch and MongoDB had their datasets removed and replaced with a README/note instructing the user to pay a ransom of 0.2BTC (Bitcoin) to the specified wallet address (if they wanted their data back).MySQL
So is this latest (and similar) attack on MySQL MySQL’s fault? We don’t think so. MySQL and Percona Server® for MySQL by default do not accept authentication from everywhere without a password for the root user.
Let’s go over the various security options MySQL has, and describe some other best practices in order to protect your environment.Default bind_address=127.0.0.1 in Percona Server for MySQL
MySQL currently still binds to 0.0.0.0 (listen to all network interfaces) by default. However, Percona Server for MySQL and Percona XtraDB Cluster have different defaults, and only bind on 127.0.0.1:3306 in its default configuration (Github pull request).
Recall, if you will, CVE-2012-2122. This ALONE should be enough to ensure that you as the administrator use best practices, and ONLY allow access to the MySQL service from known good sources. Do not setup root level or equivalent access from any host (% indicates any host is allowed). Ideally, you should only allow root access from 127.0.0.1 – or if you must, from a subset of a secured network (e.g., 10.10.0.% would only allow access to 10.10.0.0/24).Prevent Access
Also, does the MySQL database really need a publicly accessible IP address? If you do have a valid reason for this, then you should firewall port 3306 and whitelist access only from hosts that need to access the database directly. You can easily use iptables for this.Default Users
MySQL DOES NOT by default create accounts that can be exploited for access. This comes later through an administrator’s lack of understanding, sadly. More often than not, the grant will look something like the following.GRANT ALL PRIVILEGES TO 'root'@'%' IDENTIFIED BY '123456' WITH GRANT OPTION;
You may scoff at the above (and rightly so). However, don’t discount this just yet: “123456” was the MOST USED password in 2016! So it’s reasonable to assume that somewhere out there this is a reality.Max Connection Errors
You can deploy max_connection_errors with a suitably low value to help mitigate a direct attack. This will not prevent a distributed attack, where many thousands of hosts are used. Network isolation is the only way to ensure your mitigation against this attack vector.MySQL 5.7 Improvements on Security Default Root Password
Since MySQL 5.7, a random password is generated for the only root user (root@localhost) when you install MySQL for the first time. That password is then written in the error log and has to be changed. Miguel Ángel blogged about this before.Connection Control Plugin
MySQL 5.7.17 introduced a new open source plugin called Connection Control. When enabled, it delays the authentication of users that failed to login by default more than three times. This is also part as of Percona Server for MySQL 5.7.17.
Here’s an example where the 4th consecutive try caused a one-second delay (default settings were used):$ time mysql -u bleh2 -pbleh ERROR 1045 (28000): Access denied for user 'bleh2'@'localhost' (using password: YES) real 0m0.009s $ time mysql -u bleh2 -pbleh ERROR 1045 (28000): Access denied for user 'bleh2'@'localhost' (using password: YES) real 0m0.008s $ time mysql -u bleh2 -pbleh ERROR 1045 (28000): Access denied for user 'bleh2'@'localhost' (using password: YES) real 0m0.008s $ time mysql -u bleh2 -pbleh ERROR 1045 (28000): Access denied for user 'bleh2'@'localhost' (using password: YES) real 0m1.008s mysql> SELECT * FROM INFORMATION_SCHEMA.CONNECTION_CONTROL_FAILED_LOGIN_ATTEMPTS; +---------------------+-----------------+ | USERHOST | FAILED_ATTEMPTS | +---------------------+-----------------+ | 'bleh2'@'localhost' | 4 | +---------------------+-----------------+ 1 row in set (0.01 sec) Password Validation Plugin
MySQL 5.6.6 and later versions also ship with a password validation plugin, which prevents creating users with unsafe passwords (such as 123456) by ensuring passwords meet certain criteria: https://dev.mysql.com/doc/refman/5.7/en/validate-password-plugin.htmlSummary
In order to get stung, one must ignore the best practices mentioned above (which in today’s world, should take some effort). These best practices include:
Hopefully, these are helpful security tips for MySQL users. Comment below!
This post is part of the MongoDB 3.4 bundled release series of blog posts. In this blog, we’ll go over some useful metrics WiredTiger outputs and how we visualize them in Percona Monitoring and Management (PMM).
WiredTiger is the default storage engine for MongoDB since version 3.2. The addition of this full-featured, comprehensive storage engine offered a lot of new, useful metrics that were not available before in MMAPv1.
Percona Monitoring and Management (PMM)
Percona Monitoring and Management (PMM) is an open-source platform for managing and monitoring MySQL and MongoDB, developed by Percona on top of open-source technology. Behind the scenes, the graphing features this article covers use Prometheus (a popular time-series data store), Grafana (a popular visualization tool), mongodb_exporter (our MongoDB database metric exporter) plus other technologies to provide database and operating system metric graphs for your database instances.
Please see a live demo of our PMM 1.1.1 release of the MongoDB WiredTiger graphs covered in this article: https://pmmdemo.percona.com/graph/dashboard/db/mongodb-wiredtiger.
You can see a sneak peak demo of our Percona Memory Engine graphs we’ll release in PMM 1.1.2 here: https://pmmdemo.percona.com/graph/dashboard/db/mongodb-inmemory.
WiredTiger and Percona Memory Engine
WiredTiger is a storage engine that was developed outside of MongoDB, and was acquired and integrated into MongoDB in version 3.0. WiredTiger offers document-level locking, inline compression and many other useful storage engine features. WiredTiger writes data to disk in “checkpoints” and internally uses Multi-Version Concurrency Control (MVCC) to create “transactions” or “snapshots” when accessing data in the engine. In WiredTiger’s metrics, you will see the term “transactions” used often. It is important to note, however, that MongoDB does not support transactions at this time (this only occurs within the storage engine).
WiredTiger has an in-heap cache for mostly uncompressed pages (50% RAM by default). Like many other engines, it relies on the performance of the Linux filesystem cache, which ends up caching hot, compressed WiredTiger disk blocks.
Besides supporting WiredTiger, Percona Server for MongoDB also ships with a free, open-source in-memory storage engine: Percona Memory Engine for MongoDB. Since we based the Memory Engine on WiredTiger, all graphs and troubleshooting techniques for in-memory are essentially the same (the database data is not stored on disk, of course).
WiredTiger checkpoints data to disk every 60 seconds, or after writing 2GB of journaled data.
PMM graphs the current minimum and maximum checkpoint times for WiredTiger checkpoints in the “WiredTiger Checkpoint Time” graph:
Above I have selected “current,” and we can see we have an average of 176ms checkpoints and over a long period it remains flat, not worsening or “snowballing” each checkpoint (which may indicate a performance issue).
Checkpointing is important to watch because it requires WiredTiger to use system resources, and also can affect query performance in an possibly unexpected way — WiredTiger Cache dirty pages:
The WiredTiger Cache is an LRU cache of mostly uncompressed pages. Like most caches, it creates dirty pages that can take up useful memory until flushed. The WiredTiger Cache uses checkpointing as the point in which it clears dirty pages, making the relationship between dirty pages and checkpointing important to note. WiredTiger cleans dirty pages less often if checkpoint performance is slow. They then can slowly consume more and more of the available cache memory.
In the above graph, we can see on average about 8.8% of the cache is dirty pages with spikes up/down aligning with checkpointing. Systems with a very high rate of dirty pages benefit from more RAM to provide more room for “clean” pages. Another option could be improving storage performance, so checkpoints happen faster.
Similar to InnoDB, WiredTiger uses a system of tickets to control concurrency. Where things differ from InnoDB is both “reads” and “writes” have their own ticket pools with their own maximum-ticket limits. The defaults of “128” tickets for both read and write concurrency is generally enough for even medium-high usage systems. Some systems are capable of more than the default concurrency limit, however (usually systems with very fast storage). Also, concurrency can sometimes reduce overhead on network-based storage.
If you notice higher ticket usage, it can sometimes be due to a lot of single-document locking in WiredTiger. This is something to check if you see high rates alongside storage performance and general query efficiency.
In Percona Monitoring and Management, we have the “WiredTiger Concurrent Transactions” graph to visualize the usage of the tickets. In most cases, tickets shouldn’t reach the limit and you shouldn’t need to tweak this tuneable. If you do require more concurrency, however, PMM’s graphing helps indicate when limits are being reached and whether a new limit will mitigate the problem.
Here we can see a max usage of 8/128 write tickets and 5/128 read tickets. This means this system isn’t having any concurrency issues.
There are several WiredTiger graphs to explain the rate of data moving through the engine. As storage is a common bottleneck, I generally look at “WiredTiger Block Activity” first when investigating storage resource usage. This graph shows the total rates written and read to/from storage by WiredTiger (disk for WiredTiger, memory for in-memory).
For correlation, there are also rates for the amount of data written from and read into the WiredTiger cache, from disk. The “read” metric shows the rate of data added to the cache due to query patterns (e.g.: scanning), while the “written” metric shows the rate of data written out to storage from the WiredTiger cache.
Also there are rates to explain the IO caused by the WiredTiger Log. The metric “payload” is the essentially the write rate of raw BSON pages, and “written” is a combined total of log bytes written (including overhead, likely the frames around the payload, etc.). You should watch changes to the average rate of “read” carefully, as they may indicate changes in query patterns or efficiency.
Detailed Cache Graphs
In addition to the Dirty Pages in the cache graph, “WiredTiger Cache Capacity” graphs the size and usage of the WiredTiger cache:
The rate of cache eviction is graphed in “WiredTiger Cache Eviction,” with a break down of modified vs. unmodified pages:
Very large spikes in eviction can indicate collection scanning or generally poor performing queries. This pushes data out of caches. You should avoid high rates of cache evictions, as they can cause a high overhead to the overall engine.
When increasing the size of the WiredTiger cache it is useful to look at both of the above cache graphs. You should look for more “Used” memory in the “WiredTiger Cache Capacity” graph and less rate of eviction in the “WiredTiger Cache Eviction” graph. If you do not see changes to these metrics, you may see better performance leaving the cache size as-is.
Transactions and Document Operations
The “WiredTiger Transactions” graph shows the overall operations happening inside the engine. All transactions start with a “begin,” and operations that changed data end with a “commit.” Read-only operations show a “rollback” at the time they returned data:
This graph above correlates nicely with the “Mongod – Document Activity” graph, which shows the rate of operations from the MongoDB-layer perspective instead of the storage engine level:
Detailed Log Graphs
The graph “WiredTiger Log Operations” explains activity inside the WiredTiger Log system:
Also, the rate of log record compression is graphed as “WiredTiger Log Records.” WiredTiger only compresses log operations that are greater than 128 bytes, which explains why some log records are not compressed:
In some cases, changes in the ratio of compressed vs. uncompressed pages may help explain changes in CPU% used.
As you’ll see in my other blog post “Percona Monitoring and Management (PMM) Graphs Explained: MongoDB with RocksDB” from this series, RocksDB includes read latency metrics and a hit ratio for the RocksDB block cache. These are two things I would like to see added to WiredTiger’s metric output, and thus PMM. I would also like to improve the user-experience of this dashboard. Some areas use linear-scaled graphs when a logarithmic-scaled graph could provide more value. “WiredTiger Concurrent Transactions” is one example of this.
A known-mystery (so-to-speak) is why WiredTiger reports the cache “percentage overhead” always as 8% in “db.serverStatus().cache.” We added this metric to PMM as a graph named “WiredTiger Cache Overhead.” We assumed it provided a variable overhead metric. However, I’ve seen that it returns 8% regardless of usage: it is 8% on a busy system or even on an empty system with no data or traffic. We’re aware of this, and plan to investigate, as a hit ratio for the cache is a very valuable metric:
Also, if you’ve ever seen the full output of the WiredTiger status metrics (‘db.serverStatus().wiredTiger’ in Mongo shell), you’ll know that there are a LOT more WiredTiger metrics than are currently graphed in Percona Monitoring and Management. In our initial release, we’ve aimed to only include high-value graphs to simplify monitoring WiredTiger. A major barrier in our development of monitoring features for WiredTiger has been the little-to-no documentation on the meaning of many status metrics. I hope this improves with time. As we understand more correlations and useful metrics to determine the health of WiredTiger, we plan to integrate those into Percona Monitoring and Management in the future. As always, we appreciate your suggestions.
Lastly, look out for an upcoming blog post from this series regarding creating custom dashboards, graphs and raw data queries with Percona Monitoring and Management!
For general inquiries, please send us your question and someone will contact you.