Scaling Percona Monitoring and Management (PMM)


Starting with PMM 1.13, PMM uses Prometheus 2 for metrics storage, which tends to be the heaviest consumer of CPU and RAM. With Prometheus 2 performance improvements, PMM can scale to more than 1000 monitored nodes per instance in the default configuration. In this blog post we will look into PMM scaling and capacity planning—how to estimate the resources required, and what drives resource consumption.

PMM tested with 1000 nodes

We have now tested PMM with up to 1000 nodes, using a virtualized system with 128GB of memory, 24 virtual cores, and SSD storage. We found PMM scales pretty linearly with the available memory and CPU cores, and we believe that a higher number of nodes could be supported with more powerful hardware.

What drives resource usage in PMM?

Depending on your system configuration and workload, a single node can generate very different loads on the PMM server. The main factors that impact the performance of PMM are:

  1. Number of samples (data points) injected into PMM per second
  2. Number of distinct time series they belong to (cardinality)
  3. Number of distinct query patterns your application uses
  4. Number of queries you run against PMM, through the user interface or the API, and their complexity

These specifically can be impacted by:

  • Software version – modern database software versions expose more metrics
  • Software configuration – some metrics are only exposed in certain configurations
  • Workload – a large number of database objects and high concurrency will increase both the number of samples ingested and their cardinality
  • Exporter configuration – disabling collectors can reduce the amount of data collected
  • Scrape frequency – controlled by METRICS_RESOLUTION

All these factors together may impact resource requirements by a factor of ten or more, so do your own testing to be sure. However, the numbers in this article should serve as good general guidance and a starting point for your research.
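To get a feel for how these factors combine, here is a minimal sketch of the ingestion load a fleet places on Prometheus. The per-node figures are illustrative assumptions, not PMM defaults—actual values depend on software version, configuration, and workload, as described above.

```python
# Rough estimate of Prometheus ingestion load for a monitored fleet.
# series_per_node and scrape_interval_s are illustrative assumptions.

def samples_per_second(nodes, series_per_node, scrape_interval_s):
    """Each active time series yields one sample per scrape."""
    return nodes * series_per_node / scrape_interval_s

# e.g. 100 nodes, ~2,000 time series each, scraped every second
load = samples_per_second(nodes=100, series_per_node=2000, scrape_interval_s=1)
print(f"{load:,.0f} samples/sec")
```

Halving the scrape frequency (via METRICS_RESOLUTION) halves the samples ingested per second, which is why it is one of the simplest levers for reducing load.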

On the system supporting 1000 instances we observed the following performance:

[Screenshot: Prometheus performance statistics under the 1000-node load]

As you can see, we have more than 2,000 scrapes/sec performed, providing almost two million samples/sec, and more than eight million active time series. These are the main numbers that define the load placed on Prometheus.

Capacity planning to scale PMM

Both CPU and memory are very important resources for PMM capacity planning. Memory is the more important of the two, as Prometheus 2 does not have good options for limiting memory consumption: if there is not enough memory to handle your workload, it will run out of memory and crash.

We recommend at least 2GB of memory for a production PMM installation. A test installation with 1GB of memory is possible, but it may not be able to monitor more than one or two nodes without running out of memory. With 2GB of memory you should be able to monitor at least five nodes without problems.

With powerful systems (8GB or more) you can have approximately eight systems per 1GB of memory, or about 15,000 samples ingested/sec per 1GB of memory.
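These rules of thumb line up with the 1000-node test described earlier, which can be checked with a quick calculation:

```python
# Cross-check of the memory rules of thumb against the 1000-node test:
# 1000 nodes, 128GB of RAM, almost two million samples/sec ingested.
test_nodes = 1000
test_memory_gb = 128
test_samples_per_sec = 2_000_000

nodes_per_gb = test_nodes / test_memory_gb              # ~7.8, i.e. "about eight"
samples_per_gb = test_samples_per_sec / test_memory_gb  # ~15,625, i.e. "about 15,000"
print(f"{nodes_per_gb:.1f} nodes/GB, {samples_per_gb:,.0f} samples/sec per GB")
```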

To calculate the CPU usage resources required, allow for about 50 monitored systems per core (or 100K metrics/sec per CPU core).
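Putting the memory and CPU guidelines together, here is a sketch of a simple capacity planner. The function name and structure are my own; the constants come from the rules of thumb above, and the output should be treated as a starting point, not a guarantee.

```python
import math

# Rules of thumb from this post, valid for larger (8GB+) systems:
# ~8 monitored systems per 1GB of RAM, ~50 systems per CPU core.
NODES_PER_GB = 8
NODES_PER_CORE = 50

def pmm_server_size(nodes):
    """Return (memory_gb, cpu_cores) suggested for a given node count."""
    memory_gb = max(2, math.ceil(nodes / NODES_PER_GB))  # 2GB production floor
    cores = math.ceil(nodes / NODES_PER_CORE)
    return memory_gb, cores

print(pmm_server_size(1000))  # in line with the 128GB / 24-core test box
```

For 1000 nodes this suggests 125GB of memory and 20 cores, consistent with the virtualized system used in the test.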

One problem you’re likely to encounter if you’re running PMM with 100+ instances is the Home Dashboard, which becomes far too heavy with such a large number of servers. We plan to fix this issue in future releases of PMM, but for now you can work around it in two simple ways:

You can select a single host, for example “pmm-server”, in your Home Dashboard and save it before adding a large number of hosts to the system.

[Screenshot: setting the home dashboard for PMM]

Or you can make some other dashboard of your choice and set it as the home dashboard.

Summary

  • A single PMM server can support more than 1,000 monitored systems
  • Your specific workload and configuration may significantly change the resources required
  • If deploying with 8GB or more, plan 50 systems per core, and eight systems per 1GB of RAM

Comments (5)

  • Renan Benedicto Pereira Reply

    Hi Peter, nice post!
    Have you tested monitoring a mix of Percona Server, RDS, XtraDB Cluster, etc., or were these tests based on just one kind of database instance?

    September 28, 2018 at 1:29 pm
    • Peter Zaitsev Reply

      This particular test was done on Percona Server 5.7 with recommended settings. You should have similar capacity with Percona XtraDB Cluster. RDS uses agent-less monitoring, which places more load on the PMM server, though I know of installations with hundreds of Amazon Aurora instances monitored by a single PMM server.

      September 29, 2018 at 10:07 am
  • 菜狗 Reply

    If PMM had alerting functionality, it would be even more powerful. Otherwise, we have to set up a separate monitoring system for alerting. (Google Translate)

    September 29, 2018 at 3:55 am
    • Peter Zaitsev Reply

      Thank you for your feedback. At this point you can use Grafana alerting with PMM, which I admit is pretty basic. In future versions we plan to simplify integration with Prometheus Alertmanager for more advanced alerting functionality.

      September 29, 2018 at 10:08 am
  • Judita De Guzman Reply

    I’m very impressed with this article it’s very helpful! Thank you, Peter.

    November 14, 2018 at 10:16 pm
