There are a lot of things I love about Prometheus: its data model is fantastic for monitoring applications, and the PromQL language is often more expressive than SQL for the data retrieval needs you have in the observability space. One thing I hate about Prometheus with a deep passion, though, is the behavior of its rate() and similar functions. That behavior is deeply rooted in the Prometheus computational model, and the development team told me it is not likely to change.
So What’s the Problem, and Why is it Such a Big Deal?
First, the problem. The rate() function gives you the average per-second rate of change of a time series over the interval supplied, so rate(mysql_global_status_questions[10s]) will basically give us the average number of MySQL questions per second over the last 10 seconds. Everything is great so far.
But what if the resolution of this time series is coarser than 10 seconds, for example, if we take the mysql_global_status_questions measurement only every minute? In this case, the 10-second window contains at most one sample, and rate() needs at least two samples inside the window to compute anything. The rate() function will return nothing and the data will disappear from the graph.
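To make the failure mode concrete, here is a minimal Python sketch of a window-limited rate calculation. It is a simplification for illustration, not the actual Prometheus implementation: it ignores extrapolation and counter resets, and the windowed_rate name and the sample data are my own assumptions.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def windowed_rate(samples: List[Sample], now: float, window: float) -> Optional[float]:
    """Per-second rate using ONLY the samples inside (now - window, now].

    Simplified illustration: no extrapolation, no counter-reset handling.
    """
    in_window = [(t, v) for t, v in samples if now - window < t <= now]
    if len(in_window) < 2:
        # Fewer than two samples in the window: nothing to compute.
        return None
    (t0, v0), (t1, v1) = in_window[0], in_window[-1]
    return (v1 - v0) / (t1 - t0)


# Hypothetical counter scraped once a minute, growing by 100 per second.
samples = [(float(t), 100.0 * t) for t in range(0, 301, 60)]

print(windowed_rate(samples, now=300.0, window=10.0))   # None
print(windowed_rate(samples, now=300.0, window=120.0))  # 100.0
```

With a 10-second window only the sample at t=300 qualifies, so there is no answer; widening the window to 120 seconds brings in a second sample and the rate appears.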
What would I like to see instead? Give the common sense answer! If I tell you the MySQL Questions counter was 1M at 0:00 and 2M at 10:00, and ask you what the average number of queries per second was from 4:00 to 5:00, you will just use the best estimate you have and give the average based on the data available.
Of course, such an approach is not without its problems. For example, it is possible that MySQL actually reached 10M queries at 5:00, was then restarted, and climbed back to 2M by 10:00, in which case the estimate will be wrong. Yet I believe that in most cases having such data is preferable to having no data at all.
One “solution” Prometheus provides to this problem is the irate() function, which gives you the “instant rate” based on the last two data points in the time series. If you use irate() with a large enough interval, you can avoid getting “no data”, but you run into another problem: you get very volatile data based on just two measurements, when a less volatile value, smoothed over a longer period of time, might be what you want.
Another problem with irate() is that only rate() has such a counterpart, while other functions such as avg_over_time() or max_over_time() do not have any great options.
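The “common sense” estimate from the example above can be sketched in a few lines of Python. This is only my reading of the idea (take the nearest samples on either side of the requested window and use their slope); the estimated_rate function is hypothetical, and I am not claiming this is how any particular database implements it. Like the approach itself, it knows nothing about counter resets.

```python
from bisect import bisect_left, bisect_right
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def estimated_rate(samples: List[Sample], start: float, end: float) -> Optional[float]:
    """Best-guess per-second rate for [start, end] from the surrounding samples.

    Assumes samples are sorted by timestamp. Counter resets are NOT detected.
    """
    times = [t for t, _ in samples]
    i = bisect_right(times, start) - 1  # last sample at or before `start`
    j = bisect_left(times, end)         # first sample at or after `end`
    if i < 0 or j >= len(samples) or i == j:
        return None  # the window is not bracketed by two distinct samples
    (t0, v0), (t1, v1) = samples[i], samples[j]
    return (v1 - v0) / (t1 - t0)


# 1M questions at 0:00 and 2M at 10:00 (600 seconds later).
samples = [(0.0, 1_000_000.0), (600.0, 2_000_000.0)]

# Average queries per second between 4:00 and 5:00.
rate = estimated_rate(samples, start=240.0, end=300.0)
print(round(rate, 2))  # 1666.67
```

One million questions over 600 seconds works out to roughly 1,667 per second, which is the answer most people would give without thinking twice.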
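The volatility trade-off is easy to see in a small Python sketch. Both functions below are simplified stand-ins I wrote for illustration (no extrapolation, no counter-reset handling), and the bursty sample data is made up: one takes the last two samples in the window, the other takes the first and last.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def _in_window(samples: List[Sample], now: float, window: float) -> List[Sample]:
    return [(t, v) for t, v in samples if now - window < t <= now]


def instant_rate(samples: List[Sample], now: float, window: float) -> Optional[float]:
    """irate()-style: slope between the LAST TWO samples in the window."""
    pts = _in_window(samples, now, window)
    if len(pts) < 2:
        return None
    (t0, v0), (t1, v1) = pts[-2], pts[-1]
    return (v1 - v0) / (t1 - t0)


def average_rate(samples: List[Sample], now: float, window: float) -> Optional[float]:
    """rate()-style: slope between the FIRST and LAST samples in the window."""
    pts = _in_window(samples, now, window)
    if len(pts) < 2:
        return None
    (t0, v0), (t1, v1) = pts[0], pts[-1]
    return (v1 - v0) / (t1 - t0)


# Hypothetical counter sampled every 10 seconds: quiet, then one burst.
samples = [(0.0, 0.0), (10.0, 100.0), (20.0, 200.0),
           (30.0, 300.0), (40.0, 400.0), (50.0, 1400.0)]

print(instant_rate(samples, now=50.0, window=60.0))  # 100.0
print(average_rate(samples, now=50.0, window=60.0))  # 28.0
```

Over the same 60-second window, the two-sample version reports the burst alone, while the averaged version smooths it across the whole interval.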
One solution, which is often recommended, is to just build your dashboards to match the data capture resolution so you can’t get into such situations.
This is a non-starter for our use case at Percona.
We use Prometheus as a key component in Percona Monitoring and Management (PMM), where the data capture resolution is highly configurable, so it can differ between periods of time and between time series in the system. Additionally, most of the dashboards we provide are dynamic, using a shorter averaging period as you “zoom in” to the data.
VictoriaMetrics to the Rescue
VictoriaMetrics is a time series database which can be connected to Prometheus using the RemoteWrite backend. It implements a read API that is mostly compatible with Prometheus, as well as MetricsQL, a query language that is mostly compatible with PromQL and offers some additional features.
VictoriaMetrics has other advantages compared to Prometheus, including massively parallel operation for scalability, better performance, and better data compression, but what we focus on in this blog post is its handling of the rate() function.
VictoriaMetrics handles the rate() function in the common sense way I described earlier!
Let’s take a look at the difference in practice. Here I am using a prototype build of Percona Monitoring and Management with VictoriaMetrics. In the “Questions” panel we use the needlessly complicated formula:
Which we can simplify to the “common sense” formula we’d like to use, without workarounds required:
Let’s now compare the graphs between Prometheus (left) and VictoriaMetrics (right).
For a 1-hour range, the averaging interval is long enough that both Prometheus and VictoriaMetrics display data. The differences between the graphs come from the fact that these are two separate instances running similar workloads rather than the same data in both data stores.
In this case, as you can see, Prometheus shows no data while VictoriaMetrics provides data even though the requested resolution is 1 second and data is available only at 5-second resolution.
We’re very early in the process of evaluating VictoriaMetrics, but I’m super thrilled it solves this very annoying problem we have with Prometheus query handling. I wonder if this is a problem for you as well, and whether you too find the VictoriaMetrics behavior more user-friendly or prefer Prometheus’ behavior in your environment.