At ITSumma, we provide 24/7 site reliability engineering for more than 300 clients with over 10,000 servers in total, and we collect more than 200,000 metrics per second.
In 2010, we realized that existing monitoring systems could not handle our requirements. We needed to process and display analytics instantly, store at least one year of data at 15-second (or, better yet, 1-second) intervals, and run fast queries, completing in as little as 200 milliseconds, to retrieve high-resolution data snapshots.
That's why we developed our own monitoring system, and it worked well with the infrastructure of that time. By 2018, however, it could no longer meet the requirements of newer infrastructures and had, in some ways, outlived its usefulness.
Since late 2018, we have been developing a new monitoring system.
To support this project, we compared several major solutions for storing time-series data, including Prometheus storage, InfluxDB, Cassandra, ClickHouse, and others.
We investigated their capabilities with our production data in terms of performance, stability, scalability, and storage usage.
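As a rough illustration of the scale behind these storage comparisons, the sketch below estimates how many data points one year of retention implies at the ingest rate quoted above, and what that would occupy uncompressed. The 16 bytes per point (an 8-byte timestamp plus an 8-byte float value) and the function names are assumptions made for illustration; they are not figures from the tests described here, and real storage engines compress time-series data well below this raw size.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def points_retained(ingest_rate_per_s: int, retention_days: int = 365) -> int:
    """Number of data points written over the retention window at a steady ingest rate."""
    return ingest_rate_per_s * SECONDS_PER_DAY * retention_days


def raw_size_tib(points: int, bytes_per_point: int) -> float:
    """Uncompressed size in TiB for a given number of points."""
    return points * bytes_per_point / 2**40


INGEST_RATE = 200_000   # metrics per second, as stated in the text
BYTES_PER_POINT = 16    # assumption: 8-byte timestamp + 8-byte float, no compression

points = points_retained(INGEST_RATE)
print(f"{points:,} points per year")  # 6,307,200,000,000 points per year
print(f"{raw_size_tib(points, BYTES_PER_POINT):.1f} TiB uncompressed")
```

Under these assumptions, a year of data is about 6.3 trillion points, which is why storage usage per point matters as much as raw query speed when comparing candidates.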
At Percona Live, I would like to present our findings and show the results of our production and performance tests, which we consider useful for anyone interested in storing massive amounts of time-series data.
Evgeny Potapov is the founder and CEO of ITSumma.
Today, 80 experts work for ITSumma from offices in Moscow, Saint Petersburg, and the Siberian city of Irkutsk. The ITSumma team provides 24/7 support and site reliability engineering for national media, retail, and service giants such as TASS, the Russian national news agency; Mvideo, the largest consumer electronics retail chain in Russia; and S7 Airlines and Ural Airlines, the second- and fourth-largest airlines in Russia, respectively. Every day, 150 million people visit the websites ITSumma supports.