Building a scalable time-series database on PostgreSQL
Today everything is instrumented, generating ever more time-series data streams that need to be monitored and analyzed. To store this data, many developers start with a well-trusted system like PostgreSQL, but once their data reaches a certain scale, they give up its query power and ecosystem by migrating to a NoSQL or other "modern" time-series architecture. They face the traditional trade-off: query power or scale.

In this talk, I describe why this perceived trade-off isn't necessary, and how we've built an efficient, scalable time-series database engineered up from PostgreSQL. In particular, time-series workloads in devops, monitoring, IoT, finance, and elsewhere consist largely of inserting new data about recent events, which presents very different demands than general transactional (OLTP) workloads. We've architected our time-series database to embrace and take advantage of these differences.

The system automatically partitions data across both time and space, yet exposes the illusion of a single continuous table, a hypertable, across all of your data, whether it is spread across one server or many. Its distributed query optimizations hide the fact that users are interacting with many "chunks" of data, which are right-sized by volume and time constraints, and minimize which chunks are accessed, and how, to answer a query. In fact, the database supports "full SQL" against this hypertable (e.g., secondary indexes, rich query predicates and GROUP BYs, aggregations, window functions, CTEs, JOINs).

Through performance benchmarks, I show how the database scales much better than PostgreSQL, even on a single node. In particular, by appropriately sizing chunks, it avoids the "performance cliff" that vanilla PostgreSQL experiences once tables reach tens to hundreds of millions of rows, while also offering some compelling query performance improvements.
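As a rough illustration of partitioning across time and space, the sketch below routes a row to a chunk by its timestamp and a hash of a partition key, then lists only the chunks a time-range query would need to touch. This is not the database's actual implementation: the fixed one-day interval, the four hash buckets, the CRC32 hash, and the names `ChunkId`, `chunk_for`, and `chunks_for_query` are all assumptions made for the example, and it ignores sizing chunks by volume.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional
import zlib

# Hypothetical settings, chosen only for illustration.
CHUNK_INTERVAL = timedelta(days=1)   # width of each time slice
SPACE_PARTITIONS = 4                 # hash buckets over the partition key
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


@dataclass(frozen=True)
class ChunkId:
    time_slot: int   # index of the time interval since the epoch
    space_slot: int  # hash bucket of the partition key


def _space_slot(key: str) -> int:
    """Map a partition key (e.g. a device ID) to a hash bucket."""
    return zlib.crc32(key.encode("utf-8")) % SPACE_PARTITIONS


def chunk_for(ts: datetime, key: str) -> ChunkId:
    """Route one row to a chunk by its timestamp and partition key."""
    time_slot = (ts - EPOCH) // CHUNK_INTERVAL
    return ChunkId(time_slot, _space_slot(key))


def chunks_for_query(start: datetime, end: datetime,
                     key: Optional[str] = None) -> List[ChunkId]:
    """List only the chunks that can hold rows with start <= ts < end."""
    first = (start - EPOCH) // CHUNK_INTERVAL
    # The end bound is exclusive, so step back one microsecond before slotting.
    last = (end - EPOCH - timedelta(microseconds=1)) // CHUNK_INTERVAL
    # Without a key predicate, every space partition must be considered.
    spaces = range(SPACE_PARTITIONS) if key is None else [_space_slot(key)]
    return [ChunkId(t, s) for t in range(first, last + 1) for s in spaces]
```

Under these assumptions, a query that filters on both a one-day time range and a single key touches one chunk, while the same query without the key predicate touches one chunk per space partition. Inserts of recent data likewise land in the few chunks covering the latest interval.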
The database is implemented as a PostgreSQL extension, released under the Apache 2 license. A single-node beta release is available on GitHub, with the clustered version under development.
Co-founder/CTO, Timescale - Professor of Computer Science, Princeton University
Michael J. Freedman is a Professor in the Computer Science Department at Princeton University, as well as the co-founder and CTO of Timescale, building an open-source database that scales out SQL for time-series data. His work broadly focuses on distributed systems, networking, and security. Freedman developed and operated several self-managing systems -- including CoralCDN, a decentralized content distribution network, and DONAR, a server resolution system that powered the FCC's Consumer Broadband Test -- which reached millions of users daily. Freedman's work on IP geolocation and intelligence led him to co-found Illuminics Systems, which was acquired by Quova (now part of Neustar) in 2006. His work on programmable enterprise networking (Ethane) helped form the basis for the OpenFlow / software-defined networking (SDN) architecture. Freedman is also a technical advisor to Blockstack, building decentralized services leveraging the blockchain. Honors include a Presidential Early Career Award for Scientists and Engineers (PECASE, given by President Obama), a Sloan Fellowship, an NSF CAREER Award, an Office of Naval Research Young Investigator Award, DARPA Computer Science Study Group membership, and multiple award-winning publications. Prior to joining Princeton in 2007, he received his Ph.D. in computer science from NYU's Courant Institute and his S.B. and M.Eng. degrees from MIT.