Welcome to the next Percona Live featured talk with Percona Live Data Performance Conference 2016 speakers! In this series of blogs, we’ll highlight some of the speakers that will be at this year’s conference, as well as discuss the technologies and outlooks of the speakers themselves. Make sure to read to the end to get a special Percona Live registration bonus!
In this Percona Live featured talk, we’ll meet Anastasia Ailamaki, Professor and CEO, EPFL and RAW Labs. Her talk will be RAW: Fast queries on JIT databases. RAW is a query engine that reads data in its raw format and processes queries using adaptive, just-in-time operators. The key insight is its use of virtualization and dynamic generation of operators. I had a chance to speak with Anastasia and learn a bit more about RAW and JIT databases:
Percona: Give me a brief history of yourself: how you got into database development, where you work, what you love about it.
Anastasia: I am a computer engineer and initially trained on networks. I came across databases in the midst of the object-oriented hype — and was totally smitten by both the power of data models and the wealth of problems one had to solve to create a functioning and performant database system. In the following years, I built several systems as a student and (later) as a coder. At some point, however, I needed to learn more about the machine. I decided to do a Masters in computer architecture, which led to a Ph.D. in databases and microarchitecture. I became a professor at CMU, where for eight years I guided students as they built their ideas into real systems that assessed their ideas potential and value. During my sabbatical at EPFL, I was fascinated by the talent and opportunities in Switzerland — I decided to stay and, seven years later, co-founded RAW Labs.
Percona: Your talk is going to be on “RAW: Fast queries on JIT databases.” Would you say you’re an advocate of abandoning (or at least not relying on) the traditional “big structured database accessed by queries” model that have existed for most of computing? Why?
Anastasia: The classical usage paradigm for databases has been “create a database, then ask queries.” Traditionally, “creating a database” means creating a structured copy of the entire dataset. This is now passé for the simple reason that data is growing too fast, and loading overhead grows with data size. What’s more, we typically use only a small fraction of the data available, and investing in the mass of owned data is a waste of resources — people have to wait too long from the time they receive a dataset until they can ask a query. And it doesn’t stop there: the users are asked to pick a database engine based on the format and intended use of the data. We associate row stores to transactions, NoSQL to JSON, and column stores to analytics, but true insight comes from combining all of the data semantically as opposed to structurally. With each engine optimizing for specific kinds of queries and data formats, analysts subconsciously factor in limitations when piecing together their infrastructure. We only know the best way to structure data when we see the queries, so loading data and developing query processing operators before knowing the queries is premature.
Percona: What are the conditions that make JIT databases in general (and RAW specifically) the optimum solution?
Anastasia: JIT databases push functionality to the last minute, and execute it right when it’s actually needed. Several systems perform JIT compilation of queries, which offer great performance benefits (an example is Hyper, a system recently acquired by Tableau). RAW is JIT on steroids: it leaves data at its source and only reads it or asks for any system resources when they’re actually required. You may have 10000 files, and a file will only be read when you ask a query that needs the data in it. With RAW, when the user asks a query the RAW code-generates raw source data adaptors and the entire query engine needed to run the query. It stores all useful information about the accessed data, as well as popular operators generated in the past, and uses them to accelerate future queries. It adapts to system resources on the fly and only asks for them when needed. RAW is an interface to raw data and operational databases, and uses them to accelerate future queries. It adapts to system resources on the fly and only asks for them when needed. In addition, the RAW query language is incredibly rich; it is a superset of SQL which allows navigation on hierarchical data and tables at the same time, with support for variable assignments, regular expressions, and more for log processing — while staying in declarative land. Therefore, the analysts only need to describe the desired result in SQL, without thinking of data format.
Percona: What would you say in the next step for JIT and RAW? What keeps you up at night concerning the future of this approach?
Anastasia: The next step for RAW is to reach out to as many people as possible — especially users with complex operational data pipelines — and reduce cost and eliminate pipeline stages, unneeded data copies, and extensive scripting. RAW is a new approach that can work with existing infrastructures in a non-intrusive way. We are well on our way with several proof-of-concept projects that create verticals for RAW, and demonstrate its usefulness for different applications.
Percona: What are you most looking forward to at Percona Live Data Performance Conference 2016?
Anastasia: I am looking forward to meeting as many users and developers as possible, hearing their feedback on RAW and our ideas, and learning from their experiences.
You can read more about RAW and JIT databases at Anastasia’s academic group’s website: dias.epfl.ch.
Want to find out more about Anastasia and RAW? Register for Percona Live Data Performance Conference 2016, and see her talk RAW: Fast queries on JIT databases. Use the code “FeaturedTalk” and receive $100 off the current registration price!
Percona Live Data Performance Conference 2016 is the premier open source event for the data performance ecosystem. It is the place to be for the open source community as well as businesses that thrive in the MySQL, NoSQL, cloud, big data and Internet of Things (IoT) marketplaces. Attendees include DBAs, sysadmins, developers, architects, CTOs, CEOs, and vendors from around the world.
The Percona Live Data Performance Conference will be April 18-21 at the Hyatt Regency Santa Clara & The Santa Clara Convention Center.