MySQL and Impala as a SQL-friendly lambda architecture
Seznam.cz is the largest and most visited web portal and search engine in the Czech Republic. It is one of the few search engines in the world that successfully compete with Google in local full-text search. Besides the search engine, Seznam runs over 40 web services, including news portals, a map portal, an email service and many more. With so many services, we have many projects that need different data warehouses. The data warehouse presented here was designed for the PPC advertising system Sklik.cz to serve as an internal BI and real-time warehouse tool. Historically, the warehouse was implemented as a single MySQL instance holding tens of billions of rows across many tables, and critical analytical queries could run for hours. We had to choose an open-source solution that would provide query acceleration, nearly 100% SQL compatibility, easy scaling, long-term potential and near real-time changes in data (to give fast feedback to our other systems). We tested Apache Hive and Cloudera Impala intensively, selected Cloudera Impala, and successfully migrated critical parts of the original data warehouse to it. Impala is designed to run queries over data in Hadoop in real time using well-known standard SQL, and it fits very well into our existing Hadoop and MySQL ecosystem. By using Impala together with MySQL, Apache Sqoop and our own ORC/Parquet converter, we created a modern lambda architecture solution which allows transactional, real-time execution of DDL statements as well as long-term analytical querying and batch processing. We are continuously improving the system with the recently introduced Apache ORC format, which brings us another major speed improvement. During this presentation we will introduce Impala to users who haven’t had a chance to meet this distributed BI tool yet (the whole presentation is conceived from a MySQL user’s point of view).
We will describe the difference between Hive and Impala and explain what a lambda architecture is and why it is useful. We will focus on the Impala architecture, how Impala runs different types of queries (with an Impala and MySQL comparison), briefly on database schema design, how Impala fits into the Hadoop ecosystem, how to choose the proper data storage (HDFS text, Parquet, Avro, ORC, HBase or Kudu), types of tuning, the best ways to import data, and other topics. At the end we will mention possible competing solutions such as Apache Drill, Shark, Kylin and Druid.
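To make the lambda idea above concrete, here is a minimal sketch of a serving-layer query that merges a batch view with a real-time view. It is only an illustration under stated assumptions: the table names `clicks_batch` and `clicks_recent`, their columns, and the sample rows are hypothetical, and an in-memory SQLite database stands in for both stores (in a setup like the one described, the batch table would live in Impala and the recent table in MySQL).

```python
import sqlite3

# Hypothetical stand-ins: SQLite plays the role of both layers here.
# clicks_batch  ~ batch layer (e.g. Parquet/ORC data queried by Impala)
# clicks_recent ~ speed layer (e.g. fresh rows still held in MySQL)
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE clicks_batch  (day TEXT, campaign_id INTEGER, clicks INTEGER);
    CREATE TABLE clicks_recent (day TEXT, campaign_id INTEGER, clicks INTEGER);
    INSERT INTO clicks_batch  VALUES ('2016-01-01', 1, 100), ('2016-01-02', 1, 120);
    INSERT INTO clicks_recent VALUES ('2016-01-03', 1, 7), ('2016-01-03', 1, 5);
""")


def total_clicks(conn, campaign_id):
    """Serving-layer query: sum clicks over the batch and real-time views."""
    row = conn.execute(
        """
        SELECT COALESCE(SUM(clicks), 0)
        FROM (
            SELECT clicks FROM clicks_batch  WHERE campaign_id = ?
            UNION ALL
            SELECT clicks FROM clicks_recent WHERE campaign_id = ?
        ) AS merged
        """,
        (campaign_id, campaign_id),
    ).fetchone()
    return row[0]


print(total_clicks(conn, 1))
```

The point of the design is that the caller issues one SQL query and does not need to know which rows have already been moved to the batch layer; a periodic import job (not shown) would move rows from the recent table into the batch table.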
Lukas is a senior developer and team lead at the Czech company Seznam.cz. He has 10 years of experience in developing and designing scalable architecture solutions. He focuses on architecture, testing, and continuous delivery.
Senior BigData developer, Seznam.cz
Michal is a senior developer and database specialist at the Czech company Seznam.cz. He has 5 years of experience in the design, development and maintenance of a complex PPC system based on MySQL (Percona, MariaDB), Hadoop and HBase, and he works with Java, Python and C++ on a daily basis. He focuses on designing solutions that are as fast and effective as possible. Michal also regularly presents successful projects at Czech universities.