Solr: How to index 10 billion phrases from MySQL and HBase
Seznam.cz is the largest and most visited web portal and search engine in the Czech Republic. It is one of the few search engines in the world that successfully compete with Google in local full-text search. Besides the search engine, Seznam runs over 40 different web services, such as news portals, a map portal, an email service and many more. This variety of tasks brings many challenges that require integrating full-text indexers such as Elasticsearch, Sphinx and Solr.
Among our full-text indexing projects, the most challenging was a new feature of our PPC advertising system Sklik.cz (similar to Google AdWords). The main goal was to provide real-time full-text search optimized for high-volume traffic, so that our customers can manage their advertising more effectively. The system must return results immediately after a customer types two or more characters that match any substring in our text data. This project required indexing almost 10 billion phrases, where a phrase is a group of tokens. One part of the phrases is stored in a MySQL cluster across different sets of tables, and the other part is stored in our data warehouse, which is implemented on HBase. Besides searching, the system has to perform near-real-time indexing in the background, because the data in the databases grow continuously.
While describing this example project, we will introduce Apache Solr for users who have not yet had a chance to work with this full-text indexer. We will focus on Solr's architecture, Lucene, indexing strategies, performance tuning, SolrCloud, RDBMS (MySQL) and HBase synchronization, HDFS and Hadoop integration, and more. During the presentation we will compare Solr with Elasticsearch and Sphinx. Attendees will gain basic knowledge of how this full-text indexer fits into modern, scalable database ecosystems.
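To make the substring requirement concrete, here is a minimal sketch of one common way to match a query of two or more characters against any substring of a phrase: an inverted index over character n-grams. This is an illustration only and not the Sklik implementation; the class `SubstringIndex`, the helper `ngrams` and the sample phrases are hypothetical, and a production system would rely on the indexer's own analysis chain rather than in-memory Python sets.

```python
from collections import defaultdict


def ngrams(text, n=2):
    """Return the set of all character n-grams of `text`, lowercased."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


class SubstringIndex:
    """Toy inverted index: character n-gram -> ids of phrases containing it."""

    def __init__(self, n=2):
        self.n = n                        # minimum query length (assumed: 2)
        self.phrases = []                 # phrase id -> phrase text
        self.postings = defaultdict(set)  # n-gram -> set of phrase ids

    def add(self, phrase):
        phrase_id = len(self.phrases)
        self.phrases.append(phrase)
        for gram in ngrams(phrase, self.n):
            self.postings[gram].add(phrase_id)

    def search(self, query):
        query = query.lower()
        grams = ngrams(query, self.n)
        if not grams:
            return []  # query is shorter than n characters
        # Candidates must contain every n-gram of the query.
        candidates = set.intersection(
            *(self.postings.get(gram, set()) for gram in grams)
        )
        # Sharing all n-grams does not guarantee a contiguous match,
        # so verify each candidate with a real substring test.
        return sorted(
            self.phrases[pid]
            for pid in candidates
            if query in self.phrases[pid].lower()
        )


index = SubstringIndex(n=2)
for phrase in ["cheap flights", "flight tickets", "car rental"]:
    index.add(phrase)

matches = index.search("ligh")  # matches both phrases containing "flight"
```

The trade-off in this approach is index size: every phrase contributes one posting per distinct n-gram, which is part of what makes indexing billions of phrases demanding.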
Senior developer, Seznam.cz
Miroslav is an experienced developer and data specialist at Seznam.cz, the largest Czech web services company. He started out working with web technologies and MySQL, moved on to Python and MongoDB, and now works with data in Hadoop, Impala and Solr.
SW Architect, Seznam.cz
Tomas is a big-data architect and database specialist at the Czech company Seznam.cz. He has over eight years of experience in the design, development and optimization of database systems, focusing on MySQL, HBase, Impala, Solr and Hive. He organizes MySQL and Hadoop training sessions and workshops for his colleagues at Seznam and externally for other companies and Czech universities.