Feb 04, 2017
FOSDEM 17
Alexander Rubin

Apache Spark is a cluster computing framework, similar to Apache Hadoop. There are a number of tasks where MySQL does not perform well: for example, MySQL is not a massively parallel system, and a single query will only utilize one CPU core. Spark, on the other hand, is designed to be massively parallel. In addition, Spark is a clustering framework, so you can easily add more compute nodes, letting Spark use more resources and scale.

Apache Drill is a similar project that aims to make data discovery easier. For example, it allows you to join data sources in MySQL, MongoDB, flat files, other RDBMSs, etc.

In this talk I will demonstrate how to use Apache Spark together with MySQL for data analysis. I will show how Apache Spark aggregates data (Wikipedia pageview statistics) and stores the result set in MySQL. I will also show how to use Apache Spark with multiple sources and join virtual tables from MySQL, flat files and even MongoDB.
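
To give a rough idea of what such a workflow can look like, here is a minimal PySpark sketch, not the code from the talk. It assumes the pageview files use the space-separated layout "project page_title view_count bytes", that a MySQL JDBC driver is on Spark's classpath, and that the table names pageview_totals and projects exist only for illustration (both are hypothetical).

```python
from typing import Optional, Tuple


def parse_pageview_line(line: str) -> Optional[Tuple[str, str, int]]:
    """Parse one 'project page_title view_count bytes' line; None if malformed."""
    parts = line.split(" ")
    if len(parts) != 4:
        return None
    project, title, views, _size = parts
    try:
        return project, title, int(views)
    except ValueError:
        return None


def aggregate_to_mysql(input_path: str, jdbc_url: str, props: dict) -> None:
    """Sum views per project with Spark and store the result set in MySQL."""
    # Imported lazily so the parser above can be used without Spark installed.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("pageviews-to-mysql").getOrCreate()

    # Read the flat files in parallel and drop lines that fail to parse.
    rows = (
        spark.sparkContext.textFile(input_path)
        .map(parse_pageview_line)
        .filter(lambda r: r is not None)
    )
    pageviews = spark.createDataFrame(rows, ["project", "title", "views"])

    # Aggregate in Spark, then write the (small) result to MySQL over JDBC.
    totals = pageviews.groupBy("project").agg(F.sum("views").alias("total_views"))
    totals.write.jdbc(jdbc_url, "pageview_totals", mode="overwrite", properties=props)

    # Join the aggregate with a table that lives in MySQL
    # ("projects" is a hypothetical table with a "project" column).
    projects = spark.read.jdbc(jdbc_url, "projects", properties=props)
    totals.join(projects, on="project").show()

    spark.stop()
```

A call might pass a glob of pageview files as input_path, a URL such as jdbc:mysql://localhost:3306/wikistats as jdbc_url, and a props dictionary holding user, password and driver entries. The driver class name depends on the Connector/J version in use, so treat all of these values as placeholders.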

About the Author

Alexander Rubin

Alexander joined Percona in 2013. He has worked with MySQL since 2000 as a DBA and application developer. Before joining Percona, he did MySQL consulting as a principal consultant for over 7 years (starting with MySQL AB in 2006, then Sun Microsystems, and then Oracle). He has helped many customers design large, scalable and highly available MySQL systems and optimize MySQL performance. Alexander has also helped customers design Big Data stores with Apache Hadoop and related technologies.