This is a three-post series on querying Kaggle data on an Amazon EMR cluster. I will use Apache Zeppelin for data exploration, with Apache Spark running the queries under the hood.
Most of the complexity will be hidden from us, since Amazon EMR takes care of it.
Here are the three posts in the series:
- Part 1: How to copy Kaggle data to Amazon S3
- Part 2: How to create an EMR cluster with Apache Spark and Apache Zeppelin
- Part 3: Query Kaggle data via Apache Zeppelin
Each post includes examples and a complete walkthrough of the steps involved. I hope the series is helpful.
Cheers