Part 3: Query Kaggle data via Apache Zeppelin

This is part 3 of the blog series, How to analyze Kaggle data with Apache Spark and Zeppelin. In the first part we saw how to copy Kaggle data to Amazon S3, and in the second part we created an EMR cluster with Apache Spark and Zeppelin. In this post we will query the data via Zeppelin.

Organize the data

Find the data

As part of post 1, we downloaded the data from Kaggle and saved it on Amazon S3 at:

s3://nikita-ds-playground/data/kaggle/nhl-game/

However, this data is a single zip file containing multiple files. Let's extract the individual files so we can create Hive/Spark tables on the data.

Copy zip file to local box

Let's copy the zip file locally, unzip it, and re-upload the contents to S3.
$ aws s3 ls s3://nikita-ds-playground/data/kaggle/nhl-game/
2018-10-28 13:00:42  125822231 nhl-game-data.zip
$ cd /tmp

$ aws s3 cp s3://nikita-ds-playground/data/kaggle/nhl-game/nhl-game-data.zip .
download: s3://nikita-ds-playground/data/kaggle/nhl-game/nhl-game-data.zip to ./nhl-game-data.zip

Unzip the file

$ unzip nhl-game-data.zip 
Archive:  nhl-game-data.zip
  inflating: game.csv                
  inflating: game_plays_players.csv  
  inflating: player_info.csv         
  inflating: game_plays.csv          
  inflating: game_skater_stats.csv   
  inflating: game_goalie_stats.csv   
  inflating: game_shifts.csv         
  inflating: game_teams_stats.csv    
  inflating: table_relationships.JPG  
  inflating: team_info.csv

Copy back to S3

The archive contains quite a few files. Let's copy a couple of them to S3.
Note: we need to copy each file to its own S3 path prefix, because a Hive external table's LOCATION points at a directory, not a single file.
$ aws s3 cp ./game.csv s3://nikita-ds-playground/data/kaggle/nhl-game/game/
$ aws s3 cp ./team_info.csv s3://nikita-ds-playground/data/kaggle/nhl-game/team_info/
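The per-file copies above can be generalized with a small loop. A minimal sketch, assuming the aws CLI is installed and configured; `DRY_RUN=1` (hypothetical flag, not part of the aws CLI) just prints the commands so you can preview the destinations first:

```shell
# Sketch: copy each extracted CSV into its own S3 prefix named after the file.
# Set DRY_RUN=0 to actually run the copies.
DRY_RUN=1
BUCKET_PREFIX="s3://nikita-ds-playground/data/kaggle/nhl-game"

for f in game.csv team_info.csv player_info.csv; do
  prefix="${f%.csv}"   # strip the .csv extension: game.csv -> game
  if [ "$DRY_RUN" = "1" ]; then
    echo "aws s3 cp ./$f $BUCKET_PREFIX/$prefix/"
  else
    aws s3 cp "./$f" "$BUCKET_PREFIX/$prefix/"
  fi
done
```

Each file lands under its own prefix, ready to back its own external table.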


Query data via Zeppelin

Access Zeppelin

We covered getting Zeppelin running on the EMR cluster in post 2. We should be able to access Zeppelin in a browser at:

http://<master-ip-address>:8890/

Create table

We can now create an external table on our data. We need the column information to write the table DDL; we can get this from the dataset's page on Kaggle.


Let's use the column info to create our external table DDL:

CREATE EXTERNAL TABLE IF NOT EXISTS datascience.kaggle_nhl_game
(
-- get column info from kaggle
game_id STRING,
season STRING,
type  STRING,
date_time STRING,
away_team_id STRING,
home_team_id STRING,
away_goals INT,
home_goals INT,
outcome STRING,
home_rink_side_start STRING,
venue STRING,
venue_link STRING,
venue_time_zone_id STRING,
venue_time_zone_offset STRING,
venue_time_zone_tz STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://nikita-ds-playground/data/kaggle/nhl-game/game/'
TBLPROPERTIES ('skip.header.line.count'='1'); -- skip the CSV header row


Query data

We can now query the data via Zeppelin.
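For example, a %sql paragraph like the following; the particular aggregate is just an illustration, using columns from the DDL above:

```sql
%sql
-- Illustrative query: total goals per season
SELECT season,
       SUM(home_goals + away_goals) AS total_goals
FROM datascience.kaggle_nhl_game
GROUP BY season
ORDER BY season;
```

Zeppelin renders the result as a table, with built-in charting options.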


Voilà. We can now query our Kaggle data via SQL.

Note: it's not only SQL; we can use Python to query our data via Zeppelin as well. Learn more about Zeppelin's interpreters and language support.
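As a sketch, a %pyspark paragraph could run the same kind of query through the Spark SQL API; this assumes the Spark interpreter is enabled (as set up in post 2) and that a `spark` session is exposed (older Zeppelin/Spark setups expose `sqlContext` instead):

```python
%pyspark
# Illustrative: the same aggregate via the Spark SQL API instead of a %sql paragraph
df = spark.sql("""
    SELECT season, SUM(home_goals + away_goals) AS total_goals
    FROM datascience.kaggle_nhl_game
    GROUP BY season
    ORDER BY season
""")
df.show()
```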

That's all for this post. I hope the series was helpful. Cheers.
