This is part 3 of the blog series – How to analyze Kaggle data with Apache Spark and Zeppelin. In the first part we saw how to copy Kaggle data to Amazon S3, and in the second part we created an EMR cluster with Apache Spark and Zeppelin. In this post we will query the data via Zeppelin.
Organize the data
Find the data
As part of post 1, we got the data from Kaggle and saved it on Amazon S3 under s3://nikita-ds-playground/data/kaggle/nhl-game/.
However, this is a single zip file containing multiple files. Let's extract the individual files so we can create Hive/Spark tables on the data.
Copy zip file to local box
$ aws s3 ls s3://nikita-ds-playground/data/kaggle/nhl-game/
2018-10-28 13:00:42  125822231 nhl-game-data.zip

$ cd /tmp
$ aws s3 cp s3://nikita-ds-playground/data/kaggle/nhl-game/nhl-game-data.zip .
download: s3://nikita-ds-playground/data/kaggle/nhl-game/nhl-game-data.zip to ./nhl-game-data.zip
$ unzip nhl-game-data.zip
Archive:  nhl-game-data.zip
  inflating: game.csv
  inflating: game_plays_players.csv
  inflating: player_info.csv
  inflating: game_plays.csv
  inflating: game_skater_stats.csv
  inflating: game_goalie_stats.csv
  inflating: game_shifts.csv
  inflating: game_teams_stats.csv
  inflating: table_relationships.JPG
  inflating: team_info.csv
Copy back to s3
$ aws s3 cp ./game.csv s3://nikita-ds-playground/data/kaggle/nhl-game/game/
$ aws s3 cp ./team_info.csv s3://nikita-ds-playground/data/kaggle/nhl-game/team_info/
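Rather than copying each CSV by hand, the same pattern can be scripted. A minimal sketch, assuming the bucket layout shown above (one S3 prefix per file, named after the file):

```shell
# Hedged sketch: push every extracted CSV into its own S3 prefix so each
# file can back one external table. BUCKET follows the paths used above;
# adjust to your own bucket.
BUCKET="s3://nikita-ds-playground/data/kaggle/nhl-game"
for f in *.csv; do
  [ -e "$f" ] || continue          # skip if the glob matched nothing
  table="${f%.csv}"                # game.csv -> game, team_info.csv -> team_info
  aws s3 cp "./$f" "$BUCKET/$table/"
done
```

Each file then lands in its own prefix, which is what Hive/Spark external tables expect: a directory per table, not a bare file.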
Query data via Zeppelin
We covered getting Zeppelin running on the EMR cluster in post 2. We should be able to access Zeppelin via our browser at the master node's Zeppelin endpoint (port 8890 by default on EMR).
We can now create an external table on our data. We need the column information to write the table DDL; we can get this from the dataset's page on Kaggle.
Let's use the column info to create our external table DDL:
CREATE EXTERNAL TABLE IF NOT EXISTS datascience.kaggle_nhl_game (
  -- column info from Kaggle
  game_id                STRING,
  season                 STRING,
  type                   STRING,
  date_time              STRING,
  away_team_id           STRING,
  home_team_id           STRING,
  away_goals             INT,
  home_goals             INT,
  outcome                STRING,
  home_rink_side_start   STRING,
  venue                  STRING,
  venue_link             STRING,
  venue_time_zone_id     STRING,
  venue_time_zone_offset STRING,
  venue_time_zone_tz     STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://nikita-ds-playground/data/kaggle/nhl-game/game/';
Voila. We can now query our Kaggle data via SQL.
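For instance, a %sql paragraph in Zeppelin can aggregate over the table we just defined. A sample query (hypothetical, but using only columns from the DDL above):

```sql
%sql
-- total goals scored per season
SELECT season,
       COUNT(*)                     AS games,
       SUM(home_goals + away_goals) AS total_goals
FROM   datascience.kaggle_nhl_game
GROUP  BY season
ORDER  BY season
```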
Note: SQL is not the only option; we can also use Python to query our data via Zeppelin. Learn more about Zeppelin's interpreter integrations and language support.
That's all for this post. I hope the series was helpful. Cheers.
I am a greenhorn Data Science student with an interest in finding patterns in data. My language of choice is Python, and I am starting to get my hands dirty with R.
I blog on Medium.com and ConfusedCoders.com. I share my code on Github.com.