This is part-3 of the blog series – How to analyze Kaggle data with Apache Spark and Zeppelin. In the first part we saw how to copy Kaggle data to Amazon S3, and in the second part we Created an EMR cluster with Apache Spark and Zeppelin. In this post we will be querying the data via Zeppelin.
Organize the data
Find the data
As part of Post 1, we got the data from Kaggle and saved it on Amazon S3 in location –
s3://nikita-ds-playground/data/kaggle/nhl-game/
However this data is a Zip file of multiple files. Lets get individual files to create Hive/Spark tables on the data.
Copy zip file to local box
$ aws s3 ls s3://nikita-ds-playground/data/kaggle/nhl-game/ 2018-10-28 13:00:42 125822231 nhl-game-data.zip $ cd /tmp $ aws s3 cp s3://nikita-ds-playground/ data/kaggle/nhl-game/nhl-game- data.zip . download: s3://nikita-ds-playground/ data/kaggle/nhl-game/nhl-game- data.zip to ./nhl-game-data.zip
Unzip file
$ unzip nhl-game-data.zip Archive: nhl-game-data.zip inflating: game.csv inflating: game_plays_players.csv inflating: player_info.csv inflating: game_plays.csv inflating: game_skater_stats.csv inflating: game_goalie_stats.csv inflating: game_shifts.csv inflating: game_teams_stats.csv inflating: table_relationships.JPG inflating: team_info.csv
Copy back to s3
$ aws s3 cp ./game.csv s3://nikita-ds-playground/data/kaggle/nhl-game/game/ $ aws s3 cp ./team_info.csv s3://nikita-ds-playground/ data/kaggle/nhl-game/team_ info/
Query data via Zeppelin
Access Zeppelin
We have discussed all about getting Zeppelin running on EMR cluster as part of post 2. We should be able to access Zeppelin via our browser at:
http://<master-ip-address>:8890/
Create table
We can now create an External table on our data. We need the column information for creating the table DDL. We can get this information from Kaggle –
Lets use the column info and create our External table DDL –
CREATE EXTERNAL TABLE IF NOT EXISTS datascience.kaggle_nhl_game ( -- get column info from kaggle game_id STRING, season STRING, type STRING, date_time STRING, away_team_id STRING, home_team_id STRING, away_goals INT, home_goals INT, outcome STRING, home_rink_side_start STRING, venue STRING, venue_link STRING, venue_time_zone_id STRING, venue_time_zone_offset STRING, venue_time_zone_tz STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION 's3://nikita-ds-playground/data/kaggle/nhl-game/game/';
Query data
Voila. We can now query our Kaggle data via SQL.
Note: Not only SQL. We can use Python to query our data via Zeppelin as well. Learn more about Zeppelin integrations and language support.
Thats all for this post. I hope the post series was helpful. Cheers.