This is part 1 of the blog series — How to analyze Kaggle data with Apache Spark and Zeppelin. This post describes how to copy data from Kaggle to Amazon S3.
Env details:
- Ubuntu
- Python 3.6.3
Steps
We need these steps for our task –
- Install the Kaggle CLI and the AWS CLI.
- Download the dataset from Kaggle to your local box.
- Copy the local file to Amazon S3.
Getting data from Kaggle
Install Kaggle CLI
In order to download data from Kaggle we need to install the official Kaggle CLI (the kaggle Python package). Use this command to install it –
pip install kaggle
Generate API tokens
After installing the Kaggle CLI, go to your Kaggle > Account section and create a new API token.
This downloads a kaggle.json file containing the API token. We need to move this file to ~/.kaggle/kaggle.json.
Note: Make sure not to share the token with anyone; it is a secret.
Let's create a folder in the home directory for our kaggle.json file, and move the file there –
mkdir ~/.kaggle
mv <source path> <destination path>
mv ~/Downloads/kaggle.json ~/.kaggle/kaggle.json
We might get this error if the file permissions are too relaxed –
Your kaggle API key is readable by other users on this system.
To fix this issue, you can run the following command –
chmod 600 ~/.kaggle/kaggle.json
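As a quick check that the token is picked up, we can run any Kaggle CLI command, for example listing public datasets –
kaggle datasets list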
Now, copy the API command for the dataset you want to download and paste it into the terminal.
Download dataset from Kaggle
I am downloading the transactions-from-a-bakery dataset from Kaggle. We can get the download command from the dataset's Download section –
kaggle datasets download -d xvivancos/transactions-from-a-bakery
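If you prefer, the -p flag downloads to a specific directory and --unzip extracts the archive in one step; the ~/data path here is just an example –
kaggle datasets download -d xvivancos/transactions-from-a-bakery -p ~/data --unzip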
This downloads the Kaggle dataset, as a zip file, to your system. Now, we have to copy this file to Amazon S3.
Copy dataset to S3
Install AWS CLI
In order to upload the file to S3 we need to install the AWS CLI.
pip install awscli --upgrade --user
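We can confirm the installation with –
aws --version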
Generate AWS access keys
We need to save a set of AWS access keys to use the AWS CLI. A new pair of keys can be created from the AWS web console.
AWS > IAM > Users (from left navigation) > Select User > Security Credentials > Access Keys
Click on the Create access key button to generate a new access key for our user.
Save the credentials in a file at ~/.aws/credentials.
This is how the file should look, with your own keys in place of the placeholders –
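[default]
aws_access_key_id = <your access key id>
aws_secret_access_key = <your secret access key>
Alternatively, running aws configure will prompt for the keys and write this file for you.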
You can get more information on IAM credentials in the AWS IAM documentation.
Create S3 bucket
After installing the AWS CLI, we need to create an S3 bucket.
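A bucket can be created from the S3 console, or directly from the CLI with the mb command; here we use the same bucket name as in the upload step below –
aws s3 mb s3://nikita-ds-playground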
After creating the S3 bucket, we can upload our file there.
Upload dataset to S3 bucket
aws s3 cp <local file path> <s3 path>
aws s3 cp ~/nhl-game-data.zip s3://nikita-ds-playground/data/kaggle/nhl-game/
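We can verify that the upload succeeded by listing the S3 path –
aws s3 ls s3://nikita-ds-playground/data/kaggle/nhl-game/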
That's it for this post. This data can now be used for analysis from its S3 location. In a future post we will learn how to use this data for analysis on the AWS stack.
Next: Create an EMR cluster with Apache Spark and Apache Zeppelin.