Part 1: How to copy Kaggle data to Amazon S3

This is Part 1 of the blog series — How to analyze Kaggle data with Apache Spark and Zeppelin. This post gives a brief description of how to copy data from Kaggle to Amazon S3.

Env details:

  • Ubuntu
  • Python 3.6.3

Steps

We need these steps for our task:

  1. Install the Kaggle CLI and the AWS CLI.
  2. Download the dataset from Kaggle to your local machine.
  3. Copy the local file to Amazon S3.

Getting data from Kaggle

Install the Kaggle CLI

In order to download data from Kaggle we need to install the Kaggle CLI, which ships as the kaggle Python package. Use this command to install it:

pip install kaggle
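
Once the install finishes, you can confirm the CLI is available (assuming pip placed the kaggle executable on your PATH):

kaggle --version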

Generate API tokens

After installing the Kaggle CLI, go to your account and create an API token.

You can create a new API token from your Kaggle > Account section.

This downloads a kaggle.json file with the API token in it. We need to move this file to ~/.kaggle/kaggle.json.

Note: Please make sure to not share the tokens with anyone. These are secrets.

Let's create a folder in the home directory for our kaggle.json file, and move the file there:

mkdir -p ~/.kaggle
# mv <source path> <destination path>
mv ~/Downloads/kaggle.json ~/.kaggle/kaggle.json

We might get this warning if the file permissions are too relaxed:

Your Kaggle API key is readable by other users on this system.

To fix this issue, run the following command to make the file readable only by you:

chmod 600 ~/.kaggle/kaggle.json
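
As a quick check, list the file; the permissions should show that only you can read and write it:

ls -l ~/.kaggle/kaggle.json
# should show permissions of the form: -rw-------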

Now, copy the API command for the dataset you want to download from Kaggle and paste it into your terminal.

Download dataset from Kaggle

I am downloading the transactions-from-a-bakery dataset from Kaggle. We can get the download command from the dataset's Download section:

kaggle datasets download -d xvivancos/transactions-from-a-bakery

This downloads the dataset to your system as a zip archive. Now, we have to copy this file to Amazon S3.
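
Tip: if you would rather have the files extracted as soon as they are downloaded, the Kaggle CLI also supports an --unzip flag:

kaggle datasets download -d xvivancos/transactions-from-a-bakery --unzip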

Copy dataset to S3

Install the AWS CLI

In order to upload the file to S3 we need to install the AWS CLI.

pip install awscli --upgrade --user
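
Since we installed with --user, the executable lands in ~/.local/bin on Ubuntu; assuming that directory is on your PATH, you can confirm the install with:

aws --version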

Generate AWS access keys

We need to save the AWS access keys so the AWS CLI can use them. A new pair of keys can be created from the AWS web interface:

AWS > IAM > Users (from left navigation) > Select User > Security Credentials > Access Keys

Click on the Create access key button to generate a new access key for our user.

Save the credentials in a file at ~/.aws/credentials.

This is how your file should look:
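
The key values below are placeholders for illustration; substitute the access key ID and secret access key you just generated:

[default]
aws_access_key_id = <your access key id>
aws_secret_access_key = <your secret access key>

Alternatively, running aws configure in the terminal will prompt for these values and write the file for you.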

You can get more information on IAM credentials in the AWS IAM documentation.

Create S3 bucket

After installing the AWS CLI, we need to create an S3 bucket.
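
A bucket can be created from the AWS console, or straight from the terminal with the CLI's mb (make bucket) command. The bucket name below is a placeholder (bucket names must be globally unique):

aws s3 mb s3://<your bucket name>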

After creating the S3 bucket, we can upload our file there.

Upload dataset to S3 bucket

# aws s3 cp <local file path> <s3 path>
aws s3 cp ~/transactions-from-a-bakery.zip s3://nikita-ds-playground/data/kaggle/transactions-from-a-bakery/
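
To confirm the upload, you can list the destination prefix (same bucket and path as above):

aws s3 ls s3://nikita-ds-playground/data/kaggle/transactions-from-a-bakery/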


That's it for this post. This data can now be used for analysis straight from the S3 location. In a future post we will learn how to use this data for analysis on the AWS stack.

Next: Create an EMR cluster with Apache Spark and Apache Zeppelin.
