Often, we need some synthetic data to get started on a data analysis journey. In this post we will:
- Generate synthetic log data
- Copy the data to S3
Generate log data
I came across a great git repo that can generate synthetic Apache access logs for us. We will use it for our data generation.
Clone the repo
We will clone the git repo using the following commands:
$ git clone https://github.com/kiritbasu/Fake-Apache-Log-Generator.git
$ cd Fake-Apache-Log-Generator
After cloning the repo, let’s have a look at the README file to learn about its usage.
$ less README.md
Install dependencies
We will install the dependencies needed to run the Python script. Installing dependencies globally can mess up other Python libraries, so we will create a virtualenv and install our dependencies inside it.
Create and activate the virtualenv:
$ virtualenv -p python2.7 venv
$ source venv/bin/activate
Note: install virtualenv first if it is not already available:
$ pip install virtualenv
Install dependencies:
$ pip install -r requirements.txt
Run the script
Let’s print some sample data on the terminal:
$ python apache-fake-log-gen.py -n 20
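Each generated line follows the usual Apache combined log format (IP, timestamp, request, status code, response size, referrer, user agent). The values are randomized, so an illustrative line (not actual output) looks something like:
127.0.0.1 - - [28/Dec/2018:13:08:13 +0000] "GET /apps/cart.jsp?appID=1234 HTTP/1.0" 200 5029 "http://example.com/" "Mozilla/5.0 ..."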
Now let’s write the log to a gzip file:
$ python apache-fake-log-gen.py -n 20 -o GZ
To generate multiple log files, we just have to run the command multiple times, for example with the small loop sketched below.
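This is a hypothetical helper using only the Python standard library; it simply invokes the generator script a few times (the count of 3 and the -n 100 value are arbitrary):

# Hypothetical helper: run the generator several times to get several files.
import subprocess

for _ in range(3):
    # Each run writes a new gzipped log file (access_log_<timestamp>.log.gz)
    subprocess.check_call(["python", "apache-fake-log-gen.py", "-n", "100", "-o", "GZ"])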
Copy data to S3
To copy the data to S3, first we need to decide the path where we will keep our data.
It’s always a good idea to keep the raw data separate from the processed/cleaned data. It’s also a good idea to partition the data into date sub-directories/paths.
I have selected this path:
s3://<your-S3-bucket>/raw/access-log/2018-12-28/
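With this convention, cleaned output can later live next to the raw data under its own prefix. The processed path below is just an illustration, not something we create in this post:
s3://<your-S3-bucket>/raw/access-log/2018-12-28/        (raw gzipped logs)
s3://<your-S3-bucket>/processed/access-log/2018-12-28/  (cleaned/parsed output)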
We need to set up the AWS CLI (typically by running aws configure with your access key, secret key and default region) before copying data to S3.
Here is a guide on copying data from a local system to an S3 bucket:
https://confusedcoders.com/data-engineering/how-to-copy-kaggle-data-to-amazon-s3
Commands to copy the data to S3:
$ aws s3 cp ./access_log_20181228-130813.log.gz s3://<your-S3-bucket>/raw/access-log/2018-12-28/
$ aws s3 cp ./access_log_20181228-132020.log.gz s3://<your-S3-bucket>/raw/access-log/2018-12-28/
$ aws s3 cp ./access_log_20181228-132022.log.gz s3://<your-S3-bucket>/raw/access-log/2018-12-28/
Check that all the files were copied:
$ aws s3 ls s3://<your-S3-bucket>/raw/access-log/2018-12-28/
I have created a Python script that can generate synthetic log data and copy it to S3 directly.
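That script is not reproduced here, but a minimal sketch of the same flow might look like this. It assumes the Fake-Apache-Log-Generator repo is cloned alongside it, boto3 is installed, and AWS credentials are configured; the bucket name and counts are placeholders:

# Sketch: generate a gzipped fake access log and upload it to a dated S3 prefix.
import datetime
import glob
import subprocess

import boto3

BUCKET = "your-S3-bucket"  # placeholder: replace with your bucket name

# Generate one gzipped log file in the current directory
subprocess.check_call(["python", "apache-fake-log-gen.py", "-n", "100", "-o", "GZ"])

# Upload every generated .log.gz file under raw/access-log/<today>/
s3 = boto3.client("s3")
date_prefix = datetime.date.today().isoformat()
for path in glob.glob("access_log_*.log.gz"):
    key = "raw/access-log/{}/{}".format(date_prefix, path)
    s3.upload_file(path, BUCKET, key)
    print("Uploaded {} to s3://{}/{}".format(path, BUCKET, key))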