How to generate synthetic log data for data analysis

Often we need some synthetic data to get started with data analysis. In this post we will discuss how to generate synthetic log data.

In this post we will:

  • Generate log data
  • Copy data to S3


Generate log data

I came across this great Git repo that can generate synthetic Apache access logs for us. We will use it for our data generation.

Clone the repo

We will clone the Git repo using the following commands:

$ git clone https://github.com/kiritbasu/Fake-Apache-Log-Generator.git
$ cd Fake-Apache-Log-Generator

After cloning the repo, let’s have a look at the README file to learn about its usage.

$ less README.md

Install dependencies

We will install the dependencies needed to run the Python script. Installing dependencies globally can interfere with other Python libraries, so to avoid that we will create a virtual env and install our dependencies inside it.

Create and activate the virtual env:

$ virtualenv -p python2.7 venv
$ source venv/bin/activate

Note: Install virtualenv if it is not already installed:

$ pip install virtualenv

Install dependencies:

$ pip install -r requirements.txt

Run the script

Let’s print some sample log lines to the terminal:

$ python apache-fake-log-gen.py -n 20

Now let’s write the log to a gzipped file instead:

$ python apache-fake-log-gen.py -n 20 -o GZ 

To generate multiple log files, we just have to run the command multiple times.
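
If you prefer to script this rather than re-run the command by hand, here is a minimal sketch (my own illustration, not part of the repo) that calls the generator in a loop. The file and line counts are placeholders.

# Minimal sketch: run the generator several times to produce multiple gzipped log files.
import subprocess

NUM_FILES = 3           # how many log files to generate (placeholder)
LINES_PER_FILE = 1000   # log lines per file (placeholder)

for _ in range(NUM_FILES):
    # Each run writes a new access_log_<timestamp>.log.gz in the current directory
    subprocess.check_call(
        ["python", "apache-fake-log-gen.py", "-n", str(LINES_PER_FILE), "-o", "GZ"]
    )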


Copy data to S3

To copy the data to S3, we first need to decide the path where we will keep our data.

It’s always a good idea to keep the raw data separate from the processed/cleaned data. It’s also a good idea to keep the data in date-based sub-directories/paths.
I have selected this path:

 s3://<your-S3-bucket>/raw/access-log/2018-12-28/

We need to set up the AWS CLI to copy data to S3.

Here is a link on how to copy data from the local system to an S3 bucket:
https://confusedcoders.com/data-engineering/how-to-copy-kaggle-data-to-amazon-s3
Commands to copy the data to S3:

$ aws s3 cp ./access_log_20181228-130813.log.gz s3://<your-S3-bucket>/raw/access-log/2018-12-28/
$ aws s3 cp ./access_log_20181228-132020.log.gz s3://<your-S3-bucket>/raw/access-log/2018-12-28/
$ aws s3 cp ./access_log_20181228-132022.log.gz s3://<your-S3-bucket>/raw/access-log/2018-12-28/

Check that all the files were copied:

$ aws s3 ls s3://<your-S3-bucket>/raw/access-log/2018-12-28/

I have created a Python script that can generate synthetic log data and copy it to S3 directly.
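
For reference, here is a minimal sketch of what such a script might look like. This is my own illustration with placeholder names (bucket and prefix), not the exact script; it assumes boto3 is installed, AWS credentials are already configured, and the log files have been generated as shown above.

# Minimal sketch: upload the generated .log.gz files to a date-partitioned S3 prefix.
import glob
from datetime import date

import boto3

BUCKET = "your-s3-bucket"                                  # placeholder bucket name
PREFIX = "raw/access-log/%s/" % date.today().isoformat()   # e.g. raw/access-log/2018-12-28/

s3 = boto3.client("s3")
for path in glob.glob("access_log_*.log.gz"):
    # upload_file(local_path, bucket, key)
    s3.upload_file(path, BUCKET, PREFIX + path)
    print("uploaded to s3://%s/%s%s" % (BUCKET, PREFIX, path))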

