Hey readers, I am learning Data Engineering from last few months and I thought of sharing my learning with you all. Recently I made a project on a real life application of a Big Data pipeline and thought to share with you so that if anyone is interested they can go through this and practice for self learning .
Nikita is a new Data Engineer at a great startup. The company is currently getting lots of Apache/application logs being copied to Amazon S3. Nikita’s team is assigned the task to process all of the high volume logs, and to expose this data to the Analyst and Scientist of the organisation.
The Analyst and Scientist prefer consuming the data as SQL or python. The primary objective for Nikita’s team is to make the access to data easier and optimised.
Here I am talking about the problem statements and not about the solutions because at this point I don’t want to bias your opinions with my solutions. After knowing the problems we will discuss how to implement the solutions.
We want to build a Data lake to store all our structured and unstructured data in one place.
Big Data Concepts
We need to get better at the Big data concepts by applying optimization techniques on our data, like:
- Partitioned data/tables
- Using columnar format like parquet
Getting the Data
Since this is a real world project, it might be worth having some real world data as well. Since this is apache log format data we might need to get some log data for our project. It might be cool if we can copy the log data to S3 in intervals to mimic the production like scenario.
Big Data ETLs
We will create daily/hourly job to process new incoming data and to create aggregate/summary tables on our data to make queries faster on the data. The ETLs we need to create are
- Daily job to process raw incoming data (Apache log format) and to clean and extract relevant columns from data
- Hourly/daily summary job to aggregate raw tables. These tables will be exposed to the analysts for querying. Since these tables are summarised view, the queries are being faster as compared to the raw table.
Note: All the ETLs have to be incremental, that can run on hourly/daily schedule.
Big Data Processing Engine and infrastructure
We need to decide any Big data engine for our data processing ETLs and we need to decide any cloud based infrastructure that supports our execution engine. We will talk about the solution later in this post.
Now we have Big data in place, we need to schedule all our Big data ETLs to run hourly/daily basis.
Exposing the Data to Analysts
Once the ETLs starts running incrementally we will have all our data organised in tables, we will now need to expose our data to Analysts. The Analysts and Scientists are the final consumers of the data. Since the Analysts are comfortable with SQL and Python, we have to keep that in consideration while selecting the right tools.
This is one of the very standard industries problems and there are several ways the problem can be approached and solved. There is no right and wrong solutions to the problems as long as we can balance the expectations and tradeoffs.
Here is the design I came of with, and use could choose to replace any of the component with something you are more comfortable with.
Amazon S3 as Data Lake
Here I have used Amazon S3 as data lake which can be used to store and retrieve any amount of data, at any time, from anywhere on the web. You can choose any other storage service to store your data in one place. I have used Amazon S3 because :
- It is easy to use
- It stores both structured and unstructured data
- It is cost effective
- It allows the user to run the Big data analytics on a particular system without moving it to another analytics system.
Airflow as Job Scheduler
In this project, I have schedule my Big data ETLs through Airflow which can be used to manage data pipeline. I choose Airflow because:
- It provides nice visibility over daily runs
- Easy failure and crash recovery
- Lots of predefined operators and library for integration
Apache Spark on Amazon EMR Cluster
Here I have created EMR cluster which helps in big data processing and analysis. EMR basically provides the infrastructure to run Apache Hadoop, Spark, Zeppelin, Hive, etc on Amazon Web Services.
I chose Apache Spark because :
- It supports multiple languages
- It supports hive-SQL
- It has integration with S3 and can read structured/unstructured data.
- It supports both streaming and batch.
Big Data ETLs
For this project,I have generated a log data and stored my generated data in Amazon S3.We also need to create a clean data or extract table from raw log data partitioned by date via Spark-Sql or Hive and store in S3 as parquet file format. And then we can also create aggregated summary table and store to S3.
In the previous post I have explained :
- How to generate log data
- How to create a Big Data ETL project
- How to Productionizing Big data ETL with Apache Airflow
Apache Zeppelin for consumption layer
Here I chose Apache Zeppelin for exposing the Data to Analysts. Apache Zeppelin is a web-based notebook, where we can explore our data with built in visualizations and supports SQL, Scala and Python. By default, Apache Zeppelin provides some basic charts to display the result of our processing data.
In previous post, I have discussed how to query data via Apache Zeppelin.
All the code
All the code for our project can be found here – https://github.com/nikkisharma536/airflow-jobs
Hope you like this post. Stay tuned for more blogs.
Checkout my portfolio here: https://confusedcoders.com/nikita-sharma-greenhorn-data-science-student
I am a greenhorn Data Science student with interest in finding patterns in data. My language of choice is Python and I am starting to get my hands dirty with R.
I blog on Medium.com  and ConfusedCoders.com . I share my code on Github.com .