ETL

Real world application project for Big Data – with Apache Spark and AWS-EMR

Nikita Sharma
February 24, 2019
airflow, AWS, data engineering, data storage, ETL, spark

Hey readers, I have been learning Data Engineering for the last few months and thought of sharing my learning with you all. Recently I made a project…

Data Engineering Part 2 – Productionizing Big data ETL with Apache Airflow

Nikita Sharma
February 11, 2019
airflow, data engineering, ETL

Hey readers, in the previous post I explained how to create a Python ETL project. In this post, I will explain how we can schedule/productionize our…

Data Engineering Part 1 – How to become a Big Data Engineer

Nikita Sharma
January 15, 2019 (updated February 11, 2019)
AWS, data engineering, ETL, hive, spark

Hey readers, I am a Data Science student and recently I have started learning more about Data Engineering. Data Science and Data Engineering teams co-exist…

Query S3 data via Hive on local box

Nikita Sharma
December 28, 2018 (updated December 29, 2018)
hive

In the last post we discussed how to generate synthetic data. Here we will talk about how to query S3 data via Hive. Provide…

Part 2: How to create EMR cluster with Apache Spark and Apache Zeppelin

Nikita Sharma
October 28, 2018
AWS, data engineering, hive, spark

This is part-2 of the blog series – How to analyze Kaggle data with Apache Spark and Zeppelin. In the first part we saw how…

Spark-sql java.net.NoRouteToHostException on cluster reboot

Yash Sharma
July 12, 2017
hive, spark

We had an EMR cluster reboot and hit this error all of a sudden. The error is independent of EMR, so it is worth sharing. Error: Caused by:…

Debugging : Hive DAG did not succeed due to VERTEX_FAILURE. Unable to rename output.

Yash Sharma
April 24, 2017 (updated April 26, 2017)
hive

This was a fun debug activity for a Hive-on-S3 use case. Thought of writing a log of debug steps here before I lose the details.…

How to connect/query Hive metastore on EMR cluster

Yash Sharma
September 20, 2016 (updated September 21, 2016)
2 Comments
hive

Just look for the Hive config file. On EMR emr-4.7.2 it is at /etc/hive/conf/hive-site.xml (view it with less /etc/hive/conf/hive-site.xml). Look for the properties below in hive-site.xml: <property> <name>javax.jdo.option.ConnectionURL</name>…
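As a rough illustration of the lookup described in this excerpt, the sketch below pulls the javax.jdo.option.* properties out of a hive-site.xml document with Python's standard library. The sample XML, the host name in it, the ConnectionUserName entry and the helper name read_hive_properties are assumptions made for the example, not taken from the post.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for /etc/hive/conf/hive-site.xml.
# The host name and values are placeholders, not real cluster settings.
SAMPLE_HIVE_SITE = """<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://example-host:3306/hive?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>hive.exec.scratchdir</name>
    <value>/tmp/hive</value>
  </property>
</configuration>
"""


def read_hive_properties(xml_text, prefix="javax.jdo.option."):
    """Return {name: value} for every <property> whose name starts with prefix."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name", default="").strip()
        value = prop.findtext("value", default="").strip()
        if name.startswith(prefix):
            props[name] = value
    return props


if __name__ == "__main__":
    for name, value in read_hive_properties(SAMPLE_HIVE_SITE).items():
        print(f"{name} = {value}")
```

On a real cluster the same helper could be fed the file contents, for example read_hive_properties(open("/etc/hive/conf/hive-site.xml").read()), assuming the file is readable by the current user.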

How to get the Hive metastore version on EMR cluster

Yash Sharma
September 20, 2016 (updated October 1, 2016)
hive

Quick note –
$ /usr/lib/hive/bin/schematool -dbType mysql -info
Metastore connection URL: jdbc:mysql://ip-XX.XX.XX.XX:3306/hive?createDatabaseIfNotExist=true
Metastore Connection Driver : org.mariadb.jdbc.Driver
Metastore connection User: hive
Hive distribution version: 0.14.0…

Debugging : Hive Dynamic partition Error : [Fatal Error] total number of created files now is 100028, which exceeds 100000. Killing the job.

Yash Sharma
September 6, 2016 (updated April 24, 2017)
2 Comments
hive

[Fatal Error] total number of created files now is 900320, which exceeds 900000. Killing the job. tldr; quick fix – but probably not the right thing…

How to get Pig Logical plan (Execution DAG) from Pig Latin script

Yash Sharma
October 11, 2015
pig

TLDR; A Pig Logical plan is the plan DAG that is used to execute the chain of jobs on Hadoop. Here is the code snippet…

Use Hive Serde for Fixed Length (index based) strings

Yash Sharma
May 12, 2014
2 Comments
hive

The Hive fixed-length SerDe can be used in scenarios where we do not have any delimiters in our data file. Using RegexSerDe for fixed length strings…