Hey readers, I am learning Data Engineering from last few months and I thought of sharing my learning with you all. Recently I made a project… Read More »Real world application project for Big Data – with Apache Spark and AWS-EMR
Hey readers, in previous post I have explained How to create a python ETL Project. In this post, I will explain how we can schedule/productionize our… Read More »Data Engineering Part 2 – Productionizing Big data ETL with Apache Airflow
Hey Readers, I am a Data Science Student and recently I have started learning more about Data Engineering. Data Science and Data Engineering teams co-exist… Read More »Data Engineering Part 1 – How to become a Big Data Engineer
This is part-2 of the blog series – How to analyze Kaggle data with Apache Spark and Zeppelin. In the first part we saw how… Read More »Part 2: How to create EMR cluster with Apache Spark and Apache Zeppelin
We had a EMR cluster reboot and hit this error all of sudden. The error is independent of EMR so worth sharing. Error: Caused by:… Read More »Spark-sql java.net.NoRouteToHostException on cluster reboot
This was a fun debug activity for a Hive-on-S3 use case. Thought of writing a log of debug steps here before I lose the details.… Read More »Debugging : Hive DAG did not succeed due to VERTEX_FAILURE. Unable to rename output.
Just Look for the hive config file – On EMR emr-4.7.2 it is here – less /etc/hive/conf/hive-site.xml Look for the below properties in the hive-site <property> <name>javax.jdo.option.ConnectionURL</name>… Read More »How to connect/query Hive metastore on EMR cluster
Quick note – $ /usr/lib/hive/bin/schematool -dbType mysql -info Metastore connection URL: jdbc:mysql://ip-XX.XX.XX.XX:3306/hive?createDatabaseIfNotExist=true Metastore Connection Driver : org.mariadb.jdbc.Driver Metastore connection User: hive Hive distribution version: 0.14.0… Read More »How to get the Hive metastore version on EMR cluster
Debugging : Hive Dynamic partition Error : [Fatal Error] total number of created files now is 100028, which exceeds 100000. Killing the job.
[Fatal Error] total number of created files now is 900320, which exceeds 900000. Killing the job. tldr; quick fix – but probably not the right thing… Read More »Debugging : Hive Dynamic partition Error : [Fatal Error] total number of created files now is 100028, which exceeds 100000. Killing the job.
TLDR; A Pig Logical plan is the Plan DAG that is used to execute the chain oj Jobs on Hadoop. Here is the code snippet… Read More »How to get Pig Logical plan (Execution DAG) from Pig Latin script
Hive fixed length serde can be used in scenarios where we do not have any delimiters in out data file. Using RegexSerDe for fixed length strings… Read More »Use Hive Serde for Fixed Length (index based) strings