hive

Data Engineering Part 1 – How to become a Big Data Engineer

Nikita Sharma
January 15, 2019 (updated February 11, 2019)
AWS, data engineering, ETL, hive, spark

Hey readers, I am a data science student, and I have recently started learning more about data engineering. Data science and data engineering teams co-exist…

Query S3 data via Hive on local box

Nikita Sharma
December 28, 2018 (updated December 29, 2018)
hive

In the last post we discussed how to generate synthetic data. Here we will talk about how to query S3 data via Hive. Provide…

Part 2: How to create EMR cluster with Apache Spark and Apache Zeppelin

Nikita Sharma
October 28, 2018
AWS, data engineering, hive, spark

This is part 2 of the blog series "How to analyze Kaggle data with Apache Spark and Zeppelin". In the first part we saw how…

Debugging: Hive Dynamic Partition Error: [Fatal Error] total number of created files now is 100028, which exceeds 100000. Killing the job.

Yash Sharma
September 6, 2016 (updated April 24, 2017)
hive

[Fatal Error] total number of created files now is 900320, which exceeds 900000. Killing the job. tl;dr: a quick fix, but probably not the right thing…

Hive – Selected data import/query – Files and folders (mapred.input.dir.recursive)

Yash Sharma
December 25, 2013 (updated May 27, 2014)
hive

By default, data import in Hive expects a directory name, specified in the query with the LOCATION keyword. By default Hive picks up all the files…