Hey readers, I am a data science student and have recently started learning more about data engineering. Data Science and Data Engineering teams co-exist… Read More »Data Engineering Part 1 – How to become a Big Data Engineer
This is part 2 of the blog series – How to analyze Kaggle data with Apache Spark and Zeppelin. In the first part we saw how… Read More »Part 2: How to create EMR cluster with Apache Spark and Apache Zeppelin
We had an EMR cluster reboot and hit this error all of a sudden. The error is independent of EMR, so it is worth sharing. Error: Caused by:… Read More »Spark-sql java.net.NoRouteToHostException on cluster reboot
This was a fun debugging exercise for a Hive-on-S3 use case. I thought I would log the debug steps here before I lose the details.… Read More »Debugging : Hive DAG did not succeed due to VERTEX_FAILURE. Unable to rename output.
Just look for the Hive config file. On EMR emr-4.7.2 it is here: less /etc/hive/conf/hive-site.xml – then look for the properties below in hive-site.xml: <property> <name>javax.jdo.option.ConnectionURL</name>… Read More »How to connect/query Hive metastore on EMR cluster
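If you would rather pull the metastore connection settings out of hive-site.xml programmatically than scroll through it with less, here is a minimal Python sketch. The XML embedded below is a made-up, trimmed sample (the host name and values are placeholders, not from a real cluster), and the helper name read_hive_properties is hypothetical; on a real node you would read the file contents from the path mentioned above instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down hive-site.xml content used only for illustration.
# The host name and values are placeholders, not taken from a real cluster.
SAMPLE_HIVE_SITE = """<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://example-host:3306/hive?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
</configuration>"""


def read_hive_properties(xml_text, prefix="javax.jdo.option."):
    """Return {name: value} for every <property> whose name starts with prefix."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name", default="")
        if name.startswith(prefix):
            props[name] = prop.findtext("value", default="")
    return props


metastore_props = read_hive_properties(SAMPLE_HIVE_SITE)
for key, value in sorted(metastore_props.items()):
    print(f"{key} = {value}")
```

Filtering on the javax.jdo.option. prefix keeps only the metastore connection settings and skips the rest of the file.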
Quick note – $ /usr/lib/hive/bin/schematool -dbType mysql -info Metastore connection URL: jdbc:mysql://ip-XX.XX.XX.XX:3306/hive?createDatabaseIfNotExist=true Metastore Connection Driver : org.mariadb.jdbc.Driver Metastore connection User: hive Hive distribution version: 0.14.0… Read More »How to get the Hive metastore version on EMR cluster
Debugging : Hive Dynamic partition Error : [Fatal Error] total number of created files now is 100028, which exceeds 100000. Killing the job.
[Fatal Error] total number of created files now is 900320, which exceeds 900000. Killing the job. tldr; quick fix – but probably not the right thing… Read More »Debugging : Hive Dynamic partition Error : [Fatal Error] total number of created files now is 100028, which exceeds 100000. Killing the job.
A Hive fixed-length SerDe can be used in scenarios where we do not have any delimiters in our data file. Using RegexSerDe for fixed length strings… Read More »Use Hive Serde for Fixed Length (index based) strings
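The idea behind using a regex for fixed-length records is that each capture group matches an exact number of characters and becomes one column. Below is a small Python sketch of that idea only; it is not the Hive table definition from the post. The field widths (5, 10, 3), the sample record and the helper name split_fixed_width are all hypothetical.

```python
import re

# Hypothetical layout: 5-char id, 10-char name, 3-char country code.
# Each capture group consumes an exact number of characters.
FIXED_WIDTH_PATTERN = re.compile(r"^(.{5})(.{10})(.{3})$")


def split_fixed_width(line):
    """Split one fixed-length record into trimmed fields, or return None on mismatch."""
    match = FIXED_WIDTH_PATTERN.match(line)
    if match is None:
        return None
    return tuple(field.strip() for field in match.groups())


record = split_fixed_width("00042Jane Doe  USA")
print(record)
```

A line that does not have exactly the expected length fails to match and returns None, so malformed records are easy to spot.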
By default, data import in Hive expects a directory name, specified in the query by the LOCATION keyword. By default Hive picks up all the files… Read More »Hive – Selected data import/query – Files and folders (mapred.input.dir.recursive)
Integrating Hive 0.9.0 with HBase 0.94.3 – Identifying root cause for RuntimeException: Error while reading from task log url
The last post here was on integrating Hive 0.11.0 with HBase 0.94.2. But because of issue HIVE-4515, we are currently not able to query HBase… Read More »Integrating Hive 0.9.0 with HBase 0.94.3 – Identifying root cause for RuntimeException: Error while reading from task log url
There is a cool post on the Apache wiki: HBase Hive integration. This post is a simplified compilation of the same. Hive: 0.11.0 HBase:… Read More »HBase Hive integration – Querying HBase via Hive