spark

Real world application project for Big Data – with Apache Spark and AWS-EMR

Nikita Sharma
February 24, 2019
airflow, AWS, data engineering, data storage, ETL, spark

Hey readers, I have been learning Data Engineering for the last few months and thought I would share what I have learned with you all. Recently I made a project…

Data Engineering Part 1 – How to become a Big Data Engineer

Nikita Sharma
January 15, 2019 / February 11, 2019
AWS, data engineering, ETL, hive, spark

Hey Readers, I am a Data Science Student and recently I have started learning more about Data Engineering. Data Science and Data Engineering teams co-exist…

Public Speaking at Web Analytics Wednesday Meetup, Sydney

Nikita Sharma
December 12, 2018
public speaking, spark

It was an eventful evening yesterday when I gave my first-ever talk. I spoke about Analytics with Apache Spark and Zeppelin on Amazon…

Handpicked Spark configs to make the job runs faster

Yash Sharma
November 9, 2018
AWS, data engineering, spark

Here is a collection of Spark configs that have helped make jobs run faster. Most of the configs come with trade-offs but work very…

Query Kaggle data via Apache Spark and Zeppelin via EMR cluster

Nikita Sharma
October 29, 2018
AWS, data engineering, spark

This is a 3-post series on querying Kaggle data on an EMR cluster. I will be using Apache Zeppelin for the data exploration, and internally…

Part 3: Query Kaggle data via Apache Zeppelin

Nikita Sharma
October 29, 2018 / February 24, 2019
AWS, data engineering, spark

This is part-3 of the blog series – How to analyze Kaggle data with Apache Spark and Zeppelin. In the first part we saw how to copy Kaggle data…

Part 2: How to create EMR cluster with Apache Spark and Apache Zeppelin

Nikita Sharma
October 28, 2018
AWS, data engineering, hive, spark

This is part-2 of the blog series – How to analyze Kaggle data with Apache Spark and Zeppelin. In the first part we saw how…

Spark-sql java.net.NoRouteToHostException on cluster reboot

Yash Sharma
July 12, 2017
hive, spark

We had an EMR cluster reboot and hit this error all of a sudden. The error is independent of EMR, so it is worth sharing. Error: Caused by:…

Spark append mode for partitioned text file fails with SaveMode.Append – IOException File already Exists

Yash Sharma
July 6, 2016 / July 11, 2016
spark

Code:

    dataDF.write.partitionBy("year", "month", "date").mode(SaveMode.Append).text("s3://data/test2/events/")

Error:

    16/07/06 02:15:05 ERROR datasources.DynamicPartitionWriterContainer: Aborting task.
    java.io.IOException: File already exists:s3://path/1839dd1ed38a.gz
        at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.create(S3NativeFileSystem.java:614)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:913)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:894)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:791)
        at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.create(EmrFileSystem.java:177)…

How to write gzip compressed Json in spark data frame

Yash Sharma
June 23, 2016
spark

A compressed format can be specified in Spark as:

    conf = SparkConf()
    conf.set("spark.hadoop.mapred.output.compress", "true")
    conf.set("spark.hadoop.mapred.output.compression.codec", "true")
    conf.set("spark.hadoop.mapred.output.compression.codec", "org.apache.hadoop.io.compress.GzipCodec")
    conf.set("spark.hadoop.mapred.output.compression.type", "BLOCK")

The same can be…
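The settings in this excerpt can be collected into one small helper. This is a minimal sketch, not the post's own code: the names `GZIP_OUTPUT_SETTINGS` and `apply_settings` are hypothetical, and the helper only assumes that the object passed in has a `set(key, value)` method, as PySpark's `SparkConf` does. The codec key is assigned twice in the excerpt and only the last assignment takes effect, so the sketch keeps the codec class name.

```python
# Hypothetical helper: gathers the compression settings from the excerpt
# into one dictionary so they can be applied in a single call.
GZIP_OUTPUT_SETTINGS = {
    "spark.hadoop.mapred.output.compress": "true",
    # The excerpt sets this key twice; the last value (the codec class) wins.
    "spark.hadoop.mapred.output.compression.codec":
        "org.apache.hadoop.io.compress.GzipCodec",
    "spark.hadoop.mapred.output.compression.type": "BLOCK",
}


def apply_settings(conf, settings=None):
    """Call conf.set(key, value) for every setting and return conf.

    `conf` is assumed to be a pyspark.SparkConf, but any object that
    exposes set(key, value) will work.
    """
    if settings is None:
        settings = GZIP_OUTPUT_SETTINGS
    for key, value in settings.items():
        conf.set(key, value)
    return conf
```

With PySpark installed, this would be used as `conf = apply_settings(SparkConf())` before creating the `SparkContext` with that `conf`.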

Spark Sql job executing very slow – Performance tuning

Yash Sharma
April 7, 2016
spark

I have been facing trouble with a basic Spark SQL job which was unable to process tens of gigabytes in hours. That's when I demystified…

Minimal Spark hello world

Yash Sharma
October 11, 2015
spark

1. Build Sbt

Create a build.sbt file. This manages all the dependencies and other settings that would have been in your pom file:

    import AssemblyKeys._
    import sbtassembly.Plugin._…