ConfusedCoders – Page 7 – the world is opensource

Notes on Fenwick tree

The Fenwick tree, more popularly known as the Binary Indexed Tree, solves Range Sum Query problems. The same can be achieved…
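To make the range-sum idea concrete, here is a minimal sketch of a Fenwick tree in Python. It is not taken from the post itself; the class and method names (`FenwickTree`, `update`, `prefix_sum`, `range_sum`, `build`) are chosen for illustration, and indices are assumed to be 0-based on the outside.

```python
class FenwickTree:
    """Binary Indexed Tree over n values: point updates and prefix sums in O(log n)."""

    def __init__(self, n):
        self.n = n
        self.tree = [0] * (n + 1)  # internal array is 1-based; slot 0 is unused

    def update(self, i, delta):
        """Add delta to the element at 0-based index i."""
        i += 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i  # jump to the next node whose range covers index i

    def prefix_sum(self, i):
        """Return the sum of the elements at 0-based indices 0..i inclusive."""
        i += 1
        total = 0
        while i > 0:
            total += self.tree[i]
            i -= i & -i  # strip the lowest set bit to move to the previous range
        return total

    def range_sum(self, left, right):
        """Return the sum of the elements at 0-based indices left..right inclusive."""
        if left == 0:
            return self.prefix_sum(right)
        return self.prefix_sum(right) - self.prefix_sum(left - 1)


def build(values):
    """Build a tree from a list by applying one point update per element."""
    tree = FenwickTree(len(values))
    for index, value in enumerate(values):
        tree.update(index, value)
    return tree
```

For example, `build([3, 2, -1, 6, 5]).range_sum(1, 3)` sums the elements at indices 1 to 3. Building through repeated updates costs O(n log n); a linear-time build exists but is left out to keep the sketch short.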

Create a basic distributed system in Go lang – Part 1

Yash Sharma
October 2, 2016
3 Comments
distributed systems, golang

Introduction: Yesterday I started looking into the basics of Go, a relatively young programming language, and thought of learning it by creating some useful case…

Yash joins Atlassian

Yash Sharma
October 1, 2016
life

This post has been on hold for a really long time. I have finally moved out of India after 26 long years on this adventurous…

Hello Go – Scratching the surface

Yash Sharma
October 1, 2016 (updated October 2, 2016)
1 Comment
golang

Export Go path
export GOPATH=/Users/ysharma/work/go/
Directory Structure
.
├── bin
│   ├── main
│   └── test
├── pkg
│   └── darwin_amd64
│       └── rect.a
└──…

How to connect/query Hive metastore on EMR cluster

Yash Sharma
September 20, 2016 (updated September 21, 2016)
2 Comments
hive

Just look for the Hive config file. On EMR emr-4.7.2 it is here:
less /etc/hive/conf/hive-site.xml
Look for the properties below in hive-site.xml:
<property>
<name>javax.jdo.option.ConnectionURL</name>…

How to get the Hive metastore version on EMR cluster

Yash Sharma
September 20, 2016 (updated October 1, 2016)
hive

Quick note:
$ /usr/lib/hive/bin/schematool -dbType mysql -info
Metastore connection URL: jdbc:mysql://ip-XX.XX.XX.XX:3306/hive?createDatabaseIfNotExist=true
Metastore Connection Driver : org.mariadb.jdbc.Driver
Metastore connection User: hive
Hive distribution version: 0.14.0…

Debugging : Hive Dynamic partition Error : [Fatal Error] total number of created files now is 100028, which exceeds 100000. Killing the job.

Yash Sharma
September 6, 2016 (updated April 24, 2017)
2 Comments
hive

[Fatal Error] total number of created files now is 900320, which exceeds 900000. Killing the job.
TL;DR: a quick fix, but probably not the right thing…

Spark append mode for partitioned text file fails with SaveMode.Append – IOException File already Exists

Yash Sharma
July 6, 2016 (updated July 11, 2016)
spark

Code:
dataDF.write.partitionBy("year", "month", "date").mode(SaveMode.Append).text("s3://data/test2/events/")
Error:
16/07/06 02:15:05 ERROR datasources.DynamicPartitionWriterContainer: Aborting task.
java.io.IOException: File already exists:s3://path/1839dd1ed38a.gz
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.create(S3NativeFileSystem.java:614)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:913)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:894)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:791)
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.create(EmrFileSystem.java:177)…

How to write gzip compressed Json in spark data frame

Yash Sharma
June 23, 2016
spark

A compressed format can be specified in Spark as:
from pyspark import SparkConf
conf = SparkConf()
conf.set("spark.hadoop.mapred.output.compress", "true")
conf.set("spark.hadoop.mapred.output.compression.codec", "org.apache.hadoop.io.compress.GzipCodec")
conf.set("spark.hadoop.mapred.output.compression.type", "BLOCK")
The same can be…

Spark Sql job executing very slow – Performance tuning

Yash Sharma
April 7, 2016
spark

I have been facing trouble with a basic Spark SQL job that was unable to process tens of gigabytes in hours. That's when I demystified…

Indexing csv data in Solr via Python – PySolr

Yash Sharma
October 11, 2015
solr

Here is a crisp post on indexing data in Solr using Python.
1. Install prerequisites
- pip
- PySolr
2. Python script
#!/usr/bin/python
import sys,…
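Since the script above is cut off, here is a rough sketch of what CSV indexing with PySolr can look like. It is not the post's original script: the helper names (`csv_to_docs`, `index_csv`), the `id` unique-key field, and the fallback to the row number are all assumptions made for illustration.

```python
import csv


def csv_to_docs(csv_file, id_field="id"):
    """Turn the rows of an open CSV file into a list of dicts, one per Solr document."""
    docs = []
    for row_number, row in enumerate(csv.DictReader(csv_file), start=1):
        doc = dict(row)
        # Assumption: the schema's unique key is `id_field`; if a row has no
        # value for it, fall back to the 1-based row number.
        if not doc.get(id_field):
            doc[id_field] = str(row_number)
        docs.append(doc)
    return docs


def index_csv(path, solr_url):
    """Read the CSV at `path` and send its rows to the Solr core at `solr_url`."""
    import pysolr  # third-party package: pip install pysolr

    with open(path, newline="") as handle:
        docs = csv_to_docs(handle)
    solr = pysolr.Solr(solr_url, timeout=10)
    solr.add(docs)   # send the documents to Solr
    solr.commit()    # make them visible to searches
    return len(docs)
```

A call might look like `index_csv("data.csv", "http://localhost:8983/solr/mycore")`, where both the file name and the core URL are placeholders. Keeping the CSV parsing separate from the Solr call makes the conversion easy to check without a running Solr instance.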

How to get Pig Logical plan (Execution DAG) from Pig Latin script

Yash Sharma
October 11, 2015
pig

TL;DR: A Pig logical plan is the plan DAG that is used to execute the chain of jobs on Hadoop. Here is the code snippet…
