How to write gzip-compressed JSON from a Spark DataFrame

Output compression can be configured in Spark as follows:

conf = SparkConf()
conf.set("spark.hadoop.mapred.output.compress", "true")
conf.set("spark.hadoop.mapred.output.compression.codec", "org.apache.hadoop.io.compress.GzipCodec")
conf.set("spark.hadoop.mapred.output.compression.type", "BLOCK")

The same settings can be passed to spark-shell:

$> spark-shell --conf spark.hadoop.mapred.output.compress=true --conf spark.hadoop.mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec --conf spark.hadoop.mapred.output.compression.type=BLOCK

The code for writing the JSON/text output is the same as usual:

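// Sample records to write out
case class C(key: String, value: String)
val list = List(C("a", "b"), C("c", "d"), C("e", "f"))
val rdd = sc.makeRDD(list)
import sqlContext.implicits._
val df = rdd.toDF
// Write as JSON; the output path is a placeholder, not taken from the original post
df.write.json("/tmp/json-output")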

That's it. We should now have gzip-compressed (.gz) files as output.
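To confirm that the output really is gzip-compressed JSON, one option is to read the part files back outside Spark. The following is a minimal sketch in plain Python, not part of the Spark job itself. It assumes the part files have been copied to a local directory, that their names start with "part-" and end in ".gz", and that each line holds one JSON record. The function name read_gzip_json_lines is made up for illustration.

```python
import glob
import gzip
import json
import os


def read_gzip_json_lines(output_dir):
    """Read every gzip-compressed JSON-lines part file under output_dir."""
    records = []
    # Assumption: part files are named "part-*" and end in ".gz".
    pattern = os.path.join(output_dir, "part-*.gz")
    for path in sorted(glob.glob(pattern)):
        # gzip.open decompresses transparently; reading raises OSError
        # if the file is not actually gzip data.
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:  # skip blank lines
                    records.append(json.loads(line))
    return records
```

Calling read_gzip_json_lines on the local copy of the output directory should return one dictionary per row of the DataFrame. If the files were written uncompressed, the read fails instead of silently succeeding, which makes this a quick sanity check on the compression settings.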



Yash Sharma is a Big Data & Machine Learning Engineer and a newbie open-source contributor who plays guitar and enjoys teaching as a part-time hobby.
Talk to Yash about distributed systems and data platform design.
