How to write gzip-compressed JSON output from a Spark DataFrame

A compression codec can be specified in Spark through the Hadoop output properties:

import org.apache.spark.SparkConf

val conf = new SparkConf()
conf.set("spark.hadoop.mapred.output.compress", "true")
conf.set("spark.hadoop.mapred.output.compression.codec", "org.apache.hadoop.io.compress.GzipCodec")
conf.set("spark.hadoop.mapred.output.compression.type", "BLOCK")

The same settings can be passed to the Spark shell as:

$> spark-shell --conf spark.hadoop.mapred.output.compress=true \
     --conf spark.hadoop.mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
     --conf spark.hadoop.mapred.output.compression.type=BLOCK
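The same --conf flags also work with spark-submit when running a packaged job; the class and jar names below are only illustrative:

$> spark-submit --class com.example.JsonWriter \
     --conf spark.hadoop.mapred.output.compress=true \
     --conf spark.hadoop.mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
     --conf spark.hadoop.mapred.output.compression.type=BLOCK \
     json-writer.jar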

The code for writing the JSON/text output is the same as usual:

case class C(key: String, value: String)
val list = List(C("a", "b"), C("c", "d"), C("e", "f"))
val rdd = sc.makeRDD(list)      // distribute the local list as an RDD
import sqlContext.implicits._   // brings in the rdd.toDF conversion
val df = rdd.toDF
df.write.mode("append").json("s3://work/data/tests/json")

That's it. The output directory should now contain gzip-compressed part files with a .gz extension.
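As a side note, on Spark 2.x and later the JSON writer also accepts a per-write compression option, which avoids the global Hadoop settings altogether. A minimal sketch, assuming a Spark 2.x SparkSession (the app name is illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("gzip-json").getOrCreate()
import spark.implicits._

// build the DataFrame from a local list of tuples, naming the columns
val df = List(("a", "b"), ("c", "d"), ("e", "f")).toDF("key", "value")

// "compression" accepts gzip, bzip2, deflate, etc. for the JSON source
df.write.mode("append").option("compression", "gzip").json("s3://work/data/tests/json")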
