A compressed output format can be specified in Spark by setting the Hadoop output-compression properties on the configuration:
import org.apache.spark.SparkConf

val conf = new SparkConf()
// Enable compression for Hadoop-based output formats
conf.set("spark.hadoop.mapred.output.compress", "true")
// Use gzip as the compression codec
conf.set("spark.hadoop.mapred.output.compression.codec", "org.apache.hadoop.io.compress.GzipCodec")
// Compress whole blocks rather than individual records
conf.set("spark.hadoop.mapred.output.compression.type", "BLOCK")
The same settings can be passed to the Spark shell as --conf options:
$> spark-shell --conf spark.hadoop.mapred.output.compress=true --conf spark.hadoop.mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec --conf spark.hadoop.mapred.output.compression.type=BLOCK
The code for writing the JSON/text output is the same as usual:
// A simple case class gives the rows a schema
case class C(key: String, value: String)
val list = List(C("a", "b"), C("c", "d"), C("e", "f"))
val rdd = sc.makeRDD(list)
// toDF comes from the implicit conversions on sqlContext (pre-defined in spark-shell)
import sqlContext.implicits._
val df = rdd.toDF
// Appends gzip-compressed JSON part files under the target path
df.write.mode("append").json("s3://work/data/tests/json")
That's it. The output directory should now contain gzip-compressed .gz part files.
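As an aside, on Spark 2.x and later the DataFrameWriter exposes compression directly, so for JSON/text output the Hadoop properties above can be replaced by a per-write option. A minimal sketch, reusing the df and path from above:

// Spark 2.x+: request gzip compression on the writer itself
df.write.mode("append").option("compression", "gzip").json("s3://work/data/tests/json")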