Spark SQL job executing very slowly – Performance tuning

I had been struggling with a basic Spark SQL job that was unable to process tens of gigabytes in several hours. That's when I dug into the 'spark.sql.shuffle.partitions' setting, whose default value tends to slow the job down dramatically.

Adding the changes below to the Spark SQL code fixed the issue for me. Magic.

# Coalesce to handle a large number of small input files
events = sqlContext.createDataFrame(rows).coalesce(400)
events.registerTempTable("input_events")

# Override the default shuffle partition count of 200
sqlContext.sql("SET spark.sql.shuffle.partitions=10")
sqlContext.sql(sql_query)
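
If you are on a newer Spark release that uses SparkSession instead of SQLContext, the same override can be applied through the runtime configuration. A minimal sketch, assuming a SparkSession named spark and the same rows and sql_query variables as above:

# Assumes a SparkSession named `spark` (Spark 2.x+), plus the same `rows` and `sql_query`
# Override the default shuffle partition count of 200 before running the query
spark.conf.set("spark.sql.shuffle.partitions", "10")

# Coalesce to handle a large number of small input files
events = spark.createDataFrame(rows).coalesce(400)
events.createOrReplaceTempView("input_events")

result = spark.sql(sql_query)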

Yash Sharma is a Big Data & Machine Learning Engineer and a newbie open-source contributor who plays guitar and enjoys teaching as a part-time hobby.
Talk to Yash about distributed systems and data platform designs.
