I had been struggling with a basic Spark SQL job that couldn't process tens of gigabytes even after hours of running. That's when I demystified 'spark.sql.shuffle.partitions', whose default of 200 shuffle partitions can slow a job down insanely.
Adding the changes below to the Spark SQL code fixed the issue for me. Magic.
// For handling a large number of smaller files
val events = sqlContext.createDataFrame(rows).coalesce(400)
events.registerTempTable("input_events")

// For overriding the default value of 200
sqlContext.sql("SET spark.sql.shuffle.partitions=10")
sqlContext.sql(sql_query)
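If you're on Spark 2.x or later, the same tweak fits the SparkSession API. Here's a minimal sketch of that variant, assuming a hypothetical JSON input path and query (the path, column names, and query are placeholders for illustration, not part of the original job):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("shuffle-partitions-sketch")
  .getOrCreate()

// Same idea as above: cap the number of shuffle partitions before
// running the query; the default is 200.
spark.conf.set("spark.sql.shuffle.partitions", "10")

// Hypothetical input path; coalesce compacts many small input
// splits into fewer, larger partitions.
val events = spark.read.json("/data/input_events/").coalesce(400)
events.createOrReplaceTempView("input_events")

// Placeholder query standing in for sql_query above.
spark.sql("SELECT event_type, COUNT(*) AS cnt FROM input_events GROUP BY event_type").show()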
Yash Sharma is a Big Data & Machine Learning Engineer, a newbie open-source contributor, and a guitar player who enjoys teaching as a part-time hobby.
Talk to Yash about distributed systems and data platform design.