Here is a collection of Spark configs that have helped make my jobs run faster. Most of them come with trade-offs, but each works very well for a particular scenario.
Make Spark write fewer files
If we’re reading from a table with 100,000s of files, Spark is going to take parallelism very seriously and write out 10,000s of output files. This plays really badly with S3, because at the end of the job Spark renames the files on S3 sequentially, which can take another hour before the job finishes. We need to reduce the number of output files of the query, both to avoid the S3 copies and for the sake of downstream jobs.
To reduce Spark's shuffle parallelism in spark-sql. Note: this only works if your query has a shuffle, like an order-by or group-by. It won't work for simple select -> filter -> insert queries.
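The usual knob for this is `spark.sql.shuffle.partitions` (default 200). A minimal sketch; the value 10 is only an illustration to tune for your data size, and `my_query.sql` is a hypothetical query file:

```shell
# Each shuffle partition becomes one output file in the final stage,
# so capping shuffle parallelism also caps the file count.
spark-sql --conf spark.sql.shuffle.partitions=10 -f my_query.sql
```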
To reduce the number of files when no shuffles are involved. Note: I've picked a very high value (16 GiB); find an appropriate lower value that works for you. It needs more executor memory, but works for simple select -> insert queries without shuffles.
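The config typically doing the work here is `spark.sql.files.maxPartitionBytes`, which controls how many bytes of input go into a single partition; fewer, larger partitions mean fewer output files. A sketch, with `my_query.sql` as a hypothetical query file:

```shell
# Pack up to 16 GiB of input into each partition so a shuffle-free
# select -> filter -> insert query reads (and writes) far fewer splits.
# 17179869184 bytes = 16 GiB; start lower and raise it as memory allows.
spark-sql --conf spark.sql.files.maxPartitionBytes=17179869184 -f my_query.sql
```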
Make Spark table creation and first-time queries faster
Sometimes when a table is created in Spark and we query it for the first time, Spark tries to infer the schema of the table. This sometimes takes a ridiculously long time, and can even fail after 20-30 minutes. It is a particular issue when using Spark with the Amazon Glue metastore.
Turning off the case-sensitive schema inference mode switches off the expensive inference pass and makes the query succeed instantly.
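On Spark 2.x with a Hive-compatible metastore such as Glue, the relevant switch is `spark.sql.hive.caseSensitiveInferenceMode`; `my_query.sql` below is a hypothetical query file:

```shell
# NEVER_INFER trusts the schema stored in the metastore instead of
# scanning the underlying files to infer case-sensitive column names.
spark-sql --conf spark.sql.hive.caseSensitiveInferenceMode=NEVER_INFER -f my_query.sql
```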
Spark-sql insert-overwrite "File already exists" exceptions
When a task is slow, Spark uses speculative execution to run another copy of the same task. This is particularly annoying with Amazon S3 and its eventual consistency. Turn off Spark speculation for Spark-to-S3 insert queries.
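A sketch of disabling speculation at launch time (`insert_query.sql` is a hypothetical query file):

```shell
# With speculation off, no duplicate task attempt can race the
# original to write the same output file on S3.
spark-sql --conf spark.speculation=false -f insert_query.sql
```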
Spark S3 retries
These help when Spark fails to write its data to S3. To be honest this is not a very common error, but these configs have saved us a few times.
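Which retry keys apply depends on the S3 connector your cluster uses; assuming the s3a connector, its Hadoop settings can be passed through Spark's `spark.hadoop.` prefix, for example:

```shell
# Retry S3 requests more aggressively before giving up.
# Key names vary by connector and Hadoop version; these are s3a's.
spark-sql \
  --conf spark.hadoop.fs.s3a.attempts.maximum=20 \
  --conf spark.hadoop.fs.s3a.retry.limit=20 \
  -f my_query.sql
```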
Spark dynamic allocation
Set dynamic allocation to true so you don't have to set the number of executors for your job. It works great, except for one particular case:
Spark is very hungry for resources and will take up every resource your cluster has. If you share a cluster across multiple apps and have dynamic allocation enabled, make sure you limit the max executors; otherwise the first job will take up your entire queue's resources.
Another advantage of this limit is when you're reading from a table that has 10,000s of files. Spark sometimes creates 10,000s of executors, and that kills the driver node. Limiting the max executors helps.
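A sketch of the launch-time configs involved; the cap of 100 is only an illustration, and `my_query.sql` is a hypothetical query file:

```shell
# Dynamic allocation needs the external shuffle service on YARN;
# maxExecutors keeps one job from starving the shared queue.
spark-sql \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.maxExecutors=100 \
  -f my_query.sql
```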
I will keep adding to this list as I keep struggling to make Spark jobs faster. Cheers.
Yash Sharma is a Big Data & Machine Learning Engineer, a newbie open-source contributor who plays guitar and enjoys teaching as a part-time hobby.
Talk to Yash about Distributed Systems and Data platform designs.