General / Rule of Thumb Spark Configuration Settings

Solved!
importthepandas

We are using managed Spark over Kubernetes in EKS. We have about 80 active users on our design node, and about half of them use Spark regularly. We've tried to make things easy by creating simple Spark configurations, but we find ourselves changing configurations continuously. With multiple Spark applications, has anyone used spark.dynamicAllocation.enabled? Has anyone established good "boilerplate" configs for different Spark workloads in their deployments?


Operating system used: Ubuntu 18

1 Solution
AlexT
Dataiker

Hi @importthepandas,


We typically suggest starting with a few boilerplate configs and then modifying them as needed, sometimes even on a per-recipe basis rather than in the instance-level config.

A common config could look like:

spark.sql.broadcastTimeout = 3600
spark.port.maxRetries = 200
spark.executor.extraJavaOptions = -Duser.timezone=GMT
spark.driver.extraJavaOptions = -Duser.timezone=GMT
spark.driver.host = 10.192.71.7
spark.network.timeout = 500s
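
If it helps to see those keys in context, here is a minimal sketch (assuming PySpark; the app name is illustrative) of the same baseline applied to a SparkSession. In DSS these would normally live in a named Spark configuration instead, and spark.driver.host is deployment-specific, so the IP is just the example value from above:

from pyspark.sql import SparkSession

# Baseline session mirroring the "common" keys above.
spark = (
    SparkSession.builder
    .appName("common-baseline")
    .config("spark.sql.broadcastTimeout", "3600")
    .config("spark.port.maxRetries", "200")
    .config("spark.executor.extraJavaOptions", "-Duser.timezone=GMT")
    .config("spark.driver.extraJavaOptions", "-Duser.timezone=GMT")
    .config("spark.driver.host", "10.192.71.7")  # deployment-specific example
    .config("spark.network.timeout", "500s")
    .getOrCreate()
)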

Default:

spark.executor.memory = 4g
spark.kubernetes.memoryOverheadFactor = 0.4
spark.driver.memory = 2g
spark.executor.instances = 4
spark.executor.cores = 2
spark.sql.shuffle.partitions = 40

Larger config:

spark.executor.memory = 4g
spark.kubernetes.memoryOverheadFactor = 0.4
spark.driver.memory = 2g
spark.executor.instances = 8
spark.executor.cores = 2
spark.sql.shuffle.partitions = 80


High (for jobs that failed with the previous config):

spark.executor.memory = 6g
spark.kubernetes.memoryOverheadFactor = 0.6
spark.driver.memory = 3g
spark.executor.instances = 8
spark.executor.cores = 2
spark.sql.shuffle.partitions = 80
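
As a sanity check on what these profiles request from Kubernetes: each executor pod's memory request is roughly spark.executor.memory * (1 + spark.kubernetes.memoryOverheadFactor). A quick sketch of the arithmetic for the three profiles above (Python; driver pods are extra):

# Rough executor pod request: memory * (1 + overhead factor)
profiles = {
    "default": (4, 0.4, 4),  # (executor GiB, overhead factor, instances)
    "larger":  (4, 0.4, 8),
    "high":    (6, 0.6, 8),
}
for name, (mem_gib, factor, n) in profiles.items():
    per_pod = mem_gib * (1 + factor)
    print(f"{name}: ~{per_pod:.1f} GiB per executor pod, "
          f"~{per_pod * n:.1f} GiB across {n} executors")
# default: ~5.6 GiB per pod, ~22.4 GiB total
# larger:  ~5.6 GiB per pod, ~44.8 GiB total
# high:    ~9.6 GiB per pod, ~76.8 GiB total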

On a per-job basis, you may want to adjust only the following, as needed (see the sketch after this list):
* spark.executor.instances 
* spark.sql.shuffle.partitions
* spark.executor.memory / spark.kubernetes.memoryOverheadFactor where you have OOM errors
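
For example, a hypothetical shuffle-heavy job might keep the "default" profile and bump just those knobs (a sketch, assuming PySpark; the values are illustrative, not a recommendation):

from pyspark.sql import SparkSession

# Override only the per-job knobs; everything else comes from the profile.
spark = (
    SparkSession.builder
    .appName("shuffle-heavy-job")
    .config("spark.executor.instances", "8")
    .config("spark.sql.shuffle.partitions", "80")
    .getOrCreate()
)

# Shuffle partitions can also be adjusted mid-session if needed:
spark.conf.set("spark.sql.shuffle.partitions", "160")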

Other options, like spark.dynamicAllocation.enabled, are typically best left at their default (false); from what I can see, dynamic allocation on Kubernetes is still being worked on:
https://spark.apache.org/docs/latest/running-on-kubernetes.html#future-work
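
That said, if you do want to experiment with it on Spark 3.0+, dynamic allocation can run on Kubernetes without an external shuffle service by enabling shuffle tracking. A hedged sketch of the relevant keys (the executor bounds are illustrative):

spark.dynamicAllocation.enabled = true
spark.dynamicAllocation.shuffleTracking.enabled = true
spark.dynamicAllocation.minExecutors = 2
spark.dynamicAllocation.maxExecutors = 8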



2 Replies
importthepandas
Author

Thank you @AlexT - this is great!
