
How to programmatically set the Spark Config

JH
Level 1

In a PySpark script you can specify a Spark config like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder\
                        .config("spark.executor.cores", "3")\
                        .config("spark.driver.memory", "8g")\
                        .config("spark.cores.max", "15")\
                        .getOrCreate()

 

I can see the configuration I have set by viewing:

sc = spark.sparkContext
confs = sc.getConf().getAll()

However, when I view the job in the Spark UI, I can see that the default configuration is being used and my parameters have been overwritten.

 

I assume this is because the parameters in the Dataiku Advanced tab take precedence. However, I create these recipes through the programmatic API, so setting the Spark config manually in the Advanced tab would negate the benefit of using that API.

 

Is there a way to use the programmatic API to either set the Spark config or to tell the recipe to ignore the default parameters in favour of the parameters in the recipe?

SarinaS
Dataiker

Hi @JH,

Indeed, for a PySpark recipe the spark-submit configuration comes from the Spark configuration associated with the recipe, not from configuration parameters set in the recipe code.

I think the easiest approach would be to either:

  • keep a small number of Spark configs that cover most of your use cases, and dynamically select the right Spark config for each recipe from the Python API, or
  • if there are too many permutations of configuration options for that to be practical, dynamically set the "override configuration" parameters of each recipe from the Python API.

Either approach should be possible from the API. I'll provide an example using the second approach, and similar steps could be taken programmatically to use the first approach as well. 

I have a Python recipe "compute_pyspark" with the following spark configuration settings:

[Screenshot: the recipe's Spark configuration settings]

In my Python code that will eventually trigger this recipe, I can modify the "override configurations" to add/change the configuration values:

import dataiku

client = dataiku.api_client()
project = client.get_project('PROJECT_KEY')
recipe = project.get_recipe('RECIPE_KEY')  # the recipe's name, e.g. 'compute_pyspark'

settings = recipe.get_settings()
# a list of dicts: 'key' -> Spark property name, 'value' -> its value as a string
new_conf = [{'key': 'spark.sql.shuffle.partitions', 'value': '22'}]

# replace the recipe's "override configuration" entries and persist the change
settings.recipe_settings['params']['sparkConfig']['conf'] = new_conf
settings.save()


Now if I return to my recipe, we'll see that the Spark configuration has been updated, and if the recipe is triggered from my Python code, the new configuration settings will be reflected:

[Screenshot: the recipe's Spark configuration with the updated override]
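
Note that assigning a new list to 'conf' replaces whatever overrides the recipe already had. If you would rather add or change individual keys and keep the rest, a small helper along these lines might work. This is only a sketch: merge_spark_overrides is a hypothetical name, not part of the Dataiku API, and it assumes the 'conf' list has the 'key'/'value' shape shown above.

```python
def merge_spark_overrides(existing, updates):
    """Return a new override list: `existing` entries with `updates` applied.

    existing: list of {'key': ..., 'value': ...} dicts (the recipe's 'conf' list)
    updates:  plain dict mapping Spark property name -> value
    """
    # Index the current overrides by key so updates can replace them in place.
    merged = {entry['key']: entry['value'] for entry in existing}
    # Values are stored as strings, matching the format used in the recipe settings.
    merged.update({key: str(value) for key, value in updates.items()})
    return [{'key': key, 'value': value} for key, value in merged.items()]


# Example with a hand-written list standing in for the recipe's current overrides.
current = [{'key': 'spark.executor.cores', 'value': '3'}]
new_conf = merge_spark_overrides(current, {'spark.sql.shuffle.partitions': 22})
print(new_conf)
```

In a real recipe you would read the existing list from settings.recipe_settings['params']['sparkConfig']['conf'], pass it through the helper, assign the result back, and then call settings.save().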

If you want to switch the recipe to a different named Spark configuration instead, set 'inheritConf' before calling settings.save():

settings = recipe.get_settings()
# change our config from default to sample-local-config
settings.recipe_settings['params']['sparkConfig']['inheritConf'] = 'sample-local-config'
settings.save()

 

Now the recipe is using the new configuration:

[Screenshot: the recipe using the sample-local-config configuration]
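
For the first approach (a handful of named Spark configs chosen per recipe), the selection logic could be sketched roughly as below. The size thresholds and the names 'medium-cluster-config' and 'large-cluster-config' are hypothetical placeholders for configs you would define yourself, and both function names are made up for illustration. The example runs against a plain dict shaped like the recipe settings used above rather than a live recipe.

```python
def pick_spark_config(input_size_gb):
    """Map a rough input size to a named Spark configuration.

    The thresholds and the two cluster config names are hypothetical.
    """
    if input_size_gb < 1:
        return 'sample-local-config'
    if input_size_gb < 50:
        return 'medium-cluster-config'  # hypothetical config name
    return 'large-cluster-config'       # hypothetical config name


def set_named_config(recipe_settings, config_name):
    """Point a raw recipe settings dict at a named Spark configuration."""
    spark_config = recipe_settings.setdefault('params', {}).setdefault('sparkConfig', {})
    spark_config['inheritConf'] = config_name
    return recipe_settings


# A plain dict standing in for settings.recipe_settings.
raw = {'params': {'sparkConfig': {'inheritConf': 'default', 'conf': []}}}
set_named_config(raw, pick_spark_config(input_size_gb=12))
print(raw['params']['sparkConfig']['inheritConf'])
```

Against a real recipe, you would pass settings.recipe_settings to set_named_config and then call settings.save(), as in the snippet above.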

Let me know if you have any questions about these options!

Thanks,
Sarina 
