How to programmatically set the Spark Config

JH

In a PySpark script you can specify a Spark config like this:

spark = SparkSession.builder\
                        .config("spark.executor.cores", "3")\
                        .config("spark.driver.memory", "8g")\
                        .config("spark.cores.max", "15")\
                        .getOrCreate()

I can see the configuration I have set by viewing:

sc = spark.sparkContext
confs = sc.getConf().getAll()
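
For example, a quick filter over that list (a minimal sketch reusing the sc and confs from above) makes the values in question easy to spot:

# print only the Spark properties set in the builder above
for key, value in confs:
    if key in ("spark.executor.cores", "spark.driver.memory", "spark.cores.max"):
        print(key, "=", value)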

However, when I view the job in the Spark UI, I can see that the default configuration is being used and my parameters have been overwritten.

I assume this is because the parameters in the Dataiku Advanced tab are taking precedence. However, I am using the programmatic API to create these recipes, and therefore cannot manually set the Spark config in the Advanced tab (because this negates the benefit of the programmatic API).

Is there a way to use the programmatic API to either set the Spark config or to tell the recipe to ignore the default parameters in favour of the parameters in the recipe?

Answers

  • Sarina (Dataiker)

    Hi @JH,

    Indeed, when you use a PySpark recipe, the spark-submit configuration comes from the Spark configuration associated with the recipe, not from configuration parameters set inside the recipe code.

    I think the easiest approach would be one of the following:

    • Define a handful of named Spark configurations that cover most of your use cases, and dynamically assign the right one to each recipe from the Python API.
    • If there are too many permutations of configuration options for that to be practical, you can also dynamically set a recipe's "override configuration" parameters from the Python API.

    Either approach should be possible from the API. I'll provide an example using the second approach; the same steps can be applied programmatically for the first approach as well (see the sketch at the end of this answer).

    I have a Python recipe "compute_pyspark" with the following spark configuration settings:

    [Screenshot: the recipe's Spark configuration settings in the Advanced tab]
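
    In case the screenshot does not come through, the same settings can also be read back through the API. This is a minimal sketch that only inspects the raw recipe_settings dict modified in the snippet below; 'PROJECT_KEY' is a placeholder:

    import dataiku

    client = dataiku.api_client()
    project = client.get_project('PROJECT_KEY')
    recipe = project.get_recipe('compute_pyspark')

    settings = recipe.get_settings()
    spark_config = settings.recipe_settings['params']['sparkConfig']
    # name of the inherited Spark configuration and the per-recipe overrides
    print(spark_config.get('inheritConf'))
    print(spark_config.get('conf'))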

    In my Python code that will eventually trigger this recipe, I can modify the "override configurations" to add/change the configuration values:

    import dataiku

    client = dataiku.api_client()
    project = client.get_project('PROJECT_KEY')
    recipe = project.get_recipe('RECIPE_KEY')

    settings = recipe.get_settings()
    # a list of dicts: 'key' -> Spark property name, 'value' -> property value
    new_conf = [{'key': 'spark.sql.shuffle.partitions', 'value': '22'}]

    # update the recipe's override configuration programmatically
    settings.recipe_settings['params']['sparkConfig']['conf'] = new_conf
    settings.save()


    Now if I return to my recipe, I'll see that the Spark configuration has been updated, and when the recipe is triggered from my Python code, the new configuration settings will be applied:

    [Screenshot: the recipe's override configuration now includes spark.sql.shuffle.partitions = 22]
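
    If you'd rather verify from code than from the UI, re-reading the settings after the save shows the updated override (continuing from the variables above):

    # re-fetch the settings to confirm the override was persisted
    settings = recipe.get_settings()
    print(settings.recipe_settings['params']['sparkConfig']['conf'])
    # expected: [{'key': 'spark.sql.shuffle.partitions', 'value': '22'}]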

    If you wanted to switch the recipe to a different named Spark configuration altogether, rather than overriding individual keys, you can set the inheritConf parameter in the same way before calling settings.save():

    settings = recipe.get_settings()
    # change our config from default to sample-local-config
    settings.recipe_settings['params']['sparkConfig']['inheritConf'] = 'sample-local-config'
    settings.save()
    

    Now the recipe is using the new configuration:

    [Screenshot: the recipe now inherits the sample-local-config Spark configuration]
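
    Putting that together, the first option from the list above boils down to choosing which named Spark configuration to assign to each recipe. Here is a small sketch; the configuration names are hypothetical and would need to match Spark configurations defined on your instance:

    # hypothetical named Spark configurations defined by an administrator
    SMALL_JOB_CONFIG = 'sample-local-config'
    LARGE_JOB_CONFIG = 'large-cluster-config'

    def assign_spark_config(project, recipe_name, config_name):
        """Point a recipe at a named Spark configuration."""
        settings = project.get_recipe(recipe_name).get_settings()
        settings.recipe_settings['params']['sparkConfig']['inheritConf'] = config_name
        settings.save()

    # for example, route a heavy recipe to the larger configuration
    assign_spark_config(project, 'compute_pyspark', LARGE_JOB_CONFIG)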

    Let me know if you have any questions about these options!

    Thanks,
    Sarina 
