Spark settings of the running recipe

NikolayK (Partner, Registered Posts: 14)

I would like to change a Spark recipe's global settings in DSS 9, following the approach proposed in How to programmatically set the Spark Config.

However, I'd like to do it from inside a running recipe. How do I get a reference to it in Python without having to hardcode the project and recipe keys, as suggested in the referenced discussion?

import dataiku

client = dataiku.api_client()
project = client.get_project('PROJECT_KEY')
recipe = project.get_recipe('RECIPE_KEY')

settings = recipe.get_settings()

Best Answer

  • Sarina (Dataiker, Dataiku DSS Core Designer, Dataiku DSS Adv Designer, Registered Posts: 317)
    edited July 17 · Answer ✓

    Hi @NikolayK,

    Interesting, so you will just set the partitions for all future executions of the recipe?

    If so, I think that is fine, as long as you know that the current run will never be influenced by the recipe modification. In that case, you can do something like the following. For the project, you can use client.get_default_project() to get the current project:

    import dataiku
    from dataiku import pandasutils as pdu
    from dataiku import spark as dkuspark
    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    import pandas as pd

    client = dataiku.api_client()
    project = client.get_default_project()

    sc = SparkContext.getOrCreate()
    sqlContext = SQLContext(sc)


    And then for the current recipe, you can either use the os package and parse the recipe name from your os.environ, or you can parse the recipe name from the sc.appName field. Here's an example of the latter:

    # sc.appName looks like "DSS (Py): recipe_activity", e.g. "DSS (Py): compute_mydataset_NP"
    # take the part after the colon, strip whitespace, and drop the trailing activity suffix
    recipe_name = '_'.join(sc.appName.split(':')[1].strip().split('_')[:-1])
    recipe = project.get_recipe(recipe_name)
    settings = recipe.get_settings()
    new_conf = [{'key': 'spark.sql.shuffle.partitions', 'value': '22'}]

    # overwrite the recipe-level Spark configuration for future runs
    settings.recipe_settings['params']['sparkConfig']['conf'] = new_conf
    settings.save()
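
    If you want to double-check the result, you can read the settings back afterwards; this simply re-reads the same structure that was set above:

    # re-fetch the settings and print the stored Spark configuration
    print(recipe.get_settings().recipe_settings['params']['sparkConfig']['conf'])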


    Or using the os environ:

    import os
    import json

    # DKU_CUSTOM_VARIABLES exposes job-level variables as a JSON string
    custom_vars = json.loads(os.environ['DKU_CUSTOM_VARIABLES'])
    recipe_name = custom_vars['recipename']
    recipe = project.get_recipe(recipe_name)

    settings = recipe.get_settings()
    new_conf = [{'key': 'spark.sql.shuffle.partitions', 'value': '22'}]

    # update our recipe configuration programmatically
    settings.recipe_settings['params']['sparkConfig']['conf'] = new_conf
    settings.save()
    


    Let me know if this doesn't work for you!

    Thank you,
    Sarina

Answers

  • Sarina (Dataiker, Dataiku DSS Core Designer, Dataiku DSS Adv Designer, Registered Posts: 317)

    Hi @NikolayK,

    It is not possible to alter the Spark settings of a recipe that is already running. The Spark configuration set in the Advanced tab of the recipe is the configuration that will be used once the recipe is submitted as a job, and it cannot be altered after the recipe is triggered. The programmatic approach I outlined will only work if you are programmatically creating recipes and wish to add a Spark configuration to the recipe via code prior to triggering the recipe.
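
    For example, if you are creating or configuring the recipe via code before running it, the flow would look something like this (just a sketch: the recipe and output dataset names are placeholders, and dataset.build() assumes a recent dataikuapi version):

    import dataiku

    client = dataiku.api_client()
    project = client.get_default_project()

    # placeholder recipe name for this sketch
    recipe = project.get_recipe('compute_mydataset')
    settings = recipe.get_settings()
    settings.recipe_settings['params']['sparkConfig']['conf'] = [
        {'key': 'spark.sql.shuffle.partitions', 'value': '22'}
    ]
    settings.save()

    # only now trigger the recipe, so the job picks up the saved configuration
    # (placeholder output dataset name; build() is available in recent dataikuapi versions)
    project.get_dataset('mydataset').build()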

    Is there something specific you are looking to achieve by changing the Spark configuration of a recipe that is already running? If you provide some additional details about your workflow, we can see if there is another option that might make more sense.

    Thank you,
    Sarina

  • NikolayK (Partner, Registered Posts: 14)

    Hi @SarinaS,

    Thank you for your reply. Indeed, I realize it's not possible to change the settings for the ongoing execution; rather, I want to change them for future executions.

    Specifically, I want to adjust spark.sql.shuffle.partitions depending on the number of executors available and I don't want my users to need to know about this. Can you recommend anything in that direction?
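
    For illustration, the kind of sizing logic I have in mind looks roughly like this (the settings read here and the factor of 2 are just assumptions):

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    conf = sc.getConf()

    # fall back to defaults if the settings are absent; the defaults here are assumptions
    num_executors = int(conf.get('spark.executor.instances', '2'))
    cores_per_executor = int(conf.get('spark.executor.cores', '1'))

    # heuristic: a small multiple of the total core count
    shuffle_partitions = num_executors * cores_per_executor * 2
    new_conf = [{'key': 'spark.sql.shuffle.partitions', 'value': str(shuffle_partitions)}]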

  • NikolayK (Partner, Registered Posts: 14)

    Hi @SarinaS,

    Thank you for your help! I very much prefer the second solution (using os), as it's much more readable. Unfortunately, it didn't work for me: there is no recipename variable defined. The only fields present in DKU_CUSTOM_VARIABLES were jobId, activityId, projectKey, dip.home, and jobProjectKey. I guess I can parse the name from activityId, though I heard there's a better way in DSS 9.
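
    In case it helps others, parsing it from activityId could look like this (a sketch, assuming activityId has the form recipe_activity, e.g. compute_mydataset_NP, matching the appName parsing above):

    import os
    import json

    custom_vars = json.loads(os.environ['DKU_CUSTOM_VARIABLES'])
    # assumes activityId is "<recipe_name>_<activity>", e.g. "compute_mydataset_NP"
    recipe_name = custom_vars['activityId'].rsplit('_', 1)[0]
    recipe = project.get_recipe(recipe_name)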

    However, the other solution did the trick; I am now able to modify the settings.

    Best regards,

    Nikolay.
