
Spark settings of the running recipe

Solved!
NikolayK
Level 1

I would like to change the global Spark settings of a recipe in DSS 9, following the approach proposed in "How to programmatically set the Spark Config".

However, I'd like to do it from inside a running recipe. How do I get a reference to it in Python without having to hardcode the project and recipe keys like suggested in the referenced discussion?

project = client.get_project('PROJECT_KEY')
recipe = project.get_recipe('RECIPE_KEY')

settings = recipe.get_settings()

1 Solution
SarinaS
Dataiker

Hi @NikolayK,

Interesting, so you will just set the partitions for all future executions of the recipe? 

If so, I think that is OK, as long as you are aware that the run in progress is never affected by the recipe modification. In that case, you can do something like the following. For the project, use client.get_default_project() to get the current project:

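import dataiku
from dataiku import pandasutils as pdu
from dataiku import spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext  # needed for SQLContext(sc) below
import pandas as pd

client = dataiku.api_client()
# Resolve the project the recipe runs in, without hardcoding its key
project = client.get_default_project()

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)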

 
And then for the current recipe, you can either use the os package and parse the recipe name from your os.environ, or you can parse the recipe name from the sc.appName field. Here's an example of the latter:

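# sc.appName looks like "DSS (Py): recipe_activity"
# strip() drops the space after the colon; [:-1] drops the trailing activity token
recipe_name = '_'.join(sc.appName.split(':')[1].strip().split('_')[:-1])
recipe = project.get_recipe(recipe_name)
settings = recipe.get_settings()
new_conf = [{'key': 'spark.sql.shuffle.partitions', 'value': '22'}]

# Overwrite the recipe's Spark configuration overrides; this affects future runs only
settings.recipe_settings['params']['sparkConfig']['conf'] = new_conf
settings.save()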
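
The name parsing can also be pulled out into a small helper that fails loudly when the application name does not have the expected shape. This is only a sketch: the function name recipe_name_from_app_name and the sample application name are hypothetical, and the "DSS (Py): <recipe>_<suffix>" format is assumed from the comment above rather than guaranteed.

```python
def recipe_name_from_app_name(app_name):
    """Extract the recipe name from a Spark application name.

    Assumes the 'DSS (Py): <recipe>_<suffix>' shape mentioned in the
    thread; the exact format is an assumption, not a documented contract.
    """
    # Keep everything after the first colon and drop surrounding whitespace.
    _, _, tail = app_name.partition(':')
    tail = tail.strip()
    # Drop the last underscore-separated token (the activity suffix).
    name, sep, _ = tail.rpartition('_')
    if not sep:
        raise ValueError('unexpected appName format: %r' % (app_name,))
    return name


# Hypothetical application name, for illustration only.
print(recipe_name_from_app_name('DSS (Py): compute_sales_NP'))
```

Inside a recipe this would be called as recipe_name_from_app_name(sc.appName).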


Or, reading the recipe name from os.environ:

import os 
import json 

custom_vars = json.loads(os.environ['DKU_CUSTOM_VARIABLES'])
recipe_name = custom_vars['recipename']  
recipe = project.get_recipe(recipe_name)

settings = recipe.get_settings()
new_conf = [{'key': 'spark.sql.shuffle.partitions', 'value': '22'}]

# update our recipe configuration programmatically
settings.recipe_settings['params']['sparkConfig']['conf'] = new_conf
settings.save()
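
Note that assigning new_conf replaces the whole list of overrides. If the recipe may already carry other Spark overrides, a merge along the following lines might be preferable. This is a sketch: upsert_conf is a hypothetical helper, and it assumes only the list-of-{'key', 'value'} structure used in the snippet above.

```python
def upsert_conf(conf, key, value):
    """Return a new conf list with `key` set to `value`, keeping other entries."""
    # Copy every entry except an existing one for the same key.
    updated = [dict(entry) for entry in conf if entry.get('key') != key]
    # Values are stored as strings, matching the example in the thread.
    updated.append({'key': key, 'value': str(value)})
    return updated


# Hypothetical existing overrides, for illustration only.
existing = [
    {'key': 'spark.executor.memory', 'value': '4g'},
    {'key': 'spark.sql.shuffle.partitions', 'value': '200'},
]
new_conf = upsert_conf(existing, 'spark.sql.shuffle.partitions', 22)
print(new_conf)
```

The result could then be assigned to settings.recipe_settings['params']['sparkConfig']['conf'] before calling settings.save().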


Let me know if this doesn't work for you!

Thank you,
Sarina


4 Replies
SarinaS
Dataiker

Hi @NikolayK,

It is not possible to alter the Spark settings of a recipe that is already running. The Spark configuration set in the recipe's Advanced tab is the configuration that will be used once the recipe is submitted as a job, and it cannot be altered after the recipe is triggered. The programmatic approach I outlined only works if you are creating recipes programmatically and want to add a Spark configuration via code before triggering the recipe.

Is there something specific you are looking to achieve by changing the Spark configuration of a Spark recipe that is already running? If you provide some additional details on your workflow, we can see if there is another option that might make more sense.

Thank you,
Sarina

NikolayK
Level 1
Author

Hi @SarinaS ,

Thank you for your reply. Indeed, I realize it's not possible to change the settings for the ongoing execution; rather, I want to change them for future executions.

Specifically, I want to adjust spark.sql.shuffle.partitions depending on the number of executors available and I don't want my users to need to know about this. Can you recommend anything in that direction?
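
To make the goal concrete, here is a rough sketch of the kind of sizing described here. The helper shuffle_partitions, the tasks_per_core multiplier, and the fallback defaults are all assumptions chosen for illustration, not a recommendation from the thread; spark.executor.instances and spark.executor.cores are standard Spark properties, but either may be unset (for example with dynamic allocation).

```python
def shuffle_partitions(conf, tasks_per_core=2, default_instances=2, default_cores=1):
    """Derive a shuffle partition count from executor settings.

    `conf` is any mapping of Spark property names to values.
    The multiplier and the defaults are illustrative assumptions.
    """
    instances = int(conf.get('spark.executor.instances', default_instances))
    cores = int(conf.get('spark.executor.cores', default_cores))
    # Never return fewer than one partition.
    return max(1, instances * cores * tasks_per_core)


# Hypothetical cluster settings: 5 executors with 4 cores each.
conf = {'spark.executor.instances': '5', 'spark.executor.cores': '4'}
print(shuffle_partitions(conf))  # 40
```

In a PySpark recipe the mapping could be built with dict(sc.getConf().getAll()).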

NikolayK
Level 1
Author

Hi @SarinaS ,

Thank you for your help! I very much prefer the second solution (using os), as it's much more readable. Unfortunately, it didn't work for me: there is no recipename variable defined. These are the only fields that were in DKU_CUSTOM_VARIABLES: jobId, activityId, projectKey, dip.home, jobProjectKey. I guess I can parse the name from activityId, though I heard there's a better way in DSS 9.

However, the other solution did the trick, and I am now able to modify the settings.
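
A sketch of the activityId idea, for reference. It assumes the activity ID has the form "<recipe name>_<suffix>", like the application name in the other solution; that shape is an assumption and is not confirmed anywhere in this thread. The helper name and the sample values are hypothetical.

```python
import json


def recipe_name_from_activity(env):
    """Derive the recipe name from activityId in DKU_CUSTOM_VARIABLES.

    `env` is a mapping such as os.environ. The '<recipe>_<suffix>' shape
    of activityId is an assumption made for this sketch.
    """
    custom_vars = json.loads(env['DKU_CUSTOM_VARIABLES'])
    activity_id = custom_vars['activityId']
    # Drop the last underscore-separated token.
    name, sep, _ = activity_id.rpartition('_')
    if not sep:
        raise ValueError('unexpected activityId format: %r' % (activity_id,))
    return name


# Hypothetical environment, for illustration only.
fake_env = {
    'DKU_CUSTOM_VARIABLES': json.dumps({
        'jobId': 'job_1',
        'activityId': 'compute_sales_NP',
        'projectKey': 'MYPROJECT',
    })
}
print(recipe_name_from_activity(fake_env))
```

Inside a recipe, os.environ would be passed in place of fake_env.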

Best regards,

Nikolay.
