Best Practices For Updating and Renaming Spark and Container Configurations

kathyqingyuxu · January 2023

Hello Dataiku Community,

Hope all is well!

Our team is looking to implement new Spark and container configuration settings on our instances, and we would like to understand the best practices for updating the existing configurations. For context, we have existing Spark configurations that end users already rely on; however, we would like to replace them with new settings and new naming conventions.

As a test, we created a new Spark configuration on one of our dedicated "Dev" instances (a design node) and checked what would happen if we renamed it. We saw that after the rename, every recipe that had been explicitly set to this configuration reverted to "Nothing Selected". Please see example files for before and after pictures. Our before config was named "Large_9GBMem_11Exec_new" and we updated the name of the config to "Large_9GBMem_11Exec". In the after picture, the selection is set to "Nothing Selected". Is there a way to have the selection follow the new name, "Large_9GBMem_11Exec" in this example, or is this behavior expected?

I found the documentation and community discussions below on checking and updating Spark settings systematically via Python script. I wanted to confirm whether using the API to update the Spark configs is the best practice, or whether there is another way to update the configs and config names automatically through the UI.

Helpful Documentation/Discussions:

1. Spark configurations — Dataiku DSS 11 documentation

2. How to programmatically set the Spark Config - Dataiku Community

3. Api to get spark config of a recipe - Dataiku Community

4. Solved: Spark settings of the running recipe - Dataiku Community

5. Re: How to programmatically set the Spark Config - Dataiku Community

Appreciate the feedback!

Best,

Kathy

importthepandas · February 2023

Hi @kathyqingyuxu, we are in the same scenario as you. We have a few new Spark configs, and using the Python API to get and set the config on SparkSQL and PySpark recipes is simple, akin to something like:

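import dataiku

client = dataiku.api_client()
proj = client.get_project("MY_PROJECT")  # placeholder project key
recipes = proj.list_recipes()

for r in recipes:
    recipe = proj.get_recipe(r['name'])
    sets = recipe.get_settings()

    # Only PySpark recipes are handled in this loop
    if sets.type == 'pyspark':
        current = sets.recipe_settings['params']['sparkConfig']['inheritConf']
        print(r['name'], current)
        # Map each old config name to its replacement
        if current == 'design-spark-rubix-small':
            sets.recipe_settings['params']['sparkConfig']['inheritConf'] = 'design-spark-small'
        elif current == 'design-spark':
            sets.recipe_settings['params']['sparkConfig']['inheritConf'] = 'design-spark-medium'
        sets.save()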

However, for visual recipes (join, prepare, etc.) I've noticed that many don't expose any Spark config metadata in the settings returned by get_settings() or in other areas, even though a Spark config is set for them in the GUI.
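One way to track down where a given recipe type keeps its Spark configuration is to search the settings dicts for the inheritConf key instead of guessing the path. Below is a minimal sketch in plain Python. find_key_paths is a hypothetical helper, not part of the Dataiku API, and the sample dict is invented for illustration. On a live instance, the dicts to search would presumably be sets.recipe_settings and sets.get_json_payload().

```python
def find_key_paths(obj, key, path=()):
    """Yield (path, value) for every occurrence of `key` in nested dicts/lists."""
    if isinstance(obj, dict):
        for k, v in obj.items():
            if k == key:
                yield path + (k,), v
            # Keep descending: the key may also appear deeper in the tree
            yield from find_key_paths(v, key, path + (k,))
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            yield from find_key_paths(v, key, path + (i,))


# Invented sample, loosely shaped like the settings discussed in this thread.
sample = {
    "params": {"sparkConfig": {"inheritConf": "design-spark", "conf": []}},
    "engineParams": {
        "sparkSQL": {"sparkConfig": {"inheritConf": "default", "conf": []}}
    },
}

for key_path, value in find_key_paths(sample, "inheritConf"):
    print(" -> ".join(str(part) for part in key_path), "=", value)
```

Running this against both the recipe settings and the payload of one visual recipe that is known to use Spark should show which of the two holds the config name.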

importthepandas · February 2023

After an answer from Dataiku support (seriously the best support team in the world), here are the steps for visual recipes:

import dataiku

client = dataiku.api_client()
proj = client.get_project("MY_PROJECT")
recipe = proj.get_recipe("MY_RECIPE")

sets = recipe.get_settings()
payload = sets.get_json_payload()

for spark_conf in payload["engineParams"]["sparkSQL"]["sparkConfig"]["conf"]:
    if spark_conf["key"] == "spark.example.foo":
        print("Updating Spark config:", spark_conf)
        spark_conf["value"] = "bar"
        sets.save()
        break

importthepandas · February 2023

Thanks @ZachM!

kathyqingyuxu · April 2023

Thanks for the information @importthepandas!

I modified it slightly on my end and ended up using the following to get what I needed:

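import dataiku

client = dataiku.api_client()
dss_projects = client.list_projects()

for project in dss_projects:
    project_obj = client.get_project(project['projectKey'])
    recipes = project_obj.list_recipes()
    for item in recipes:
        recipe = project_obj.get_recipe(item['name'])
        settings = recipe.get_settings()
        status = recipe.get_status()
        try:
            # Only look at recipes whose selected engine is Spark
            if status.get_selected_engine_details()["type"] == "SPARK":
                spark_settings = settings.get_recipe_params()
                # Report which named Spark config this recipe inherits
                conf = spark_settings["engineParams"]["spark"]["readParam"]["sparkConfig"]["inheritConf"]
                print(project['projectKey'], item['name'], conf)
        except Exception as e:
            # Skip recipes without engine details or with a different settings layout
            print("Skipping", item['name'], e)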

From there I was able to find the setting within spark_settings["engineParams"]["spark"]["readParam"]["sparkConfig"]["inheritConf"].
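Because that path is deeply nested and not every recipe carries every level, a small helper can rewrite the value without raising on missing keys. This is a sketch in plain Python under stated assumptions: rename_inherit_conf and OLD_TO_NEW are hypothetical names, the path is copied from the line above, and the example dict is invented. On a live instance, the dict passed in would presumably be the result of settings.get_recipe_params(), followed by settings.save() whenever the helper returns True.

```python
# Path to the Spark config block, as described in this thread.
SPARK_CONF_PATH = ("engineParams", "spark", "readParam", "sparkConfig")

# Old config name -> new config name (names taken from the first post).
OLD_TO_NEW = {"Large_9GBMem_11Exec_new": "Large_9GBMem_11Exec"}


def rename_inherit_conf(params, mapping, path=SPARK_CONF_PATH):
    """Rewrite inheritConf in place; return True only if a value changed."""
    node = params
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return False  # this recipe does not carry the setting at this path
        node = node[key]
    if not isinstance(node, dict):
        return False
    current = node.get("inheritConf")
    if current in mapping:
        node["inheritConf"] = mapping[current]
        return True
    return False


# Invented example of a recipe's params dict.
example = {
    "engineParams": {
        "spark": {
            "readParam": {
                "sparkConfig": {"inheritConf": "Large_9GBMem_11Exec_new", "conf": []}
            }
        }
    }
}

changed = rename_inherit_conf(example, OLD_TO_NEW)
print(changed, example["engineParams"]["spark"]["readParam"]["sparkConfig"]["inheritConf"])
```

Returning a boolean makes it easy to save only the recipes that were actually modified and to log the rest for a dry run before renaming a configuration in the UI.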

Hope this helps others.

Best,

Kathy
