Create PySpark recipes with the Dataiku API
Hi,
We have a use case where we need to create many PySpark recipes from different inputs and outputs (1000+ recipes in the Flow for the first run), and we want to implement a script that automates their creation.
I checked project.new_recipe() and the Recipes section in the developer guide, but the API documentation doesn't list a pyspark type. Does this API support PySpark recipe creation?
If not, what would be an alternative way to do it?
Best,
Sung-Lin
Answers
-
Alexandru, Dataiker
Hi @SungLinChan,
You can use the API in this case as well. It works the same way as for a Python recipe; you just change the type to "pyspark". For example:
https://developer.dataiku.com/latest/api-reference/python/recipes.html

import dataiku

client = dataiku.api_client()
project = client.get_project(dataiku.default_project_key())

builder = project.new_recipe("pyspark")

# Set the input
builder.with_input("sales")

# Create a new managed dataset for the output in the filesystem_managed connection
builder.with_new_output_dataset("sales_pyspark_2", "filesystem_managed")

# Set the code - the builder is a code recipe creator and has a ``with_script`` method
builder.with_script("""
# -*- coding: utf-8 -*-
import dataiku
from dataiku import spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

# Read recipe inputs
sales = dataiku.Dataset("sales")
sales_df = dkuspark.get_dataframe(sqlContext, sales)

# Compute recipe outputs from inputs
# TODO: Replace this part by your actual code that computes the output, as a SparkSQL dataframe
sales_pyspark_2_df = sales_df  # For this sample code, simply copy input to output

# Write recipe outputs
sales_pyspark_2 = dataiku.Dataset("sales_pyspark_2")
dkuspark.write_with_schema(sales_pyspark_2, sales_pyspark_2_df)
""")

recipe = builder.create()
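For the original use case of creating 1000+ recipes, the same builder pattern can simply be driven by a loop. Below is a minimal, untested sketch; the (input, output) dataset pairs and the connection name are hypothetical placeholders to replace with your own.

import dataiku

client = dataiku.api_client()
project = client.get_project(dataiku.default_project_key())

# Hypothetical list of (input dataset, output dataset) pairs driving the automation;
# in practice this could be loaded from a CSV file or a configuration dataset
recipe_specs = [
    ("sales", "sales_pyspark"),
    ("orders", "orders_pyspark"),
]

for input_name, output_name in recipe_specs:
    builder = project.new_recipe("pyspark")
    builder.with_input(input_name)
    builder.with_new_output_dataset(output_name, "filesystem_managed")
    # Generate the recipe code from a template, substituting the dataset names
    builder.with_script(f"""
# -*- coding: utf-8 -*-
import dataiku
from dataiku import spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

# Read recipe inputs
input_dataset = dataiku.Dataset("{input_name}")
input_df = dkuspark.get_dataframe(sqlContext, input_dataset)

# TODO: replace this with the actual transformation
output_df = input_df

# Write recipe outputs
output_dataset = dataiku.Dataset("{output_name}")
dkuspark.write_with_schema(output_dataset, output_df)
""")
    recipe = builder.create()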
-
Hi @AlexT
Thanks for the snippet, it's clear. One follow-up question: if I would like to set the Spark configuration and the Python environment by name when I create this recipe, how can I do it?
Best,
Sung-Lin
-
Alexandru, Dataiker
Hi @SungLinChan,
https://developer.dataiku.com/latest/concepts-and-examples/recipes.html#setting-the-code-env-of-a-code-recipe

import dataiku

# Change the code env
client = dataiku.api_client()
project = client.get_project(dataiku.default_project_key())
recipe = project.get_recipe("compute_sales_pyspark")
settings = recipe.get_settings()
# Use this to set the recipe to use a specific code env
settings.set_code_env(code_env="datascience")
print(settings)
settings.save()

#### Spark config
import dataiku

# Change the Spark config
client = dataiku.api_client()
project = client.get_project(dataiku.default_project_key())
recipe = project.get_recipe("compute_sales_pyspark")
settings = recipe.get_settings()
settings.get_recipe_raw_definition()['sparkConfig'] = {'inheritConf': 'spark-S-3-workers-of-1-CPU-3Gb-Ram', 'conf': []}
settings.save()
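If these settings should be applied during the bulk creation rather than afterwards, they can be set on each recipe right after builder.create(). A minimal sketch with a hypothetical helper; the code env and Spark config names are placeholders for ones that exist on your instance.

def apply_recipe_settings(recipe, code_env_name, spark_config_name):
    # Hypothetical helper: apply a code env and a named Spark config to a DSSRecipe
    settings = recipe.get_settings()
    settings.set_code_env(code_env=code_env_name)
    settings.get_recipe_raw_definition()["sparkConfig"] = {
        "inheritConf": spark_config_name,
        "conf": [],
    }
    settings.save()

# Usage inside the creation loop, right after: recipe = builder.create()
# apply_recipe_settings(recipe, "datascience", "spark-S-3-workers-of-1-CPU-3Gb-Ram")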
-
Thanks for the reply. I managed to write the script. However, I set the Spark profile through the get_recipe_params() method, not get_recipe_raw_definition() (see the sketch below).
Best,
Sung-Lin
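For completeness, here is a minimal, untested sketch of the approach mentioned in the last reply, i.e. setting the Spark profile through get_recipe_params(). It assumes the Spark settings sit under a 'sparkConfig' key of the params dict; inspect settings.get_recipe_params() on your own instance to confirm the exact structure, and treat the recipe and config names below as placeholders.

import dataiku

client = dataiku.api_client()
project = client.get_project(dataiku.default_project_key())
recipe = project.get_recipe("compute_sales_pyspark")

settings = recipe.get_settings()
params = settings.get_recipe_params()
# Assumption: the Spark settings live under the 'sparkConfig' key of the params dict
params["sparkConfig"] = {
    "inheritConf": "spark-S-3-workers-of-1-CPU-3Gb-Ram",  # Spark config name (placeholder)
    "conf": [],
}
settings.save()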