Hi,
We have a use case where we need to create many PySpark recipes with different inputs and outputs (1000+ recipes for the first pass over the Flow), and we want to implement a script to automate the creation.
I checked project.new_recipe() and the Recipes section of the developer guide, but the API documentation does not list a pyspark type. Does this API support PySpark recipe creation?
If not, what would be an alternative way to do it?
Best,
Sung-Lin
Hi @SungLinChan ,
You can use the API in this case as well. It works much like creating a Python recipe; you just change the type to pyspark. For example:
https://developer.dataiku.com/latest/api-reference/python/recipes.html
import dataiku
client = dataiku.api_client()
project = client.get_project(dataiku.default_project_key())
builder = project.new_recipe("pyspark")
# Set the input
builder.with_input("sales")
# Create a new managed dataset for the output in the filesystem_managed connection
builder.with_new_output_dataset("sales_pyspark_2", "filesystem_managed")
# Set the code - the builder returned by new_recipe("pyspark") has a ``with_script`` method
builder.with_script("""
# -*- coding: utf-8 -*-
import dataiku
from dataiku import spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
# Read recipe inputs
sales = dataiku.Dataset("sales")
sales_df = dkuspark.get_dataframe(sqlContext, sales)
# Compute recipe outputs from inputs
# TODO: Replace this part by your actual code that computes the output, as a SparkSQL dataframe
sales_pyspark_2_df = sales_df # For this sample code, simply copy input to output
# Write recipe outputs (the dataset name must match the output created above)
sales_pyspark_2 = dataiku.Dataset("sales_pyspark_2")
dkuspark.write_with_schema(sales_pyspark_2, sales_pyspark_2_df)
""")
recipe = builder.create()
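Since the original question is about automating 1000+ recipes, the snippet above can be wrapped in a loop. Below is a minimal sketch of that idea: a pure-Python template function renders the recipe body for each input/output pair, and a helper drives the recipe builder. The dataset names, the create_recipes/build_script helper names, and the filesystem_managed connection are illustrative assumptions, not part of the original answer.

```python
# Sketch: batch-create PySpark recipes from (input, output) pairs.
# Helper names and dataset names are placeholders for illustration.

PYSPARK_TEMPLATE = """\
# -*- coding: utf-8 -*-
import dataiku
from dataiku import spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

# Read recipe inputs
{inp}_df = dkuspark.get_dataframe(sqlContext, dataiku.Dataset("{inp}"))

# TODO: replace with the real transformation logic
{out}_df = {inp}_df

# Write recipe outputs
dkuspark.write_with_schema(dataiku.Dataset("{out}"), {out}_df)
"""

def build_script(inp, out):
    """Render the PySpark recipe body for one input/output pair."""
    return PYSPARK_TEMPLATE.format(inp=inp, out=out)

def create_recipes(project, pairs, connection="filesystem_managed"):
    """Create one PySpark recipe per (input, output) pair."""
    for inp, out in pairs:
        builder = project.new_recipe("pyspark")
        builder.with_input(inp)
        builder.with_new_output_dataset(out, connection)
        builder.with_script(build_script(inp, out))
        builder.create()

# Inside DSS you would run something like:
# import dataiku
# client = dataiku.api_client()
# project = client.get_project(dataiku.default_project_key())
# create_recipes(project, [("sales", "sales_pyspark"), ("orders", "orders_pyspark")])
```

Note the template assumes dataset names are valid Python identifiers (they are used as variable names inside the generated script); sanitize them first if that is not guaranteed.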
Hi @AlexT
Thanks for the snippet, it's clear. One follow-up question: if I want to set the Spark configuration and the Python environment by name when I create the recipe, how can I do that?
Best,
Sung-Lin
Thanks for the reply. I managed to write the script. However, I set the Spark profile through the get_recipe_params() method, not get_recipe_raw_definition().
Best,
Sung-Lin
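For readers following along, here is a minimal sketch of the variant Sung-Lin describes: setting the Spark profile through get_recipe_params() rather than get_recipe_raw_definition(). It assumes the params dict exposes the same sparkConfig structure as the raw definition shown later in the thread; the helper names and the profile name are placeholders.

```python
# Sketch: set a recipe's Spark profile via get_recipe_params().
# Helper names are illustrative; the profile name is taken from the
# snippet elsewhere in this thread - substitute your own.

def spark_profile_config(profile_name):
    """Build the sparkConfig payload that inherits a named Spark profile."""
    return {"inheritConf": profile_name, "conf": []}

def set_spark_profile(recipe, profile_name):
    """Point a recipe's params at a named Spark configuration profile."""
    settings = recipe.get_settings()
    settings.get_recipe_params()["sparkConfig"] = spark_profile_config(profile_name)
    settings.save()

# Inside DSS:
# import dataiku
# client = dataiku.api_client()
# project = client.get_project(dataiku.default_project_key())
# set_spark_profile(project.get_recipe("compute_sales_pyspark"),
#                   "spark-S-3-workers-of-1-CPU-3Gb-Ram")
```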
Hi @SungLinChan ,
https://developer.dataiku.com/latest/concepts-and-examples/recipes.html#setting-the-code-env-of-a-co...
import dataiku
#change code env
client = dataiku.api_client()
project = client.get_project(dataiku.default_project_key())
recipe = project.get_recipe("compute_sales_pyspark")
settings = recipe.get_settings()
# Use this to set the recipe to use a specific code env
settings.set_code_env(code_env="datascience")
print(settings)
settings.save()
#### Spark config
import dataiku
# change the Spark configuration
client = dataiku.api_client()
project = client.get_project(dataiku.default_project_key())
recipe = project.get_recipe("compute_sales_pyspark")
settings = recipe.get_settings()
settings.get_recipe_raw_definition()['sparkConfig'] = {'inheritConf': 'spark-S-3-workers-of-1-CPU-3Gb-Ram', 'conf': []}
settings.save()
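Putting the two snippets above together for the batch use case, the sketch below applies a code env and a Spark profile to every recipe in a list. The apply_spark_profile/configure_recipes helper names are assumptions for illustration; the mutation of the raw-definition dict mirrors the snippet above.

```python
# Sketch: apply a code env and a Spark profile to many recipes at once.
# Helper names, the code env name, and the profile name are placeholders.

def apply_spark_profile(raw_definition, profile_name):
    """Mutate a recipe raw-definition dict to inherit a named Spark profile."""
    raw_definition["sparkConfig"] = {"inheritConf": profile_name, "conf": []}
    return raw_definition

def configure_recipes(project, recipe_names, code_env, spark_profile):
    """Set the code env and Spark profile on each named recipe."""
    for name in recipe_names:
        settings = project.get_recipe(name).get_settings()
        settings.set_code_env(code_env=code_env)
        apply_spark_profile(settings.get_recipe_raw_definition(), spark_profile)
        settings.save()

# Inside DSS:
# import dataiku
# client = dataiku.api_client()
# project = client.get_project(dataiku.default_project_key())
# configure_recipes(project, ["compute_sales_pyspark"],
#                   "datascience", "spark-S-3-workers-of-1-CPU-3Gb-Ram")
```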