Hi,
We have a use case where we need to create many PySpark recipes with different inputs and outputs (1000+ recipes for the first pass over the Flow), and we want to implement a script to automate the creation.
I checked project.new_recipe() and the Recipes section of the developer guide, but the API documentation does not list a pyspark type. Does this API support PySpark recipe creation?
If not, what would be an alternative way to do it?
Best,
Sung-Lin
Hi @SungLinChan ,
You can use the API in this case as well. It works much like creating a Python recipe; you just change the type to pyspark. For example:
https://developer.dataiku.com/latest/api-reference/python/recipes.html
import dataiku
client = dataiku.api_client()
project = client.get_project(dataiku.default_project_key())
builder = project.new_recipe("pyspark")
# Set the input
builder.with_input("sales")
# Create a new managed dataset for the output in the filesystem_managed connection
builder.with_new_output_dataset("sales_pyspark_2", "filesystem_managed")
# Set the code - the builder returned by new_recipe("pyspark") has a ``with_script`` method
builder.with_script("""
# -*- coding: utf-8 -*-
import dataiku
from dataiku import spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
# Read recipe inputs
sales = dataiku.Dataset("sales")
sales_df = dkuspark.get_dataframe(sqlContext, sales)
# Compute recipe outputs from inputs
# TODO: Replace this part by your actual code that computes the output, as a SparkSQL dataframe
sales_pyspark_2_df = sales_df # For this sample code, simply copy input to output
# Write recipe outputs (the dataset name must match the output created above)
sales_pyspark_2 = dataiku.Dataset("sales_pyspark_2")
dkuspark.write_with_schema(sales_pyspark_2, sales_pyspark_2_df)
""")
recipe = builder.create()
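Since the original question is about automating 1000+ recipes, the snippet above can be wrapped in a loop. Below is a minimal sketch of that idea: a pure-Python template function renders the recipe body for each input/output pair, and a helper drives the recipe builder. The dataset names, the create_recipes/build_script helper names, and the filesystem_managed connection are illustrative assumptions, not part of the original answer.

```python
# Sketch: batch-create PySpark recipes from (input, output) pairs.
# Helper names and dataset names are placeholders for illustration.

PYSPARK_TEMPLATE = """\
# -*- coding: utf-8 -*-
import dataiku
from dataiku import spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

# Read recipe inputs
{inp}_df = dkuspark.get_dataframe(sqlContext, dataiku.Dataset("{inp}"))

# TODO: replace with the real transformation logic
{out}_df = {inp}_df

# Write recipe outputs
dkuspark.write_with_schema(dataiku.Dataset("{out}"), {out}_df)
"""

def build_script(inp, out):
    """Render the PySpark recipe body for one input/output pair."""
    return PYSPARK_TEMPLATE.format(inp=inp, out=out)

def create_recipes(project, pairs, connection="filesystem_managed"):
    """Create one PySpark recipe per (input, output) pair."""
    for inp, out in pairs:
        builder = project.new_recipe("pyspark")
        builder.with_input(inp)
        builder.with_new_output_dataset(out, connection)
        builder.with_script(build_script(inp, out))
        builder.create()

# Inside DSS you would run something like:
# import dataiku
# client = dataiku.api_client()
# project = client.get_project(dataiku.default_project_key())
# create_recipes(project, [("sales", "sales_pyspark"), ("orders", "orders_pyspark")])
```

Note the template assumes dataset names are valid Python identifiers (they are used as variable names inside the generated script); sanitize them first if that is not guaranteed.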
Hi @AlexT
Thanks for the snippet, it's clear. One follow-up question: if I want to set the Spark configuration and the Python environment by name when I create the recipe, how can I do that?
Best,
Sung-Lin
Thanks for the reply. I managed to write the script. However, I set the Spark profile through the get_recipe_params() method, not get_recipe_raw_definition().
Best,
Sung-Lin
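For readers following along, here is a minimal sketch of the variant Sung-Lin describes: setting the Spark profile through get_recipe_params() rather than get_recipe_raw_definition(). It assumes the params dict exposes the same sparkConfig structure as the raw definition shown later in the thread; the helper names and the profile name are placeholders.

```python
# Sketch: set a recipe's Spark profile via get_recipe_params().
# Helper names are illustrative; the profile name is taken from the
# snippet elsewhere in this thread - substitute your own.

def spark_profile_config(profile_name):
    """Build the sparkConfig payload that inherits a named Spark profile."""
    return {"inheritConf": profile_name, "conf": []}

def set_spark_profile(recipe, profile_name):
    """Point a recipe's params at a named Spark configuration profile."""
    settings = recipe.get_settings()
    settings.get_recipe_params()["sparkConfig"] = spark_profile_config(profile_name)
    settings.save()

# Inside DSS:
# import dataiku
# client = dataiku.api_client()
# project = client.get_project(dataiku.default_project_key())
# set_spark_profile(project.get_recipe("compute_sales_pyspark"),
#                   "spark-S-3-workers-of-1-CPU-3Gb-Ram")
```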
Hi @SungLinChan ,
https://developer.dataiku.com/latest/concepts-and-examples/recipes.html#setting-the-code-env-of-a-co...
import dataiku
#change code env
client = dataiku.api_client()
project = client.get_project(dataiku.default_project_key())
recipe = project.get_recipe("compute_sales_pyspark")
settings = recipe.get_settings()
# Use this to set the recipe to use a specific code env
settings.set_code_env(code_env="datascience")
print(settings)
settings.save()
#### Spark config
import dataiku
# change the Spark configuration
client = dataiku.api_client()
project = client.get_project(dataiku.default_project_key())
recipe = project.get_recipe("compute_sales_pyspark")
settings = recipe.get_settings()
settings.get_recipe_raw_definition()['sparkConfig'] = {'inheritConf': 'spark-S-3-workers-of-1-CPU-3Gb-Ram', 'conf': []}
settings.save()
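Putting the two snippets above together for the batch use case, the sketch below applies a code env and a Spark profile to every recipe in a list. The apply_spark_profile/configure_recipes helper names are assumptions for illustration; the mutation of the raw-definition dict mirrors the snippet above.

```python
# Sketch: apply a code env and a Spark profile to many recipes at once.
# Helper names, the code env name, and the profile name are placeholders.

def apply_spark_profile(raw_definition, profile_name):
    """Mutate a recipe raw-definition dict to inherit a named Spark profile."""
    raw_definition["sparkConfig"] = {"inheritConf": profile_name, "conf": []}
    return raw_definition

def configure_recipes(project, recipe_names, code_env, spark_profile):
    """Set the code env and Spark profile on each named recipe."""
    for name in recipe_names:
        settings = project.get_recipe(name).get_settings()
        settings.set_code_env(code_env=code_env)
        apply_spark_profile(settings.get_recipe_raw_definition(), spark_profile)
        settings.save()

# Inside DSS:
# import dataiku
# client = dataiku.api_client()
# project = client.get_project(dataiku.default_project_key())
# configure_recipes(project, ["compute_sales_pyspark"],
#                   "datascience", "spark-S-3-workers-of-1-CPU-3Gb-Ram")
```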