Create PySpark recipes with the Dataiku API

SungLinChan Registered Posts: 5

Hi,

We have a use case where we need to create many PySpark recipes from different inputs and outputs (1000+ recipes for the initial Flow build), and we want to implement a script that automates their creation.

I checked project.new_recipe() and the Recipes section in the developer guide, but the API documentation doesn't list a pyspark type. Does this API support PySpark recipe creation?

If not, what would be an alternative way to do it?

Best,

Sung-Lin

Answers

  • Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,209

    Hi @SungLinChan,

    You can use the API in this case as well; it works just like a Python recipe, you simply change the type to pyspark. For example:

    https://developer.dataiku.com/latest/api-reference/python/recipes.html

    import dataiku

    client = dataiku.api_client()
    project = client.get_project(dataiku.default_project_key())

    builder = project.new_recipe("pyspark")
    # Set the input
    builder.with_input("sales")
    # Create a new managed dataset for the output in the filesystem_managed connection
    builder.with_new_output_dataset("sales_pyspark_2", "filesystem_managed")

    # Set the code - the recipe creator has a ``with_script`` method
    builder.with_script("""
    # -*- coding: utf-8 -*-
    import dataiku
    from dataiku import spark as dkuspark
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext.getOrCreate()
    sqlContext = SQLContext(sc)

    # Read recipe inputs
    sales = dataiku.Dataset("sales")
    sales_df = dkuspark.get_dataframe(sqlContext, sales)

    # Compute recipe outputs from inputs
    # TODO: Replace this part by your actual code that computes the output, as a SparkSQL dataframe
    sales_pyspark_2_df = sales_df  # For this sample code, simply copy input to output

    # Write recipe outputs
    sales_pyspark_2 = dataiku.Dataset("sales_pyspark_2")
    dkuspark.write_with_schema(sales_pyspark_2, sales_pyspark_2_df)
    """)

    recipe = builder.create()
  • SungLinChan Registered Posts: 5

    Hi @AlexT

    Thanks for the snippet. It's clear. One follow-up question: if I want to set the Spark configuration and the Python environment by name when I create this recipe, how can I do it?

    Best,

    Sung-Lin

  • Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,209

    Hi @SungLinChan,

    https://developer.dataiku.com/latest/concepts-and-examples/recipes.html#setting-the-code-env-of-a-code-recipe

    import dataiku

    # Change the code env
    client = dataiku.api_client()
    project = client.get_project(dataiku.default_project_key())
    recipe = project.get_recipe("compute_sales_pyspark")
    settings = recipe.get_settings()
    # Use this to set the recipe to use a specific code env
    settings.set_code_env(code_env="datascience")
    print(settings)
    settings.save()

    #### Spark config
    import dataiku

    client = dataiku.api_client()
    project = client.get_project(dataiku.default_project_key())
    recipe = project.get_recipe("compute_sales_pyspark")
    settings = recipe.get_settings()
    settings.get_recipe_raw_definition()['sparkConfig'] = {'inheritConf': 'spark-S-3-workers-of-1-CPU-3Gb-Ram', 'conf': []}
    settings.save()

  • SungLinChan Registered Posts: 5

    @AlexT

    Thanks for the reply. I managed to write the script. However, I set the Spark profile through the 'get_recipe_params()' method rather than 'get_recipe_raw_definition()' (see the sketch below).

    Best,

    Sung-Lin
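
    A minimal sketch of that approach, assuming the named Spark profile sits under the 'sparkConfig' key of the dict returned by get_recipe_params(); the recipe name and inheritConf value are just the placeholders from the earlier example:

    import dataiku

    client = dataiku.api_client()
    project = client.get_project(dataiku.default_project_key())

    recipe = project.get_recipe("compute_sales_pyspark")
    settings = recipe.get_settings()

    # Assumption: the named Spark configuration lives under 'sparkConfig'
    # in the recipe params rather than in the raw definition
    settings.get_recipe_params()['sparkConfig'] = {
        'inheritConf': 'spark-S-3-workers-of-1-CPU-3Gb-Ram',
        'conf': []
    }
    settings.save()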
