Create PySpark recipes with the Dataiku API

SungLinChan

Hi,

We have a use case where we need to create many PySpark recipes from different inputs and outputs (1000+ recipes in the Flow initially), and we want to implement a script that automates their creation.

I checked project.new_recipe() and the Recipes section in the developer guide, but the API documentation does not mention a pyspark type. Does this API support PySpark recipe creation?

If not, what would be an alternative way to do it?

Best,

Sung-Lin


Answers

  • Alexandru (Dataiker)

    Hi @SungLinChan,

    You can use the API in this case as well; it works just like creating a Python recipe, you simply change the type to "pyspark". For example:

    https://developer.dataiku.com/latest/api-reference/python/recipes.html

    import dataiku
    
    client = dataiku.api_client()
    project = client.get_project(dataiku.default_project_key())
    
    builder = project.new_recipe("pyspark")
    
    # Set the input
    builder.with_input("sales")
    # Create a new managed dataset for the output in the filesystem_managed connection
    builder.with_new_output_dataset("sales_pyspark_2", "filesystem_managed")
    
    # Set the code - the code recipe builder has a ``with_script`` method
    builder.with_script("""
    # -*- coding: utf-8 -*-
    import dataiku
    from dataiku import spark as dkuspark
    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    
    sc = SparkContext.getOrCreate()
    sqlContext = SQLContext(sc)
    
    # Read recipe inputs
    sales = dataiku.Dataset("sales")
    sales_df = dkuspark.get_dataframe(sqlContext, sales)
    
    # Compute recipe outputs from inputs
    # TODO: Replace this part with your actual code that computes the output, as a Spark dataframe
    sales_pyspark_2_df = sales_df # For this sample code, simply copy input to output
    
    # Write recipe outputs to the dataset created by with_new_output_dataset above
    sales_pyspark_2 = dataiku.Dataset("sales_pyspark_2")
    dkuspark.write_with_schema(sales_pyspark_2, sales_pyspark_2_df)
    """)
    
    recipe = builder.create()
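    Since you mention creating 1000+ recipes, the same builder calls can be wrapped in a loop over your input/output pairs. Below is a minimal sketch, assuming a hypothetical list of (input, output) dataset names and reusing the script template above; adapt the names and the connection to your project:

    import dataiku
    
    client = dataiku.api_client()
    project = client.get_project(dataiku.default_project_key())
    
    # Hypothetical list of (input dataset, output dataset) pairs to automate
    pairs = [
        ("sales", "sales_pyspark"),
        ("orders", "orders_pyspark"),
    ]
    
    # Script template based on the snippet above; {inp} / {out} are filled in per pair
    script_template = """
    import dataiku
    from dataiku import spark as dkuspark
    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    
    sc = SparkContext.getOrCreate()
    sqlContext = SQLContext(sc)
    
    df = dkuspark.get_dataframe(sqlContext, dataiku.Dataset("{inp}"))
    dkuspark.write_with_schema(dataiku.Dataset("{out}"), df)
    """
    
    for inp, out in pairs:
        builder = project.new_recipe("pyspark")
        builder.with_input(inp)
        builder.with_new_output_dataset(out, "filesystem_managed")
        builder.with_script(script_template.format(inp=inp, out=out))
        recipe = builder.create()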
  • SungLinChan

    Hi @AlexT

    Thanks for the snippet. It's clear. One follow-up question: if I would like to set the Spark configuration and the Python environment by name when I create this recipe, how can I do it?

    Best,

    Sung-Lin

  • Alexandru (Dataiker)

    Hi @SungLinChan,

    You can set both on the recipe settings after creating it, e.g.:

    https://developer.dataiku.com/latest/concepts-and-examples/recipes.html#setting-the-code-env-of-a-code-recipe

    import dataiku
    # Change the code env of the recipe
    client = dataiku.api_client()
    project = client.get_project(dataiku.default_project_key())
    
    recipe = project.get_recipe("compute_sales_pyspark")
    settings = recipe.get_settings()
    
    # Use this to set the recipe to use a specific code env
    settings.set_code_env(code_env="datascience")
    print(settings)
    settings.save()
    
    
    #### Spark config 
    
    import dataiku
    # Change the Spark config of the recipe
    client = dataiku.api_client()
    project = client.get_project(dataiku.default_project_key())
    
    recipe = project.get_recipe("compute_sales_pyspark")
    settings = recipe.get_settings()
    settings.get_recipe_raw_definition()['sparkConfig'] = {'inheritConf': 'spark-S-3-workers-of-1-CPU-3Gb-Ram', 'conf': []}
    settings.save()
    

  • SungLinChan

    @AlexT

    Thanks for the reply. I managed to write the script. However, I set the Spark profile through the 'get_recipe_params()' method, not 'get_recipe_raw_definition()'.
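    For reference, a minimal sketch of what that could look like; the recipe name and Spark configuration name are just examples, and the assumption that sparkConfig is exposed in the recipe params for this recipe type is worth double-checking on your instance:

    import dataiku
    
    client = dataiku.api_client()
    project = client.get_project(dataiku.default_project_key())
    
    recipe = project.get_recipe("compute_sales_pyspark")
    settings = recipe.get_settings()
    
    # Assumption: the Spark settings are part of the recipe params for PySpark recipes
    params = settings.get_recipe_params()
    params['sparkConfig'] = {'inheritConf': 'spark-S-3-workers-of-1-CPU-3Gb-Ram', 'conf': []}
    settings.save()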

    Best,

    Sung-Lin
