
Keyword for creating sql spark recipe via Python API

Solved!
nmadhu20
Neuron

Hi Team,

Could you please help me clear up the doubts below?

The use case is that the output of a window recipe (on S3) would be the input to a SparkSQL recipe, whose query we will build dynamically in Python, add to the recipe, and run from Python. The output will also be an S3 dataset.

builder = project.new_recipe(??)  # what would be the keyword for sql_spark?
builder.with_input(output_df)
builder.with_new_output("name_of_output_dataset", "s3_connection_name")
recipe = builder.create()

recipe_settings.get_json_payload()  # which keyword is used to add the query to the recipe?

 

Thank you in advance.

2 Replies
AlexandreL
Dataiker

Hello,

Can you explain a bit more why you need to create a SparkSQL recipe programmatically? Creating such recipes this way is not easy, so there might be a simpler way to solve your issue. It can still be done, but not with the new_recipe method: you will need the create_recipe method, which is harder to use. You can use the following code. If you need to find the values to use for specific parameters (for example a dataset's storage format), feel free to create such an object via the UI and then copy its settings.

import dataiku

# Instantiate the API client
client = dataiku.api_client()

project = client.get_project("PROJECT_KEY")
output_dataset_name = "output_dataset_name"
input_dataset_name = "input_dataset_name"
output_dataset_connection = 'connection_name'
recipe_name = "compute_" + output_dataset_name

# First, the output dataset needs to be created
# (type="HDFS" here; use the dataset type matching your connection, e.g. "S3" for an S3 connection)
project.create_dataset(
    output_dataset_name,
    type="HDFS",
    formatType="csv",
    params={
        "connection": output_dataset_connection,
        "filesSelectionRules": {"excludeRules": [], "explicitFiles": [], "includeRules": [], "mode": "ALL"},
    },
)

# Then, we can create the recipe
recipe_proto = {}
recipe_proto["type"] = "spark_sql_query"
recipe_proto["name"] = recipe_name
recipe_proto["inputs"] = {"main": {"items": [{"appendMode": False, "ref": input_dataset_name}]}}
recipe_proto["outputs"] = {"main": {"items": [{"appendMode": False, "ref": output_dataset_name}]}}

creation_settings = {
    "useGlobalMetastore": True,
    "forcePipelineableForTests": False,
}

spsql_recipe = project.create_recipe(recipe_proto, creation_settings)

# Finally, modify the recipe's code
recipe_def = spsql_recipe.get_definition_and_payload()
recipe_def.set_payload("YOUR SQL CODE HERE")
spsql_recipe.set_definition_and_payload(recipe_def)
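
Once the payload is set, you would typically build the output dataset so that the newly created recipe actually runs. A minimal sketch, assuming this usual job-definition shape for start_job (the exact keys, and the "NP" value for a non-partitioned dataset, are assumptions not confirmed in this thread):

# Build the output dataset, which executes the new SparkSQL recipe
job_definition = {
    "type": "NON_RECURSIVE_FORCED_BUILD",  # rebuild only this dataset
    "outputs": [{"id": output_dataset_name, "partition": "NP"}],  # "NP" = non-partitioned (assumption)
}
job = project.start_job(job_definition)
print(job.get_status())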
nmadhu20
Neuron
Author

Thanks for your reply.

I created a SQL recipe instead of a SparkSQL one, and yes, it was much more straightforward and easier.
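
For reference, a plain SQL recipe can be created directly with new_recipe, which is what makes this route simpler. A minimal sketch of that approach, assuming the "sql_query" recipe type and the CodeRecipeCreator helpers with_input, with_new_output_dataset and with_script (dataset and connection names are the placeholders from the original question):

# Create a plain SQL recipe; no create_recipe proto needed here
builder = project.new_recipe("sql_query", "compute_name_of_output_dataset")
builder.with_input("input_dataset_name")
builder.with_new_output_dataset("name_of_output_dataset", "s3_connection_name")
builder.with_script("SELECT * FROM input_dataset_name")
recipe = builder.create()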
