Keyword for creating a SparkSQL recipe via the Python API
Hi Team,
Could you please help me clear up the doubts below?
The use case is that the output of a window recipe (stored on S3) would be the input to a SparkSQL recipe, whose query we will build dynamically in Python, add to the recipe, and run from Python. The output will also be an S3 dataset.
builder = project.new_recipe(??) #what would be the keyword for sql_spark?
builder.with_input(output_df)
builder.with_new_output("name_of_output_dataset", "s3_connection_name")
recipe = builder.create()
recipe_settings.get_json_payload() #to add the query to the recipe, which keyword is used?
Thank you in advance.
Best Answer
-
Hello,
Can you explain a bit more why you need to create a SparkSQL recipe programmatically? Creating such recipes programmatically is not easy, so there might be a simpler way to solve your issue. You can do it, but not from the new_recipe method; you'll need to use the create_recipe method, which is harder to use. You can use the following code. If you need to find which values to use for specific parameters, such as a dataset's storage format for example, feel free to create such an object via the UI and then copy its settings.
%pylab inline
import dataiku
from dataiku import pandasutils as pdu
import pandas as pd

# Instantiate the client
client = dataiku.api_client()
project = client.get_project("PROJECT_KEY")

output_dataset_name = "output_dataset_name"
input_dataset_name = "input_dataset_name"
output_dataset_connection = 'connection_name'
recipe_name = "compute_" + output_dataset_name

# First, the output dataset needs to be created
project.create_dataset(output_dataset_name,
                       formatType="csv",
                       params={'connection': output_dataset_connection,
                               'filesSelectionRules': {'excludeRules': [],
                                                       'explicitFiles': [],
                                                       'includeRules': [],
                                                       'mode': 'ALL'}},
                       type="HDFS")

# Then, we can create the recipe
recipe_proto = {}
recipe_proto["type"] = "spark_sql_query"
recipe_proto["name"] = recipe_name
recipe_proto["inputs"] = {'main': {'items': [{'appendMode': False, 'ref': input_dataset_name}]}}
recipe_proto["outputs"] = {'main': {'items': [{'appendMode': False, 'ref': output_dataset_name}]}}

creation_settings = {
    "useGlobalMetastore": True,
    "forcePipelineableForTests": False,
}

spsql_recipe = project.create_recipe(recipe_proto, creation_settings)

# Finally, modify the recipe's code
recipe_def = spsql_recipe.get_definition_and_payload()
recipe_def.set_payload("YOUR SQL CODE HERE")
spsql_recipe.set_definition_and_payload(recipe_def)
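As a side note, if you want to copy the settings of a recipe created through the UI (as suggested above), a minimal sketch of how to inspect it could look like the following; "ui_created_recipe_name" is a placeholder for an existing recipe's name:

import json

# Inspect a recipe that was created via the UI, to copy its settings
# into recipe_proto / creation_settings above.
# "ui_created_recipe_name" is a placeholder for an existing recipe's name.
ui_recipe = project.get_recipe("ui_created_recipe_name")
ui_def = ui_recipe.get_definition_and_payload()

# Raw definition: type, inputs, outputs, engine parameters
print(json.dumps(ui_def.get_recipe_raw_definition(), indent=2))

# Payload: the recipe's code (here, the SparkSQL query)
print(ui_def.get_payload())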
Answers
-
Thanks for your reply.
I created a SQL recipe instead of a SparkSQL recipe, and it was indeed much more straightforward.
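For reference, a minimal sketch of what that simpler route can look like with the new_recipe builder, assuming the "sql_query" recipe type and placeholder names for the project, datasets and connection:

import dataiku

# Connect to the project (placeholder project key)
client = dataiku.api_client()
project = client.get_project("PROJECT_KEY")

# "sql_query" is the plain SQL code recipe type
builder = project.new_recipe("sql_query")
builder.with_input("input_dataset_name")
builder.with_new_output("output_dataset_name", "sql_connection_name")

# The query can be generated dynamically in Python before creating the recipe
query = "SELECT * FROM input_dataset_name"
builder.with_script(query)

recipe = builder.create()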
-
sdkayb
Thank you for your answer, it was very helpful for my case.
Additionally, I would like to have partitioned outputs. Typically, when creating a SparkSQL recipe through the UI, the output is automatically partitioned, but this doesn't seem to be the case when using the Dataiku API. Is there a way to achieve this with the Python API? Your assistance on this matter would be greatly appreciated. Thank you very much in advance.
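Not a definitive answer, but one possible direction is to set the partitioning scheme directly on the output dataset after creating it. The snippet below is only a sketch; the exact contents of the 'partitioning' block (the dimension name and file path pattern here are placeholders) are easiest to copy from a dataset that was partitioned through the UI:

# Sketch only: activate partitioning on a file-based (S3/HDFS) output dataset.
# "country" and the path pattern are placeholders; copy the real values from
# a dataset partitioned via the UI.
output_dataset = project.get_dataset(output_dataset_name)
settings = output_dataset.get_settings()
settings.get_raw()['partitioning'] = {
    'dimensions': [{'name': 'country', 'type': 'value'}],
    'filePathPattern': '%{country}/.*',
}
settings.save()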