How to execute pyspark code from a scenario?

Solved!
nmadhu20

Hi Dataiku Team,

I have a project-specific requirement to execute PySpark code through a scenario.

I am aware that we have a PySpark recipe option, but the code needs to dynamically read input datasets that cannot be set as inputs of that recipe. As a workaround, we thought of adding the code to the project's library and importing it into a scenario for execution.

But there is no option to select a PySpark engine when executing a step in a scenario. We have explicit options for adding a SQL/Python step, but not for PySpark.

 

Is there any way to resolve this?

Any help is appreciated.

4 Replies
AlexT
Dataiker

Hi,

You can create a PySpark recipe in the flow without an input dataset; you must, however, have at least one output dataset selected. Create a new recipe, select an output, and then you can execute your PySpark code and trigger it via the scenario. Would this satisfy your use case?
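For reference, a minimal skeleton for such an output-only PySpark recipe could look like the sketch below. The dataset names are placeholders, not from this thread, and note that reading a dataset that is not a declared recipe input can trip DSS's flow-consistency check (which is what the rest of this thread deals with):

```python
import dataiku
import dataiku.spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext

# Standard DSS PySpark recipe boilerplate
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

# Read a dataset by name instead of through a declared recipe input
# ("some_input_dataset" is a placeholder)
input_ds = dataiku.Dataset("some_input_dataset")
df = dkuspark.get_dataframe(sqlContext, input_ds)

# ... transformations ...

# Write to the recipe's single declared output ("my_output" is a placeholder)
output_ds = dataiku.Dataset("my_output")
dkuspark.write_with_schema(output_ds, df)
```

The scenario then only needs a Build step on the output dataset; a Python scenario step can do the same thing programmatically:

```python
from dataiku.scenario import Scenario

# Building the recipe's output runs the PySpark recipe
# ("my_output" is the placeholder output dataset from above)
Scenario().build_dataset("my_output")
```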

[Screenshot: creating a new PySpark recipe with a single output dataset]

nmadhu20
Author

Hey Alex,

Thank you for your reply.

I tried your approach but I'm still getting the same error. To give you more context: there are two types of input datasets I need to read. One is present in the current flow (where the recipe is), and the others reside in different flows and need to be read dynamically, one after another.

Enclosing a screenshot of the error.

AlexT
Dataiker

If you need to read datasets from other flows, the best approach is to use exposed objects for the datasets that reside in other projects: https://doc.dataiku.com/dss/latest/security/exposed-objects.html

Another option is to add ignore_flow=True in the constructor of the Dataset() class. See https://community.dataiku.com/t5/Using-Dataiku-DSS/Can-not-import-Dataiku-dataset-in-Python-recipe-w... for more information on why using shared datasets between projects is the preferred option.
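As a concrete sketch of both options (the project and dataset names are placeholders; the fully qualified "PROJECT.dataset" form for option 1 assumes the dataset has been exposed to the current project):

```python
import dataiku

# Option 1: dataset exposed from another project, addressed by its
# fully qualified name ("OTHER_PROJECT" and "their_dataset" are placeholders)
exposed_ds = dataiku.Dataset("OTHER_PROJECT.their_dataset")
df1 = exposed_ds.get_dataframe()

# Option 2: bypass the flow-consistency check so the dataset does not
# have to be a declared recipe input
dynamic_ds = dataiku.Dataset("OTHER_PROJECT.their_dataset", ignore_flow=True)
df2 = dynamic_ds.get_dataframe()
```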

nmadhu20
Author

Adding ignore_flow=True in the constructor of the Dataset() class - this fix worked for me.

I am familiar with exposed objects, but that approach does not fit the use case: this will be a central governing flow of sorts that monitors all the other projects' datasets, and that set of datasets is huge.
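For readers with a similar central-monitoring requirement, the overall pattern could look roughly like the sketch below. It assumes the code runs inside DSS; api_client(), list_project_keys(), and list_datasets() are from the Dataiku public API, but the monitoring logic and the sampling limit are purely illustrative:

```python
import dataiku

client = dataiku.api_client()

# Walk every project and every dataset, reading each one dynamically.
# ignore_flow=True is what allows reads that are not declared recipe inputs.
for project_key in client.list_project_keys():
    project = client.get_project(project_key)
    for ds_info in project.list_datasets():  # dataset descriptors with a "name" key
        ds = dataiku.Dataset("%s.%s" % (project_key, ds_info["name"]),
                             ignore_flow=True)
        sample = ds.get_dataframe(limit=1000)  # head sample; illustrative limit
        # ... run whatever monitoring checks apply to the sample ...
```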
