How to execute PySpark code from a scenario?
Hi Dataiku Team,
I have a project-specific requirement to execute PySpark code through a scenario.
I am aware that there is a PySpark recipe option, but the code needs to dynamically read input datasets that cannot be set as inputs of that recipe. As a workaround, we thought of adding the code to the project's library and importing it into a scenario for execution.
However, there is no option to select a PySpark engine when executing a step in a scenario. There are explicit options for adding a SQL or Python step, but not for PySpark.
Is there any way to resolve this?
Any help is appreciated.
Best Answer
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,227 Dataiker
If you need to read datasets from other flows, the best approach is to use exposed objects. See https://doc.dataiku.com/dss/latest/security/exposed-objects.html for datasets that reside in other projects.
Another option is to add ignore_flow=True in the constructor of the Dataset() class. See https://community.dataiku.com/t5/Using-Dataiku-DSS/Can-not-import-Dataiku-dataset-in-Python-recipe-which-is-not-set/m-p/3232 for more information on why using shared datasets between projects is the preferred option.
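For reference, a minimal sketch of a scenario Python step reading datasets this way. The project keys, dataset names, and the `read_datasets` helper are hypothetical; only `dataiku.Dataset(..., ignore_flow=True)` and `get_dataframe()` come from the DSS Python API.

```python
# Sketch (assumptions noted above): read datasets that are NOT declared
# as inputs of the current step, using ignore_flow=True.

def qualified_name(project_key, dataset_name):
    """Build the PROJECT.dataset form that dataiku.Dataset() accepts."""
    return f"{project_key}.{dataset_name}"

def read_datasets(targets):
    """targets: list of (project_key, dataset_name) pairs to read in turn."""
    import dataiku  # available inside DSS, not on a plain Python install
    frames = {}
    for project_key, name in targets:
        # ignore_flow=True lets the Dataset be opened even though it is not
        # an input of the current recipe/scenario step.
        ds = dataiku.Dataset(qualified_name(project_key, name),
                             ignore_flow=True)
        frames[name] = ds.get_dataframe()
    return frames
```

Note that this bypasses the flow dependency check, so DSS will not know these datasets are consumed here; that is why exposed objects remain the preferred option when they fit.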
Answers
-
Alexandru (Dataiker)
Hi,
You can create a PySpark recipe in the flow without an input dataset; you must, however, have at least one output dataset selected. Create a new recipe, select one output, and you can then execute your PySpark code and trigger the recipe via the scenario. Would this satisfy your use case?
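As a hedged sketch of what such an output-only PySpark recipe body could look like: the dataset names, the output name `my_output`, and the `input_datasets` project variable are assumptions for illustration; `dataiku.spark.get_dataframe` and `write_with_schema` are the DSS Spark helpers.

```python
# Sketch of a PySpark recipe with one dummy output and no declared inputs.
# Which datasets to read is driven by a (hypothetical) project variable.

def datasets_to_read(variables):
    """Parse a comma-separated 'input_datasets' variable into names."""
    raw = variables.get("input_datasets", "")
    return [name.strip() for name in raw.split(",") if name.strip()]

def run():
    # These imports only resolve inside a DSS PySpark recipe.
    import dataiku
    import dataiku.spark as dkuspark
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext.getOrCreate()
    sqlContext = SQLContext(sc)

    names = datasets_to_read(dataiku.get_custom_variables())
    frames = []
    for name in names:
        ds = dataiku.Dataset(name)  # dataset in the current project
        frames.append(dkuspark.get_dataframe(sqlContext, ds))

    # Combine and write to the single (dummy) output declared on the recipe.
    result = frames[0]
    for df in frames[1:]:
        result = result.unionByName(df)
    dkuspark.write_with_schema(dataiku.Dataset("my_output"), result)
```

The dummy output is what satisfies the flow's requirement of at least one output; the real inputs are resolved by name at run time.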
-
Hey Alex,
Thank you for your reply.
I tried your approach but I'm still getting the same error. To give you more context: there are two types of input datasets that I need to read. One is present in the current flow (where the recipe is), and the others reside in different flows and need to be read dynamically, one after another.
Enclosing a screenshot of the error.
-
"Another option is to add ignore_flow=True in the constructor of the Dataset() class" - this fix worked for me.
I am familiar with exposed objects, but that approach does not fit the use case: this will be a central governing flow of sorts that monitors datasets across all the other projects, and the number of datasets to expose would be huge.