How to execute PySpark code from a scenario?
Hi Dataiku Team,
I have a project-specific requirement to execute PySpark code through a scenario.
I am aware that there is a PySpark recipe option, but the code needs to dynamically read input datasets that cannot be set as inputs of that recipe. As a workaround, we thought of adding the code to the project's library and importing it into a scenario for execution.
However, there is no option to select a PySpark engine when executing a step in a scenario. There are explicit options for adding a SQL or Python step, but not for PySpark.
Is there any way to resolve this?
Any help is appreciated.
Best Answer
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,227 Dataiker
If you need to read datasets from other flows, the best approach is to use exposed objects. See https://doc.dataiku.com/dss/latest/security/exposed-objects.html for datasets that reside in other projects.
Another option is to add ignore_flow=True in the constructor of the Dataset() class. See https://community.dataiku.com/t5/Using-Dataiku-DSS/Can-not-import-Dataiku-dataset-in-Python-recipe-which-is-not-set/m-p/3232 for more information on why using shared datasets between projects is the preferred option.
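For reference, a minimal sketch of a scenario Python step reading datasets this way. The project keys, dataset names, and the `read_datasets` helper are hypothetical; only `dataiku.Dataset(..., ignore_flow=True)` and `get_dataframe()` come from the DSS Python API.

```python
# Sketch (assumptions noted above): read datasets that are NOT declared
# as inputs of the current step, using ignore_flow=True.

def qualified_name(project_key, dataset_name):
    """Build the PROJECT.dataset form that dataiku.Dataset() accepts."""
    return f"{project_key}.{dataset_name}"

def read_datasets(targets):
    """targets: list of (project_key, dataset_name) pairs to read in turn."""
    import dataiku  # available inside DSS, not on a plain Python install
    frames = {}
    for project_key, name in targets:
        # ignore_flow=True lets the Dataset be opened even though it is not
        # an input of the current recipe/scenario step.
        ds = dataiku.Dataset(qualified_name(project_key, name),
                             ignore_flow=True)
        frames[name] = ds.get_dataframe()
    return frames
```

Note that this bypasses the flow dependency check, so DSS will not know these datasets are consumed here; that is why exposed objects remain the preferred option when they fit.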
Answers
-
Alexandru (Dataiker)
Hi,
You can create a PySpark recipe in the flow without an input dataset; you must, however, have at least one output dataset selected. Create a new recipe, select one output, and you can then execute your PySpark code and trigger the recipe via the scenario. Would this satisfy your use case?
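As a hedged sketch of what such an output-only PySpark recipe body could look like: the dataset names, the output name `my_output`, and the `input_datasets` project variable are assumptions for illustration; `dataiku.spark.get_dataframe` and `write_with_schema` are the DSS Spark helpers.

```python
# Sketch of a PySpark recipe with one dummy output and no declared inputs.
# Which datasets to read is driven by a (hypothetical) project variable.

def datasets_to_read(variables):
    """Parse a comma-separated 'input_datasets' variable into names."""
    raw = variables.get("input_datasets", "")
    return [name.strip() for name in raw.split(",") if name.strip()]

def run():
    # These imports only resolve inside a DSS PySpark recipe.
    import dataiku
    import dataiku.spark as dkuspark
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext.getOrCreate()
    sqlContext = SQLContext(sc)

    names = datasets_to_read(dataiku.get_custom_variables())
    frames = []
    for name in names:
        ds = dataiku.Dataset(name)  # dataset in the current project
        frames.append(dkuspark.get_dataframe(sqlContext, ds))

    # Combine and write to the single (dummy) output declared on the recipe.
    result = frames[0]
    for df in frames[1:]:
        result = result.unionByName(df)
    dkuspark.write_with_schema(dataiku.Dataset("my_output"), result)
```

The dummy output is what satisfies the flow's requirement of at least one output; the real inputs are resolved by name at run time.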
-
Hey Alex,
Thank you for your reply.
I tried your approach but I'm still getting the same error. To give you more context: there are two types of input datasets that I need to read. One is present in the current flow (where the recipe is), and the others reside in different flows and need to be read dynamically, one after another.
Enclosing a screenshot of the error.
-
"Another option is to add ignore_flow=True in the constructor of the Dataset() class" - this fix worked for me.
I am familiar with exposed objects, but that approach does not fit the use case: this will be a central governing flow of sorts that monitors datasets across all the other projects, and the number of datasets to expose would be huge.