How to execute pyspark code from a scenario?

nmadhu20 Neuron, Registered, Neuron 2022, Neuron 2023 Posts: 35 Neuron

Hi Dataiku Team,

I have a project-specific requirement to execute PySpark code through a scenario.

I am aware that there is a PySpark recipe option, but the code needs to read its input datasets dynamically, so they cannot be declared as inputs of that recipe. As a workaround, we thought of adding the code to the project's library and importing it into a scenario for execution.

However, there is no option to select a PySpark engine when executing a step in a scenario. There are explicit options for adding a SQL or Python step, but none for PySpark.

Is there any way to resolve this?

Any help is appreciated.

Answers

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,209 Dataiker

    Hi,

    You can create a PySpark recipe in the flow without an input dataset. You must, however, select at least one output dataset. Create a new recipe, select one output, and put your PySpark code in it. You can then trigger the recipe from the scenario by building its output. Would this satisfy your use case?

    Screenshot 2021-08-23 at 11.21.16.png
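
    A minimal sketch of what the body of such a recipe might look like, assuming the standard Dataiku PySpark recipe imports. The output dataset name "scenario_output" and the placeholder DataFrame are hypothetical, and the imports sit inside the function only so the sketch can be loaded outside DSS:

```python
def run_recipe(output_name):
    """Body of a PySpark recipe with no declared input and one output."""
    # These modules are only available inside a DSS PySpark recipe.
    import dataiku
    from dataiku import spark as dkuspark
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext.getOrCreate()
    sql_context = SQLContext(sc)

    # Placeholder computation; the real logic would go here.
    df = sql_context.createDataFrame([(1, "ok")], ["id", "status"])

    # The recipe needs at least one declared output to write to.
    # output_name is a hypothetical dataset name supplied by the caller.
    output = dataiku.Dataset(output_name)
    dkuspark.write_with_schema(output, df)


# Inside the recipe you would call, for example:
# run_recipe("scenario_output")
```

    A scenario step that builds the output dataset would then run this recipe.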

  • nmadhu20
    nmadhu20 Neuron, Registered, Neuron 2022, Neuron 2023 Posts: 35 Neuron

    Hey Alex,

    Thank you for your reply.

    I tried your approach, but I'm still getting the same error. To give you more context: there are two types of input datasets that I need to read. One is present in the current flow (where the recipe is), and the others reside in different flows and need to be read dynamically, one after another.

    I am enclosing a screenshot of the error (image.png).

  • nmadhu20
    nmadhu20 Neuron, Registered, Neuron 2022, Neuron 2023 Posts: 35 Neuron

    Another option is to add ignore_flow=True in the constructor of the Dataset() class. This fix worked for me.

    I am familiar with exposed objects, but that approach does not fit the use case: this will be a central governing flow of sorts that monitors the datasets of all the other projects, which is a huge set.
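
    As a rough illustration of the ignore_flow=True fix mentioned above, the sketch below reads a list of datasets that are not declared as recipe inputs, including ones in other projects. The "PROJECT.dataset" reference format, the helper names, and the example project keys are assumptions made for illustration only:

```python
def split_ref(ref, default_project):
    """Split 'PROJECT.dataset' into (project_key, dataset_name).

    A bare dataset name is assumed to live in default_project.
    """
    if "." in ref:
        project_key, name = ref.split(".", 1)
        return project_key, name
    return default_project, ref


def read_datasets(refs, default_project, sql_context):
    """Return {ref: Spark DataFrame} for datasets outside the recipe's inputs."""
    # These modules are only available inside DSS.
    import dataiku
    from dataiku import spark as dkuspark

    frames = {}
    for ref in refs:
        project_key, name = split_ref(ref, default_project)
        # ignore_flow=True skips the check that the dataset is a declared
        # input or output of the current recipe.
        dataset = dataiku.Dataset(name, project_key=project_key, ignore_flow=True)
        frames[ref] = dkuspark.get_dataframe(sql_context, dataset)
    return frames


# Hypothetical usage inside a PySpark recipe:
# frames = read_datasets(["SALES.orders", "local_ds"], "MONITOR", sql_context)
```

    Because the datasets are read outside the flow's declared dependencies, DSS will presumably not track them as inputs, so the list of references has to be maintained by the code itself.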
