Spark pipeline in scenario

Solved!
julesbertrand
Level 2

Hi all,

I am trying to use a Spark pipeline between two datasets, which we will call Raw and Clean. I have multiple Spark SQL recipes to build Clean from Raw, and I want to execute all of them in a scenario.

In order to save time and memory, I want to pipeline them, but I can't find a way to do it in a scenario. To be clear, I want to execute all recipes from Raw to Clean in a single Spark pipeline and be able to schedule this job. Is there a way to do so?

Thank you

Jules

AlexT
Dataiker

@julesbertrand ,

This would be done automatically if Spark pipelines are enabled and DSS is able to create a pipeline.

https://doc.dataiku.com/dss/latest/spark/pipelines.html

You can check whether a Spark pipeline has been created by looking at the job launched from the scenario.

Thanks,
julesbertrand
Level 2
Author

Hi Alex, 

If the only step in my scenario is "build the last dataset", it does create a Spark pipeline as you said; however, all prior datasets in the flow are rebuilt. I want to choose from where to where my scenario builds datasets. And if I add multiple "build" steps in the scenario (one per dataset or so), I don't get the Spark pipeline...

Do you have any advice?

Thank you

Jules

AlexT
Dataiker

Hi,

If you add individual datasets to build, it won't use Spark pipelines, as each step is executed in isolation.

One thing you can try is to set the rebuild behavior of the datasets you don't want rebuilt as part of the recursive build to "Explicit": https://doc.dataiku.com/dss/latest/flow/building-datasets.html#preventing-a-dataset-from-being-built
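For reference, the same approach can be driven from a custom "Execute Python code" scenario step instead of a visual "Build" step. The sketch below is an assumption-laden illustration, not the only way to do it: it assumes the final dataset is named `Clean` (this thread's example), that Spark pipelines are enabled in the project settings, and that upstream datasets you don't want rebuilt have their rebuild behavior set to "Explicit" in the flow. It only runs inside DSS, where the `dataiku` package is available.

```python
# Custom Python scenario step (runs inside DSS, where the
# `dataiku` package is provided by the platform).
from dataiku.scenario import Scenario

scenario = Scenario()

# A single recursive build of the final dataset lets DSS group the
# intermediate Spark SQL recipes into one Spark pipeline, instead of
# one isolated job per "build" step. Datasets marked "Explicit" in
# the flow are skipped by the recursive build.
scenario.build_dataset("Clean", build_mode="RECURSIVE_BUILD")
```

Combined with the "Explicit" rebuild setting above, this gives you control over which part of the flow the scenario rebuilds while still keeping a single job that DSS can pipeline.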

There is also more information on Spark pipelines here:

https://community.dataiku.com/t5/Using-Dataiku/Obtain-the-best-use-of-Spark-Pipeline/m-p/20974
