
Spark pipeline in scenario

Solved!
julesbertrand
Level 2

Hi all,

I am trying to use a Spark pipeline between two datasets, which we will call Raw and Clean. I have multiple Spark SQL recipes to build Clean from Raw, and I want to execute all of them in a scenario.

In order to save time and memory, I want to pipeline them, but I can't find a way to do it in a scenario. To be clear, I want to execute all the recipes from Raw to Clean in a single Spark pipeline and be able to schedule this job. Is there a way to do so?

Thank you

Jules

1 Solution
AlexT
Dataiker

Hi,

If you add individual datasets to build, it won't use Spark pipelines, as each step is executed in isolation.

One thing you can try is to set the rebuild behavior of the datasets you don't want rebuilt as part of the recursive build to explicit: https://doc.dataiku.com/dss/latest/flow/building-datasets.html#preventing-a-dataset-from-being-built

Also more information on Spark pipelines here:

https://community.dataiku.com/t5/Using-Dataiku/Obtain-the-best-use-of-Spark-Pipeline/m-p/20974
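For readers who prefer to script this rather than click through each dataset's settings, the change amounts to flipping one field in the dataset's settings. Below is a minimal, self-contained sketch of that change; the field names (`flowOptions`, `rebuildBehavior`, `"EXPLICIT_ONLY"`) are assumptions about the dataset definition's JSON shape, so verify them against your DSS version's API documentation before relying on them:

```python
# Sketch only: mark a dataset so recursive builds skip it.
# In the DSS UI this is the dataset's "Rebuild behavior" setting.
# The "flowOptions" / "rebuildBehavior" field names below are
# assumptions for illustration, not a documented contract.

def mark_explicit_rebuild(dataset_settings: dict) -> dict:
    """Set a dataset's rebuild behavior to explicit-only, so that
    recursive builds triggered downstream do not rebuild it."""
    flow_options = dataset_settings.setdefault("flowOptions", {})
    flow_options["rebuildBehavior"] = "EXPLICIT_ONLY"
    return dataset_settings

# Example: a dataset that would normally be rebuilt recursively.
settings = {"flowOptions": {"rebuildBehavior": "NORMAL"}}
mark_explicit_rebuild(settings)
print(settings["flowOptions"]["rebuildBehavior"])  # EXPLICIT_ONLY
```

With the intermediate datasets protected this way, a single recursive "build Clean" step in the scenario can still form one Spark pipeline without rebuilding everything upstream of where you want to start.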

3 Replies
AlexT
Dataiker

@julesbertrand ,

This would be done automatically if Spark pipelines are enabled and DSS is able to create a pipeline.

https://doc.dataiku.com/dss/latest/spark/pipelines.html

You can check whether a Spark pipeline has been created by looking at the job from the scenario.

Thanks,

julesbertrand
Level 2
Author

Hi Alex, 

If the only step in my scenario is "build the last dataset", it does create a Spark pipeline as you said; however, all prior datasets in the flow are rebuilt. I want to choose from where to where my scenario will build datasets. And if I add multiple "build" steps in the scenario (one for each dataset or so), I don't get the Spark pipeline...

Do you have any advice?

Thank you

Jules
