Spark pipeline in scenario
Hi all,
I am trying to use a Spark pipeline between two datasets, which we will call Raw and Clean. I have multiple Spark SQL recipes to build Clean from Raw, and I want to execute all of them in a scenario.
To save time and memory, I want to pipeline them, but I can't find a way to do it in a scenario. To be clear, I want to execute all recipes from Raw to Clean in a single Spark pipeline and be able to schedule this job. Is there a way to do so?
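For reference, here is a rough sketch of the kind of build I'd like to schedule, written as a custom Python step in a scenario (assuming the dataiku scenario API; "Clean" is the dataset name from above):

    # Custom Python step in a DSS scenario (a sketch).
    from dataiku.scenario import Scenario

    # Build "Clean" recursively: all upstream Spark SQL recipes run in
    # one job, where DSS can merge them into a single Spark pipeline.
    scenario = Scenario()
    scenario.build_dataset("Clean", build_mode="RECURSIVE_FORCED_BUILD")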
Thank you
Jules
Best Answer
Alexandru (Dataiker)
Hi,
If you add individual datasets to build as separate steps, Spark pipelines won't be used, as each step is executed in isolation.
One thing you can try is to set the datasets you don't want to rebuild as part of a recursive build to "explicit rebuild": https://doc.dataiku.com/dss/latest/flow/building-datasets.html#preventing-a-dataset-from-being-built
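If you prefer to script that flag rather than set it in the dataset's settings UI, a sketch with the public Python API could look like this (the flowOptions/rebuildBehavior key names are an assumption, so check the raw settings of a dataset on your DSS version; host, API key, and names are placeholders):

    import dataikuapi

    client = dataikuapi.DSSClient("https://dss.example.com:11200", "YOUR_API_KEY")
    project = client.get_project("MYPROJECT")

    # Mark a dataset as "explicit rebuild" so recursive builds skip it
    # (key names are an assumption; verify on your DSS version).
    settings = project.get_dataset("SOME_DATASET").get_settings()
    settings.get_raw()["flowOptions"]["rebuildBehavior"] = "EXPLICIT_REBUILD"
    settings.save()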
Also, more information on Spark pipelines can be found here:
https://community.dataiku.com/t5/Using-Dataiku/Obtain-the-best-use-of-Spark-Pipeline/m-p/20974
Answers
Alexandru (Dataiker)
This would be done automatically if Spark pipelines are enabled and DSS is able to create a pipeline.
https://doc.dataiku.com/dss/latest/spark/pipelines.html
You can check whether a Spark pipeline has been created by looking at the job launched from the scenario.
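From the public Python API, that check could be sketched like this (the field names in the returned job dicts are assumptions; host, key, and project are placeholders):

    import dataikuapi

    client = dataikuapi.DSSClient("https://dss.example.com:11200", "YOUR_API_KEY")
    project = client.get_project("MYPROJECT")

    # A pipelined run shows one job whose activities cover several
    # recipes, rather than one separate job per recipe.
    for job in project.list_jobs()[:5]:
        print(job["def"]["id"], job["state"])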
Thanks,
julesbertrand (Partner)
Hi Alex,
If the only step in my scenario is "build the last dataset", it does create a Spark pipeline as you said; however, all prior datasets in the flow are rebuilt. I want to choose from where to where my scenario builds datasets. And if I add multiple "build" steps in the scenario (one for each dataset or so), I don't get the Spark pipeline...
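In other words, what I'm after is a single job that covers only the part of the flow I choose. Sketched with the public Python API, it might look like this ("Intermediate" is a placeholder name; host, key, and project too):

    import dataikuapi

    client = dataikuapi.DSSClient("https://dss.example.com:11200", "YOUR_API_KEY")
    project = client.get_project("MYPROJECT")

    # One job with several explicit outputs, instead of one scenario
    # step per dataset: all activities run in a single job, where
    # consecutive Spark recipes can still be merged into a pipeline.
    job = project.new_job("RECURSIVE_FORCED_BUILD")
    job.with_output("Intermediate")  # placeholder dataset name
    job.with_output("Clean")
    job.start_and_wait()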
Do you have any advice?
Thank you
Jules