Spark pipeline in scenario
Hi all,
I am trying to use a Spark pipeline between two datasets, which we will call Raw and Clean. I have multiple Spark SQL recipes to build Clean from Raw, and I want to execute all of them in a scenario.
To save time and memory, I want to pipeline them, but I can't find a way to do it in a scenario. To be clear, I want to execute all recipes from Raw to Clean in a single Spark pipeline and be able to schedule this job. Is there a way to do so?
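For reference, here is a rough sketch of the kind of build I'd like to schedule, written as a custom Python step in a scenario (assuming the dataiku scenario API; "Clean" is the dataset name from above):

    # Custom Python step in a DSS scenario (a sketch).
    from dataiku.scenario import Scenario

    # Build "Clean" recursively: all upstream Spark SQL recipes run in
    # one job, where DSS can merge them into a single Spark pipeline.
    scenario = Scenario()
    scenario.build_dataset("Clean", build_mode="RECURSIVE_FORCED_BUILD")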
Thank you
Jules
Best Answer
Alexandru (Dataiker)
Hi,
If you add individual datasets to build as separate steps, Spark pipelines won't be used, as each step is executed in isolation.
One thing you can try is to set the datasets you don't want to rebuild as part of a recursive build to "explicit rebuild": https://doc.dataiku.com/dss/latest/flow/building-datasets.html#preventing-a-dataset-from-being-built
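If you prefer to script that flag rather than set it in the dataset's settings UI, a sketch with the public Python API could look like this (the flowOptions/rebuildBehavior key names are an assumption, so check the raw settings of a dataset on your DSS version; host, API key, and names are placeholders):

    import dataikuapi

    client = dataikuapi.DSSClient("https://dss.example.com:11200", "YOUR_API_KEY")
    project = client.get_project("MYPROJECT")

    # Mark a dataset as "explicit rebuild" so recursive builds skip it
    # (key names are an assumption; verify on your DSS version).
    settings = project.get_dataset("SOME_DATASET").get_settings()
    settings.get_raw()["flowOptions"]["rebuildBehavior"] = "EXPLICIT_REBUILD"
    settings.save()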
Also, more information on Spark pipelines can be found here:
https://community.dataiku.com/t5/Using-Dataiku/Obtain-the-best-use-of-Spark-Pipeline/m-p/20974
Answers
Alexandru (Dataiker)
This would be done automatically if Spark pipelines are enabled and DSS is able to create a pipeline.
https://doc.dataiku.com/dss/latest/spark/pipelines.html
You can check whether a Spark pipeline has been created by looking at the job launched from the scenario.
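From the public Python API, that check could be sketched like this (the field names in the returned job dicts are assumptions; host, key, and project are placeholders):

    import dataikuapi

    client = dataikuapi.DSSClient("https://dss.example.com:11200", "YOUR_API_KEY")
    project = client.get_project("MYPROJECT")

    # A pipelined run shows one job whose activities cover several
    # recipes, rather than one separate job per recipe.
    for job in project.list_jobs()[:5]:
        print(job["def"]["id"], job["state"])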
Thanks,
julesbertrand (Partner)
Hi Alex,
If the only step in my scenario is "build the last dataset", it does create a Spark pipeline as you said; however, all prior datasets in the flow are rebuilt. I want to choose from where to where my scenario builds datasets. And if I add multiple "build" steps in the scenario (one for each dataset or so), I don't get the Spark pipeline...
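In other words, what I'm after is a single job that covers only the part of the flow I choose. Sketched with the public Python API, it might look like this ("Intermediate" is a placeholder name; host, key, and project too):

    import dataikuapi

    client = dataikuapi.DSSClient("https://dss.example.com:11200", "YOUR_API_KEY")
    project = client.get_project("MYPROJECT")

    # One job with several explicit outputs, instead of one scenario
    # step per dataset: all activities run in a single job, where
    # consecutive Spark recipes can still be merged into a pipeline.
    job = project.new_job("RECURSIVE_FORCED_BUILD")
    job.with_output("Intermediate")  # placeholder dataset name
    job.with_output("Clean")
    job.start_and_wait()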
Do you have any advice?
Thank you
Jules