Getting the best use of Spark pipelines
Hello Dataiku Team,
We are testing the Spark Pipeline feature in our flows. The feature seems amazing and we have some questions:
1- If we have a flow that can run many jobs in parallel, is creating normal recursive jobs better than using a Spark pipeline?
2- How do we tune our flows to make the best use of Spark pipelines?
3- Can we select part of a flow to build using a Spark pipeline, or does it always need to build from the beginning?
Thanks
Best Answer
Keiji Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 52 Dataiker
Hello @Bader,
Thank you for posting your questions about the Spark pipelines feature.
> 1- If we have a flow that can run many jobs in parallel, is creating normal recursive jobs better than using a Spark pipeline?
The Spark pipelines feature improves the performance of a DSS job by merging consecutive recipes in a flow into a single Spark pipeline (Spark job), which avoids reading and writing the intermediate datasets between those recipes.
So, if your DSS flow has many intermediate datasets that do not need to hold actual data, enabling the Spark pipelines feature is likely to be the better option for you.
In addition, you can run multiple Spark pipelines for the execution of your flow. DSS will automatically decide which recipes should be executed within the same Spark pipeline, but you can also control that manually. On a recipe's "Advanced" page, there is a property named `Can this recipe be merged in an existing recipes pipeline?`. Disabling this property prevents the recipe from being merged into an existing Spark pipeline.
So, you can still execute multiple Spark pipelines (multiple Spark jobs) simultaneously while utilizing the Spark pipelines feature by dividing your flow into multiple Spark pipelines rather than executing a single Spark pipeline.
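To illustrate the idea of splitting a flow into several pipelines, here is a minimal Python sketch of a toy model. It is not DSS's actual planner: the `Recipe` class, the `mergeable` flag (standing in for the recipe property above), and `group_into_pipelines` are all hypothetical names, and the flow is assumed to be a simple linear chain.

```python
from dataclasses import dataclass


@dataclass
class Recipe:
    name: str
    # Hypothetical stand-in for the "can be merged in an existing
    # recipes pipeline" property on the recipe's Advanced page.
    mergeable: bool = True


def group_into_pipelines(chain):
    """Group consecutive mergeable recipes; a non-mergeable recipe runs alone.

    Toy model for a linear chain only, not DSS's real grouping logic.
    """
    pipelines = []
    current = []
    for recipe in chain:
        if recipe.mergeable:
            current.append(recipe.name)
        else:
            # Close the pipeline in progress, then isolate this recipe.
            if current:
                pipelines.append(current)
                current = []
            pipelines.append([recipe.name])
    if current:
        pipelines.append(current)
    return pipelines


chain = [
    Recipe("prepare"),
    Recipe("join"),
    Recipe("group", mergeable=False),
    Recipe("window"),
    Recipe("sort"),
]
print(group_into_pipelines(chain))
# [['prepare', 'join'], ['group'], ['window', 'sort']]
```

In this model, disabling merging on one recipe turns a single five-recipe pipeline into three smaller units of work.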
> 2- How do we tune our flows to make the best use of Spark pipelines?
As I mentioned previously, the advantage of the Spark pipelines feature is that the performance of a DSS job can be improved by skipping the read and write processing of intermediate datasets. You can skip the reads and writes of an intermediate dataset by enabling the "Virtualizable in build" property on the dataset's "Settings > Advanced" page.
> 3- Can we select part of a flow to build using a Spark pipeline, or does it always need to build from the beginning?
It will depend on your settings. If you enable the "Virtualizable in build" property for all of the intermediate datasets, your Spark pipelines will always need to access the source datasets. If you have intermediate datasets which hold actual data, your Spark pipelines will be able to start from those intermediate datasets rather than always starting from the source datasets.
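The rule above can be sketched with a small Python toy model. This is only an illustration under stated assumptions, not how DSS computes a build: `Dataset`, `virtualizable`, and `pipeline_start` are hypothetical names, the flow is a linear chain whose first element is the source dataset, and any non-virtualizable intermediate dataset is assumed to be already built and up to date.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    name: str
    # Hypothetical stand-in for the "Virtualizable in build" property.
    virtualizable: bool = False


def pipeline_start(chain, target):
    """Return the dataset a pipeline building `target` would read from.

    Walks upstream from `target` to the nearest dataset that holds actual
    data (not virtualizable), falling back to the source dataset.
    """
    names = [d.name for d in chain]
    idx = names.index(target)
    if idx == 0:
        raise ValueError("the source dataset has nothing upstream")
    for i in range(idx - 1, 0, -1):
        if not chain[i].virtualizable:
            return chain[i].name
    return chain[0].name


mixed = [
    Dataset("source"),
    Dataset("cleaned", virtualizable=True),
    Dataset("joined"),  # holds actual data
    Dataset("aggregated", virtualizable=True),
    Dataset("output"),
]
all_virtual = [
    Dataset("source"),
    Dataset("cleaned", virtualizable=True),
    Dataset("joined", virtualizable=True),
    Dataset("output"),
]
print(pipeline_start(mixed, "output"))        # joined
print(pipeline_start(all_virtual, "output"))  # source
```

Keeping one intermediate dataset materialized therefore trades some read and write cost for the ability to rebuild only the downstream part of the flow.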
Please see the DSS documentation at https://doc.dataiku.com/dss/latest/spark/pipelines.html for the details of the Spark pipelines feature.
I hope this helps. Please let us know if you have any further questions regarding the Spark pipelines feature.
Sincerely,
Keiji, Dataiku Technical Support