Force Rebuild + Build Required Dataset vs Spark Pipeline

Farhan · August 2022

Assume I have the following 6 datasets:

source dataset

intermediate 1 dataset

intermediate 2 dataset

intermediate 3 dataset

intermediate 4 dataset

target dataset

These are the steps I want to set up in my scenario:

1. Force rebuild intermediate 1 dataset

2. Build required dataset on intermediate 3 (this will also build intermediate 2)

3. Build required dataset on the target dataset (this will also build intermediate 4)

Without Spark pipelines, step #1 builds intermediate 1, step #2 builds intermediate 2 and 3, and step #3 builds intermediate 4 and the target dataset.
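To make the expected behavior concrete, here is a small toy model in Python. It is only a sketch: it assumes the six datasets form a single linear chain and that "build required" rebuilds a dataset only when its parent was built more recently. The class `ToyFlow` and its methods are hypothetical names for illustration and are not part of the Dataiku API or its actual scheduling logic.

```python
# Toy model only: this is NOT Dataiku's scheduler, just an illustration
# of the behavior expected without Spark pipelines on a linear flow.
CHAIN = [
    "source",
    "intermediate 1",
    "intermediate 2",
    "intermediate 3",
    "intermediate 4",
    "target",
]


class ToyFlow:
    """Hypothetical linear flow that tracks when each dataset was last built."""

    def __init__(self, chain):
        self.chain = list(chain)
        self._clock = 0
        # Every dataset starts as built at time 0, i.e. the flow is up to date.
        self.built_at = {name: 0 for name in self.chain}

    def _rebuild(self, name):
        self._clock += 1
        self.built_at[name] = self._clock

    def force_rebuild(self, name):
        """Rebuild exactly one dataset, whether or not it is stale."""
        self._rebuild(name)
        return [name]

    def build_required(self, name):
        """Walk from the source towards `name`, rebuilding only stale datasets."""
        rebuilt = []
        end = self.chain.index(name)
        for i in range(1, end + 1):
            parent, child = self.chain[i - 1], self.chain[i]
            # A dataset is stale when its parent was built after it.
            if self.built_at[parent] > self.built_at[child]:
                self._rebuild(child)
                rebuilt.append(child)
        return rebuilt


flow = ToyFlow(CHAIN)
step1 = flow.force_rebuild("intermediate 1")
step2 = flow.build_required("intermediate 3")
step3 = flow.build_required("target")
print(step1)
print(step2)
print(step3)
```

In this model, step #3 leaves intermediate 3 alone because nothing upstream of it changed after step #2, which is the behavior I would like to keep when Spark pipelines are enabled.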

But with Spark pipelines enabled, step #3 sometimes rebuilds intermediate 3 as well, which I think is redundant. Is there a way to prevent intermediate 3 from being rebuilt? I have tried disabling 'Can this recipe be merged in an existing recipes pipeline?', but that doesn't seem to work.
One way that I know will work is to set intermediate 3 as an explicit build. But if you can suggest another way that does not use explicit build, that would be great.

Operating system used: Windows 10

Alexandru · August 2022

Hi,

What does the behavior look like now if you have Spark pipelines enabled and you perform a recursive rebuild of the target dataset?

Can you try enabling the option "Virtualizable in build" on intermediate dataset 3 and see if this yields the behavior you are looking for?

https://doc.dataiku.com/dss/latest/spark/pipelines.html#configuring-behavior-for-intermediate-datasets

Thanks,

Alex

Farhan · August 2022

"What does the behavior look like now if you have Spark Pipelines enabled and you perform a recursive rebuild of the target dataset"
Do you mean manually doing a forced recursive rebuild by right-clicking on the target dataset?

"Can you try adding the option "Virtualizable in build" on intermediate dataset 3 and see if this yields the behavior you are looking for?"
I tried this per your suggestion, but it did not work.
