
Force Rebuild + Build Required Dataset vs Spark Pipeline

farhanromli
Level 3

Assume I have the following 6 datasets:

source dataset

intermediate 1 dataset

intermediate 2 dataset

intermediate 3 dataset 

intermediate 4 dataset 

target dataset 

These are the steps I want to set up in my scenario:

1. Force rebuild the intermediate 1 dataset

2. Build required datasets for the intermediate 3 dataset (this will also build intermediate 2)

3. Build required datasets for the target dataset (this will also build intermediate 4)

Without Spark pipelines, step #1 builds intermediate 1, step #2 builds intermediate 2 and 3, and step #3 builds intermediate 4 and the target dataset.
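
For reference, the expected behavior without Spark pipelines can be sketched as a toy model in Python. This is not Dataiku's build engine or API: the `ToyFlow` class, its method names, and the staleness rule (building a dataset marks everything downstream as out of date) are assumptions made for illustration only, and the flow is assumed to be a single linear chain.

```python
# Toy model of the scenario steps; NOT Dataiku's actual build logic.
# Assumption: the six datasets form one linear chain, and building a
# dataset makes every dataset downstream of it out of date.
CHAIN = [
    "source",
    "intermediate_1",
    "intermediate_2",
    "intermediate_3",
    "intermediate_4",
    "target",
]


class ToyFlow:
    def __init__(self, chain):
        self.chain = list(chain)
        self.stale = set()  # every dataset starts up to date

    def _build(self, name):
        # Building a dataset makes it up to date and its descendants stale.
        self.stale.discard(name)
        idx = self.chain.index(name)
        self.stale.update(self.chain[idx + 1:])

    def force_rebuild(self, name):
        # Rebuild only this dataset, regardless of its state.
        self._build(name)
        return [name]

    def build_required(self, name):
        # Build every out-of-date dataset up to and including `name`.
        # The source dataset (index 0) is never built.
        idx = self.chain.index(name)
        built = []
        for ds in self.chain[1:idx + 1]:
            if ds in self.stale:
                self._build(ds)
                built.append(ds)
        return built


flow = ToyFlow(CHAIN)
step1 = flow.force_rebuild("intermediate_1")
step2 = flow.build_required("intermediate_3")
step3 = flow.build_required("target")
print(step1, step2, step3)
```

In this model, step #3 skips intermediate 3 because step #2 already brought it up to date, which is the behavior I would like to keep when Spark pipelines are enabled.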

With Spark pipelines enabled, however, step #3 sometimes rebuilds the intermediate 3 dataset as well, which I think is redundant. Is there a way to prevent the intermediate 3 dataset from being rebuilt? I have tried disabling 'Can this recipe be merged in an existing recipes pipeline?', but that doesn't seem to work.
One way that I know will work is to set the intermediate 3 dataset to explicit build. But if you can suggest another way that doesn't use explicit build, that would be great.


Operating system used: Windows 10

2 Replies
AlexT
Dataiker

Hi,

What does the behavior look like now if you have Spark Pipelines enabled and you perform a recursive rebuild of the target dataset?

Can you try enabling the "Virtualizable in build" option on the intermediate 3 dataset and see whether this yields the behavior you are looking for?

https://doc.dataiku.com/dss/latest/spark/pipelines.html#configuring-behavior-for-intermediate-datase...

Thanks,

Alex

farhanromli
Level 3
Author

"What does the behavior look like now if you have Spark Pipelines enabled and you perform a recursive rebuild of the target dataset?"
Do you mean manually running a forced recursive rebuild by right-clicking the target dataset?

"Can you try enabling the "Virtualizable in build" option on the intermediate 3 dataset and see whether this yields the behavior you are looking for?"
I tried this per your suggestion, but it did not work.
