Force Rebuild + Build Required Dataset vs Spark Pipeline

Farhan
Farhan Registered Posts: 27 ✭✭✭✭

Assume I have the following six datasets:

source dataset

intermediate 1 dataset

intermediate 2 dataset

intermediate 3 dataset

intermediate 4 dataset

target dataset

These are the steps I want to set up in my scenario:

1. Force rebuild intermediate 1 dataset

2. "Build required dataset" on intermediate 3 dataset (this will also build intermediate 2)

3. "Build required dataset" on target dataset (this will also build intermediate 4)

Without Spark pipelines, step 1 builds intermediate 1, step 2 builds intermediate 2 and 3, and step 3 builds intermediate 4 and the target dataset.
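
To make the expected (non-pipeline) behavior concrete, here is a minimal toy model in Python. It is only a sketch and not Dataiku's actual dependency engine: it assumes the six datasets form a single linear chain, and the `ToyFlow` class, its methods, and the "newer parent means stale child" rule are all hypothetical names and simplifications introduced for illustration.

```python
# Toy model only: this is NOT how DSS computes dependencies internally.
# Assumption: the six datasets form one linear chain, source -> ... -> target.
CHAIN = [
    "source",
    "intermediate 1",
    "intermediate 2",
    "intermediate 3",
    "intermediate 4",
    "target",
]


class ToyFlow:
    """Hypothetical helper that tracks when each dataset was last built."""

    def __init__(self, chain):
        self.chain = list(chain)
        self.clock = 0
        # Everything starts out built at time 0, i.e. up to date.
        self.built_at = {name: 0 for name in self.chain}

    def _build(self, name):
        self.clock += 1
        self.built_at[name] = self.clock

    def force_rebuild(self, name):
        """Rebuild only `name`, whether or not it is out of date."""
        self._build(name)
        return [name]

    def build_required(self, name):
        """Rebuild every out-of-date dataset on the path up to `name`."""
        built = []
        last = self.chain.index(name)
        for i in range(1, last + 1):
            parent, child = self.chain[i - 1], self.chain[i]
            # A dataset is treated as stale if its parent was built after it.
            if self.built_at[parent] > self.built_at[child]:
                self._build(child)
                built.append(child)
        return built


flow = ToyFlow(CHAIN)
print(flow.force_rebuild("intermediate 1"))   # ['intermediate 1']
print(flow.build_required("intermediate 3"))  # ['intermediate 2', 'intermediate 3']
print(flow.build_required("target"))          # ['intermediate 4', 'target']
```

In this model, step 3 leaves intermediate 3 alone because it is already newer than intermediate 2, which is the behavior I would like to keep when Spark pipelines are enabled.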

But with Spark pipelines enabled, step 3 sometimes rebuilds intermediate 3 dataset as well, which I think is redundant. Is there a way to prevent intermediate 3 dataset from being rebuilt? I have tried disabling 'Can this recipe be merged in an existing recipes pipeline?', but that doesn't seem to work.
One way that I know will work is to set intermediate 3 dataset to explicit build. But if you can suggest another way that doesn't use explicit build, that would be great.


Operating system used: Windows 10

Answers

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,269 Dataiker

    Hi,

    What does the behavior look like now if you have Spark Pipelines enabled and you perform a recursive rebuild of the target dataset?

    Can you try adding the option "Virtualizable in build" on intermediate dataset 3 and see if this yields the behavior you are looking for?

    https://doc.dataiku.com/dss/latest/spark/pipelines.html#configuring-behavior-for-intermediate-datasets

    Thanks,

    Alex

  • Farhan
    Farhan Registered Posts: 27 ✭✭✭✭

    What does the behavior look like now if you have Spark Pipelines enabled and you perform a recursive rebuild of the target dataset?
    Do you mean manually running a forced recursive rebuild by right-clicking the target dataset?

    Can you try adding the option "Virtualizable in build" on intermediate dataset 3 and see if this yields the behavior you are looking for?
    I tried this per your suggestion, but it did not work.
