Scenario Building issues

Solved!
zeno_11
Level 3

Hi,

I have a scenario in which each dataset to be built has its own step. All of these build steps use the "build required datasets" mode.

When I run the scenario, I observe that during step 2, all of the datasets that were built in step 1 are built again.

Should I combine all datasets into a single step and run the scenario?

Thanks.


6 Replies
Manuel
Dataiker Alumni

Hi,

Yes, in the latter case, with all datasets listed in the same step, the smart build will build each required dataset only once.

Perhaps you know this already, but with "Force Rebuild Dataset and Dependencies", you also only need to indicate the datasets at the end of your flow, not every single intermediate dataset.

I hope this helps.

 

 

zeno_11
Level 3
Author

Hey, thanks, that helps.

The reason I can't just put the destination datasets as steps is that the flow is very large, and we have observed that it gets stuck at the "compute dependencies" step for a very long time before it starts building.

Dividing it into multiple steps or intermediate datasets lets us avoid that compute step and also gives us a point from which we can restart the scenario if the need arises.

 

 

Manuel
Dataiker Alumni

Hi,

OK, but then I don't understand how adding all datasets in one step will avoid the "calculating dependencies" time.

Perhaps you can make use of the "rebuild behaviour" options at the dataset level (see attached image). Maybe you could break the build using the "Explicit" option and have multiple steps build the different sections.

I hope this helps.

zeno_11
Level 3
Author

Smart reconstruction checks each dataset and recipe upstream of the selected dataset to see if it has been modified more recently than the selected dataset. Dataiku DSS then rebuilds all impacted datasets down to the selected one.

 

If this is the case, having multiple steps for the same flow will always mean that the preceding datasets are rebuilt, because proceeding left to right, the left dataset will always have a newer modified date than the right.

Is there a gap in my understanding?

Clément_Stenac

Hi,

If you have a flow like: A->B->C->D

and build D in "smart reconstruct" (aka "build required") mode, then DSS will:

* first check "has A been modified since B was built"
   * if yes, then build B, then C, then D
   * else, check "has B been modified since C was built"
      * if yes, then build C, then D
      * else, check "has C been modified since D was built"
         * if yes, then build D
         * else, do nothing
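
The checks above can be sketched in Python. This is only a simplified model of the logic described in this post, not how DSS is implemented: the function name `datasets_to_build` and the `last_changed` timestamps are made up for illustration, and the flow is assumed to be a single linear chain.

```python
from typing import Dict, List


def datasets_to_build(chain: List[str], last_changed: Dict[str, float], target: str) -> List[str]:
    """Return the datasets a "build required" run of `target` would rebuild.

    `chain` lists the datasets from left to right (hypothetical linear flow).
    `last_changed[name]` is when the dataset's content last changed; for a
    built dataset this is assumed to be the time of its last build.
    """
    end = chain.index(target)
    # Walk from the leftmost source toward the target.
    for i in range(end):
        upstream, downstream = chain[i], chain[i + 1]
        # "Has upstream been modified since downstream was built?"
        if last_changed[upstream] > last_changed[downstream]:
            # Everything from the first stale dataset down to the target is rebuilt.
            return chain[i + 1 : end + 1]
    return []  # nothing is out of date


chain = ["A", "B", "C", "D"]

# Everything is up to date: nothing to build.
print(datasets_to_build(chain, {"A": 1, "B": 2, "C": 3, "D": 4}, "D"))  # []

# A changed after B was built: B, C and D are rebuilt.
print(datasets_to_build(chain, {"A": 5, "B": 2, "C": 3, "D": 4}, "D"))  # ['B', 'C', 'D']

# B was rebuilt after C was built: only C and D are rebuilt.
print(datasets_to_build(chain, {"A": 1, "B": 6, "C": 3, "D": 4}, "D"))  # ['C', 'D']
```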

 

So if you do a "smart rebuild" of B and then a "smart rebuild" of C, the second one will not rebuild B again, because A was not modified since B was built.

However, if your problem is that dependency computation is too slow, using multiple steps will only make it worse.
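
To illustrate both points, here is a small simulation based on a simplified model of the checks described above. It is an assumption for illustration, not DSS internals: `smart_build`, `run_steps` and the timestamps are hypothetical. It counts how many "has X been modified since Y was built" checks are made, as a rough stand-in for the cost of dependency computation.

```python
from typing import Dict, List, Tuple

CHAIN = ["A", "B", "C", "D"]  # hypothetical linear flow A->B->C->D


def smart_build(chain: List[str], last_changed: Dict[str, int], target: str) -> Tuple[List[str], int]:
    """Simulate one "build required" step; return (rebuilt datasets, checks made)."""
    end = chain.index(target)
    checks = 0
    for i in range(end):
        checks += 1
        if last_changed[chain[i]] > last_changed[chain[i + 1]]:
            rebuilt = chain[i + 1 : end + 1]
            for name in rebuilt:
                # A rebuilt dataset becomes the most recently changed one.
                last_changed[name] = max(last_changed.values()) + 1
            return rebuilt, checks
    return [], checks


def run_steps(chain: List[str], last_changed: Dict[str, int], targets: List[str]) -> Tuple[List[str], int]:
    """Run one smart build per target, in order, on a copy of the state."""
    state = dict(last_changed)
    all_rebuilt: List[str] = []
    total_checks = 0
    for target in targets:
        rebuilt, checks = smart_build(chain, state, target)
        all_rebuilt += rebuilt
        total_checks += checks
    return all_rebuilt, total_checks


# A was modified (time 10) after B, C and D were last built.
stale = {"A": 10, "B": 1, "C": 2, "D": 3}

multi = run_steps(CHAIN, stale, ["B", "C", "D"])  # one step per dataset
single = run_steps(CHAIN, stale, ["D"])           # one step on the rightmost dataset

print(multi)   # (['B', 'C', 'D'], 6)
print(single)  # (['B', 'C', 'D'], 1)
```

In this model, each dataset is rebuilt exactly once either way, so the step for C does not rebuild B. The multi-step scenario simply repeats the upstream checks in every step.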

We highly recommend that you use a single scenario step that builds the "rightmost" part of your Flow. Yes, computing dependencies can take time in very large flows, especially with partitioning, but it is a price that you will have to pay in any case.

You could technically use multiple scenario steps, each one building a single dataset in "non-recursive" mode (aka "build only this dataset"), one after the other. Beware: these must be separate scenario steps, because you cannot build a succession of datasets in non-recursive mode within the same step. Doing it this way skips dependency computation. But we strongly discourage it, as it is far more resource-consuming and highly error-prone. This is a bad practice.

zeno_11
Level 3
Author

I guess this makes sense. My only concern is that, in the event of a failure, I have no option other than to restart the entire flow, which will rebuild all datasets.
