Recommendation with build options to control rebuild propagation

me2 · August 2023

I have a flow that doesn't need all datasets rebuilt and would love to get your recommendation on how to implement but also keep it simple.

I want to build and launch Flow A with a scenario X. Normally I add a Build scenario step with Build mode: "force-rebuild dataset and dependencies".

Flow A uses a dataset from Flow B, which is in the same project. Because of the time it takes to rebuild and the changes it causes in the data output, I don't want scenario X to rebuild Flow B. I only want to use the dataset as-is from the last time it was built; it will be rebuilt by another scenario at a lower frequency.

Flow A also uses a dataset from Flow C, which is in another project. For the same reasons, I don't want scenario X to rebuild Flow C. I only want to use the dataset as-is from the last time it was built; it will be rebuilt by another scenario at a lower frequency.

For the dataset from Flow B, I thought of a way to implement this using a Sync and the "Flow - Rebuild behavior" setting, but I can't help thinking there might be an easier way.
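To make the Sync plus rebuild-behavior idea concrete, here is a toy model in plain Python of how a forced recursive build might walk upstream. It is only a sketch: it does not call Dataiku, the dataset names (`b_out_copy`, `a_prepared`, and so on) are made up, and it assumes that a dataset marked with an explicit rebuild behavior is used as-is and stops the upstream traversal.

```python
from typing import Dict, List, Set


def datasets_to_rebuild(
    target: str,
    inputs: Dict[str, List[str]],
    explicit: Set[str],
) -> Set[str]:
    """Toy model: which datasets a forced recursive build of `target` rebuilds.

    `inputs` maps each built dataset to the datasets its recipe reads.
    Datasets missing from `inputs` are treated as sources (nothing to build).
    Datasets in `explicit` are assumed to be used as-is and to block the
    traversal, unless they are the requested target themselves.
    """
    rebuilt: Set[str] = set()
    stack = [target]
    while stack:
        ds = stack.pop()
        if ds in rebuilt:
            continue
        parents = inputs.get(ds, [])
        if not parents:
            continue  # source dataset: no recipe, nothing to rebuild
        if ds != target and ds in explicit:
            continue  # explicit rebuild: keep the last build, stop walking upstream
        rebuilt.add(ds)
        stack.extend(parents)
    return rebuilt


# Hypothetical flow: Flow B feeds a synced copy that Flow A consumes.
flow = {
    "b_mid": ["b_source"],
    "b_out": ["b_mid"],
    "b_out_copy": ["b_out"],  # output of the Sync
    "a_prepared": ["a_source", "b_out_copy"],
    "a_final": ["a_prepared"],
}

with_guard = datasets_to_rebuild("a_final", flow, explicit={"b_out_copy"})
without_guard = datasets_to_rebuild("a_final", flow, explicit=set())
print(sorted(with_guard))
print(sorted(without_guard))
```

In this model, marking only the synced copy as explicit keeps scenario X inside Flow A, while the slower scenario can still build `b_out` and `b_out_copy` by targeting them directly.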

Another option is to split Flow A in the scenario into A and A'. Use a Build scenario step with Build mode: "force-rebuild dataset and dependencies" for A, then add another Build scenario step with Build mode: "build only this dataset" for A'. The challenge with that option is that I might have additional steps after A' that will require additional scenario steps to finish building.
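If the A / A' split were done in a single custom Python scenario step instead of several Build steps, it might look roughly like the sketch below. The dataset names are hypothetical. The build mode strings are the ones I believe `dataiku.scenario.Scenario.build_dataset` accepts for "force-rebuild dataset and dependencies" and "build only this dataset", so check them against the API docs for your DSS version. The scenario object is passed in as a parameter so the ordering can be tried outside DSS with a stand-in.

```python
from typing import List, Tuple


def run_scenario_x(scenario, upstream_target: str, remaining_targets: List[str]) -> None:
    """Build part A recursively, then A' and everything after it one by one.

    `scenario` is expected to expose build_dataset(name, build_mode=...).
    Inside DSS this would be dataiku.scenario.Scenario() (assumption).
    """
    # Part A: forced recursive build of the branch that does not depend on Flow B.
    scenario.build_dataset(upstream_target, build_mode="RECURSIVE_FORCED_BUILD")

    # A' and later datasets: build each one alone, in flow order, so the
    # build never walks back up into Flow B.
    for name in remaining_targets:
        scenario.build_dataset(name, build_mode="NON_RECURSIVE_FORCED_BUILD")


class RecordingScenario:
    """Stand-in for the real scenario object; it only records the calls."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, str]] = []

    def build_dataset(self, dataset_name: str, build_mode: str = "RECURSIVE_BUILD") -> None:
        self.calls.append((dataset_name, build_mode))


# Hypothetical names: "a_upstream_output" ends part A, "a_joined" is A'
# (the first dataset that reads Flow B's output), "a_final" comes after it.
recorder = RecordingScenario()
run_scenario_x(recorder, "a_upstream_output", ["a_joined", "a_final"])
for call in recorder.calls:
    print(call)
```

This also shows the drawback mentioned above: every dataset after A' has to be listed explicitly and in the right order, so the list needs maintenance whenever the flow changes.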

Also, I can't use "build sections" for Flow B for two reasons: 1) how I currently use Flow Zones might cause other datasets I want built to not get built, and 2) our current version of Dataiku doesn't support building sections.

For Flow C, since the dataset can only be built from its source project, I just have to link to the output dataset.

Is there a recommendation on how to use the datasets from Flows B and C in Flow A? What are the limitations?

I found a great article that has helped me.

concept-dataset-building-strategies

Operating system used: Windows

Turribeach · August 2023

In my view the best option will be to update to v12 and use the new build flow zones capability:

https://doc.dataiku.com/dss/latest/release_notes/12.html#build-flow-zones

Failing that, use the quickest solution. It's not worth investing time in this: v12 solves it properly, so whatever you do now you will throw away when you move to v12.

me2 · August 2023

Thank you for the reply. You are right, the best solution for Flow B is to utilize the version 12.0+ features for building in zones.

Since our upgrade to V12 is a few weeks away, I am going to implement a solution with a sync to copy the dataset from Flow B, then use the rebuild behavior options plus a unique scenario step to prevent the entire Flow B from being rebuilt.

Once we upgrade to V12, I will implement the zone build.
