Want to Stop Rebuilding "Expensive" Parts of your Flow? Explicit Builds are the Answer

Jason · ‎09-27-2022

Use Explicit Recursive Builds to Control Breakpoints

Dataiku’s scenarios are useful for automatically refreshing pipelines since some rebuild options take care of updating any upstream objects that are critical dependencies for downstream analyses.

That said, I don’t always want to (or need to) rebuild absolutely EVERYTHING —- sometimes, basic lookups don’t change, or a sub-flow is quite expensive, either in terms of computational intensity or monetary cost for using a third-party API service.

How, then, can I control where to start and stop propagating a recursive build?

My Challenge: Stop Rebuild Propagation at a Metrics Dataset

While building out a meta-analysis using summarized information and statistics about upstream data to make controlling decisions in a downstream scenario, I ran into a problem where I needed to rebuild my “analysis” section, which was in a separate flow zone.

Caption: Computed metrics from one flow zone are used downstream in another flow zone.

There were many recipes in this flow zone, and for simplicity and reliability, it would be preferable to build it recursively using forced recursive rebuild. As a result, future modifications and additions to the flow would automatically be included, rather than having to remember to update the scenario to include them.

The trouble was that one of the input datasets to this Pre-Analytics flow zone was the metrics data derived from the final step of a separate and computationally “expensive” flow zone. With normal datasets, an advanced setting controls the rebuild behavior for the upstream datasets. However, I wanted the propagation to stop at the metrics, and the metrics have no such option.

How come I can’t set the rebuild behavior to “explicit” for my metrics dataset?

My Aha Moment: Metrics Data are a View Into the Parent Dataset

The reasoning behind this is confusing until you realize that the metrics data are simply a VIEW into the dataset from which they are derived. This means that metrics aren’t ever truly “built”, but instead exist automatically in sync with the dataset from which they come.

Once I realized this, it was simply a matter of setting the rebuild behavior to “explicit” on the metrics parent dataset. Once set, my Pre-Analysis flow zone can reliably be built using the metrics as they presently exist, without propagating beyond the parent dataset.

When it is time to rebuild the data in the upstream “expensive” flow zone, I can also perform this recursively in my scenario by starting the build with the final metrics parent dataset. This works because, as the starting point for a forced rebuild, it is implied that it should propagate data builds upstream — it is, in fact, the very definition of an explicit rebuild.

Additional Resources:

Read more about my original question and the solution here: Recursive Build doesn't stop at metrics

To learn more about dataset-building strategies and behaviors in Dataiku, visit this article in the Knowledge Base: Dataset Building Strategies

Want to Stop Rebuilding "Expensive" Parts of your Flow? Explicit Builds are the Answer

Use Explicit Recursive Builds to Control Breakpoints

How, then, can I control where to start and stop propagating a recursive build?

My Challenge: Stop Rebuild Propagation at a Metrics Dataset

How come I can’t set the rebuild behavior to “explicit” for my metrics dataset?

My Aha Moment: Metrics Data are a View Into the Parent Dataset

Labels

Generating a dropdown in a Dataiku App using the python do function

When reducing the columns in a dataset, identify the columns to keep, not the columns to remove

Automate deployment on API node