Use Explicit Recursive Builds to Control Breakpoints
Dataiku’s scenarios are useful for automatically refreshing pipelines since some rebuild options take care of updating any upstream objects that are critical dependencies for downstream analyses.
That said, I don’t always want to (or need to) rebuild absolutely EVERYTHING —- sometimes, basic lookups don’t change, or a sub-flow is quite expensive, either in terms of computational intensity or monetary cost for using a third-party API service.
How, then, can I control where to start and stop propagating a recursive build?
My Challenge: Stop Rebuild Propagation at a Metrics Dataset
While building out a meta-analysis using summarized information and statistics about upstream data to make controlling decisions in a downstream scenario, I ran into a problem where I needed to rebuild my “analysis” section, which was in a separate flow zone.
Caption: Computed metrics from one flow zone are used downstream in another flow zone.
There were many recipes in this flow zone, and for simplicity and reliability, it would be preferable to build it recursively using forced recursive rebuild. As a result, future modifications and additions to the flow would automatically be included, rather than having to remember to update the scenario to include them.
The trouble was that one of the input datasets to this Pre-Analytics flow zone was the metrics data derived from the final step of a separate and computationally “expensive” flow zone. With normal datasets, an advanced setting controls the rebuild behavior for the upstream datasets. However, I wanted the propagation to stop at the metrics, and the metrics have no such option.
How come I can’t set the rebuild behavior to “explicit” for my metrics dataset?
My Aha Moment: Metrics Data are a View Into the Parent Dataset
The reasoning behind this is confusing until you realize that the metrics data are simply a VIEW into the dataset from which they are derived. This means that metrics aren’t ever truly “built”, but instead exist automatically in sync with the dataset from which they come.
Once I realized this, it was simply a matter of setting the rebuild behavior to “explicit” on the metrics parent dataset. Once set, my Pre-Analysis flow zone can reliably be built using the metrics as they presently exist, without propagating beyond the parent dataset.
When it is time to rebuild the data in the upstream “expensive” flow zone, I can also perform this recursively in my scenario by starting the build with the final metrics parent dataset. This works because, as the starting point for a forced rebuild, it is implied that it should propagate data builds upstream — it is, in fact, the very definition of an explicit rebuild.