Want to Stop Rebuilding "Expensive" Parts of your Flow? Explicit Builds are the Answer

Jason · Registered · Posts: 29

Use Explicit Recursive Builds to Control Breakpoints

Dataiku’s scenarios are useful for automatically refreshing pipelines since some rebuild options take care of updating any upstream objects that are critical dependencies for downstream analyses.
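
For context, these rebuilds can be triggered from a scenario’s custom Python step. Here is a minimal sketch, assuming a hypothetical final dataset named analysis_output; the build_mode values come from the scenario API:

```python
from dataiku.scenario import Scenario

# A "smart" recursive build: only upstream objects that are out of date
# are recomputed before the target itself is rebuilt.
# "analysis_output" is a hypothetical dataset name.
Scenario().build_dataset("analysis_output", build_mode="RECURSIVE_BUILD")
```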

That said, I don’t always want to (or need to) rebuild absolutely EVERYTHING; sometimes, basic lookups don’t change, or a sub-flow is quite expensive, whether in computational intensity or in the monetary cost of a third-party API service.

How, then, can I control where to start and stop propagating a recursive build?

My Challenge: Stop Rebuild Propagation at a Metrics Dataset

While building out a meta-analysis that uses summarized information and statistics about upstream data to drive decisions in a downstream scenario, I ran into a problem: I needed to rebuild my “analysis” section, which lived in a separate flow zone.

Caption: Computed metrics from one flow zone are used downstream in another flow zone.

This flow zone contained many recipes, so for simplicity and reliability it was preferable to build it recursively with a forced recursive rebuild. That way, future modifications and additions to the flow would be picked up automatically, rather than my having to remember to update the scenario to include them.
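
In a scenario’s Python step, that forced recursive build looks roughly like the sketch below; the dataset name is a placeholder for the final dataset of the flow zone:

```python
from dataiku.scenario import Scenario

# Force-rebuild the target and everything upstream of it, whether or not
# anything is out of date. "pre_analytics_final" is a placeholder name.
Scenario().build_dataset("pre_analytics_final", build_mode="RECURSIVE_FORCED_BUILD")
```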

The trouble was that one of the input datasets to this Pre-Analytics flow zone was the metrics data derived from the final step of a separate, computationally “expensive” flow zone. With a normal dataset, an advanced setting controls how it behaves during a recursive build, so propagation can be stopped there. I wanted the propagation to stop at the metrics, however, and the metrics have no such option.

How come I can’t set the rebuild behavior to “explicit” for my metrics dataset?

My Aha Moment: Metrics Data are a View Into the Parent Dataset

The reasoning behind this is confusing until you realize that the metrics data are simply a VIEW into the dataset from which they are derived. This means that metrics aren’t ever truly “built”; instead, they exist automatically, always in sync with their parent dataset.
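
You can see this from code as well: metric values are read off the parent dataset, not built as a separate object. A minimal sketch, assuming a hypothetical parent dataset named expensive_final and the built-in record-count metric:

```python
import dataiku

# Metrics live on the parent dataset; reading the last computed values
# does not trigger (or require) any build.
parent = dataiku.Dataset("expensive_final")  # hypothetical name
metrics = parent.get_last_metric_values()
print(metrics.get_metric_by_id("records:COUNT_RECORDS"))
```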


Once I realized this, it was simply a matter of setting the rebuild behavior to “explicit” on the metrics’ parent dataset. With that in place, my Pre-Analytics flow zone can reliably be built using the metrics as they presently exist, without the build propagating beyond the parent dataset.
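
If you’d rather script this than click through the dataset’s Advanced settings, something like the following should work through the public API. This is a sketch under assumptions: the host, API key, project, and dataset names are placeholders, and the location of the rebuild-behavior flag in the raw settings may differ by DSS version, so verify it on your instance:

```python
import dataikuapi

client = dataikuapi.DSSClient("https://dss.example.com", "YOUR_API_KEY")  # placeholders
project = client.get_project("MY_PROJECT")

settings = project.get_dataset("expensive_final").get_settings()
# "EXPLICIT_REBUILD" corresponds to the "Explicit" rebuild behavior in the UI
# (assumed field path; inspect settings.get_raw() on your instance to confirm).
settings.get_raw()["flowOptions"]["rebuildBehavior"] = "EXPLICIT_REBUILD"
settings.save()
```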

When it is time to rebuild the data in the upstream “expensive” flow zone, I can do that recursively in my scenario as well, by starting the build at the final metrics parent dataset. This works because a dataset set to “explicit” is still rebuilt when it is itself the explicit target of a build; being rebuilt only on direct request is, in fact, the very definition of an explicit rebuild. From there, the forced recursive build propagates upstream as usual.
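
In the scenario, that upstream refresh is just another build step that names the parent dataset directly (same placeholder name as above):

```python
from dataiku.scenario import Scenario

# Because "expensive_final" is the explicit target of this step, its
# "explicit" rebuild behavior does not block the build, and the forced
# recursive mode propagates upstream through the expensive flow zone.
Scenario().build_dataset("expensive_final", build_mode="RECURSIVE_FORCED_BUILD")
```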

Additional Resources

Read more about my original question and the solution here:

To learn more about dataset-building strategies and behaviors in Dataiku, visit this article in the Knowledge Base: Dataset Building Strategies

Answers

  • tgb417 · Neuron · Posts: 1,595

    @Jason

    Thank you for your notes about explicit rebuilds. Over the past week or so I've been building an approach to do a partial rebuild of a dataset.

    I have a large-ish dataset whose records change fairly slowly. I'm trying to implement an add, delete, and modify update cycle. The explicit rebuild option is definitely helpful for this.

    However, has anyone worked out a good strategy for partial rebuilds of very slow upstream resources? In my case, the upstream resource, which I have no control over, takes on the order of 1–2 seconds per record.

    This may be a place for partitioned datasets. However, I'm not clear whether a particular partition can be updated once it has been written.
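
    For what it's worth, a scenario build step can target specific partitions, and in my experience rebuilding a partition replaces just that partition's data while leaving the others untouched. A minimal sketch, assuming a hypothetical date-partitioned dataset:

    ```python
    from dataiku.scenario import Scenario

    # Rebuild only the named partitions; other partitions are left as-is.
    # "slow_source_copy" and the partition ids are hypothetical.
    Scenario().build_dataset(
        "slow_source_copy",
        partitions="2023-01-01,2023-01-02",
        build_mode="RECURSIVE_FORCED_BUILD",
    )
    ```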
