Dataiku’s scenarios are useful for automatically refreshing pipelines since some rebuild options take care of updating any upstream objects that are critical dependencies for downstream analyses.
That said, I don’t always want (or need) to rebuild absolutely EVERYTHING. Sometimes basic lookups don’t change, or a sub-flow is quite expensive, whether in computational intensity or in the monetary cost of a third-party API service.
While building a meta-analysis that uses summarized information and statistics about upstream data to drive decisions in a downstream scenario, I ran into a problem: I needed to rebuild my “analysis” section, which lived in a separate flow zone.
Caption: Computed metrics from one flow zone are used downstream in another flow zone.
This flow zone contained many recipes, so for simplicity and reliability it was preferable to build it recursively using a forced recursive rebuild. That way, future modifications and additions to the flow would be included automatically, without my having to remember to update the scenario to include them.
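As a rough sketch, such a scenario step can be a one-liner in a Python step; the dataset name here is hypothetical, with “analysis_final” standing in for the last dataset of the zone:

```python
from dataiku.scenario import Scenario

# Force-rebuild the whole zone by recursively targeting its final dataset;
# recipes added upstream later are picked up automatically.
Scenario().build_dataset("analysis_final", build_mode="RECURSIVE_FORCED_BUILD")
```

A regular “Build” scenario step set to forced recursive mode accomplishes the same thing without code.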
The trouble was that one of the input datasets to this Pre-Analysis flow zone was the metrics data derived from the final step of a separate, computationally “expensive” flow zone. With normal datasets, an advanced setting controls the rebuild behavior of upstream datasets; I wanted the propagation to stop at the metrics, but metrics have no such option.
The reasoning behind this is confusing until you realize that the metrics data are simply a VIEW into the dataset from which they are derived. This means that metrics aren’t ever truly “built”, but instead exist automatically in sync with the dataset from which they come.
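One way to see this in practice: metric values are fetched back rather than built. Here is a minimal sketch with the public Python API, assuming a hypothetical dataset name and the built-in record-count metric id:

```python
import dataiku

client = dataiku.api_client()
dataset = client.get_default_project().get_dataset("expensive_output")  # hypothetical name

# Metrics are never "built": the last values computed on the parent dataset
# are simply read back, like a view over that dataset.
metrics = dataset.get_last_metric_values()
print(metrics.get_metric_by_id("records:COUNT_RECORDS"))  # assumed metric id
```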
Once I realized this, it was simply a matter of setting the rebuild behavior to “explicit” on the metrics’ parent dataset. With that in place, my Pre-Analysis flow zone can be built reliably using the metrics as they presently exist, without propagating beyond the parent dataset.
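For what it’s worth, the same advanced setting can also be applied programmatically. This is only a sketch: it assumes the raw dataset settings expose a flowOptions.rebuildBehavior field, and the dataset name is hypothetical:

```python
import dataiku

client = dataiku.api_client()
dataset = client.get_default_project().get_dataset("expensive_output")  # hypothetical name

# Assumed raw-settings field: flowOptions.rebuildBehavior. "EXPLICIT" stops
# recursive builds from propagating past this dataset unless it is itself
# the target of the build.
settings = dataset.get_settings()
settings.get_raw()["flowOptions"]["rebuildBehavior"] = "EXPLICIT"
settings.save()
```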
When it is time to rebuild the data in the upstream “expensive” flow zone, I can also do that recursively in my scenario by starting the build at the final metrics parent dataset. This works because the parent dataset is now the explicit target of the forced build, so the build still propagates upstream from it; that is, in fact, the very definition of an explicit rebuild.
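In scenario terms, that can be a separate build step (again with a hypothetical dataset name) that runs only when the expensive data should actually refresh:

```python
from dataiku.scenario import Scenario

# Because "expensive_output" is the explicit target here, the forced build
# propagates upstream through the expensive flow zone, even though its
# rebuild behavior is set to Explicit.
Scenario().build_dataset("expensive_output", build_mode="RECURSIVE_FORCED_BUILD")
```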
Read more about my original question and the solution here:
To learn more about dataset-building strategies and behaviors in Dataiku, visit this article in the Knowledge Base: Dataset Building Strategies
Thank you for your notes about explicit rebuilds. Over the past week or so I've been building an approach to do a partial rebuild of a dataset.
I have a large-ish dataset whose records change only fairly slowly. I'm trying to implement an add, delete, and modify update cycle, and the explicit rebuild option is definitely helpful for this.
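In case it helps anyone, here is a rough sketch of the kind of update cycle I mean, run as a scenario Python step (where reading and then rewriting the same dataset is allowed). The dataset names, the "id" key, and the "change_type" column are all made up for illustration:

```python
import dataiku
import pandas as pd

# "registry" is the slowly-changing target; "changes" is the latest batch,
# with a change_type column of add / modify / delete and an "id" key column.
current = dataiku.Dataset("registry").get_dataframe()
changes = dataiku.Dataset("changes").get_dataframe()

# Remove rows being deleted or replaced, then append the new/updated versions.
stale = changes.loc[changes["change_type"].isin(["delete", "modify"]), "id"]
kept = current[~current["id"].isin(stale)]
fresh = changes.loc[changes["change_type"] != "delete"].drop(columns="change_type")

dataiku.Dataset("registry").write_with_schema(pd.concat([kept, fresh], ignore_index=True))
```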
However, has anyone worked out a good strategy for partial rebuilds of very slow upstream resources? In my case, the upstream resource, which I have no control over, takes on the order of 1-2 seconds per record.
This may be a place for partitioned datasets. However, I'm not clear whether, once written, a particular partition can be updated.
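If partition rebuilds do overwrite in place, which is my understanding of forced builds, a scenario step might look like this; the dataset name and partition identifier are hypothetical:

```python
from dataiku.scenario import Scenario

# Rebuild a single partition: the forced build replaces that partition's
# contents while leaving the other partitions untouched.
Scenario().build_dataset("slow_resource_copy",
                         partitions="2022-11-10",
                         build_mode="NON_RECURSIVE_FORCED_BUILD")
```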