Scenario & Flow using prior results to modify scope of the current updates

tgb417

I have a flow based on the local file system that uses shell recipes to find and evaluate files. The second evaluation process is particularly network- and compute-expensive: going through all of the data currently takes on the order of 7 hours. I'm interested in establishing a method for incremental updates. I'd prefer not to code this all in a single Python recipe.

I'd like to use shell scripts to do the file identification portion of the process every day, looking for moved and changed files, and then run the expensive file evaluation section of the flow only for files that have moved or changed. In theory, this would save a lot of time on each run of my process.

Is there a way to feed a resulting dataset about these files back into the shell script that does the computationally complex part of the process? That is, if a file has already been evaluated and has neither moved nor changed in size or modification date since the last full evaluation, just skip it this time.
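
To make the comparison concrete, here is a minimal sketch of the skip logic in plain Python. It is only an illustration of the idea, not a DSS recipe: the record layout (`path`, size, modification time), the function name `files_needing_evaluation`, and the sample paths are all hypothetical, and it assumes that a moved file simply shows up under a path that is absent from the prior results.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical record layout: (path, size_in_bytes, modification_time)
FileRecord = Tuple[str, int, float]


def files_needing_evaluation(
    current: Iterable[FileRecord],
    previous: Iterable[FileRecord],
) -> List[str]:
    """Return paths that are new, moved, or changed since the last evaluation."""
    # Index the prior results by path so each lookup is a dictionary access.
    seen: Dict[str, Tuple[int, float]] = {
        path: (size, mtime) for path, size, mtime in previous
    }
    todo = []
    for path, size, mtime in current:
        # A moved file appears under a path missing from the prior results;
        # a changed file has a different size or modification time.
        if seen.get(path) != (size, mtime):
            todo.append(path)
    return todo


# Made-up sample data, for illustration only.
previous_run = [
    ("/data/a.bin", 100, 1700000000.0),
    ("/data/b.bin", 200, 1700000100.0),
]
todays_scan = [
    ("/data/a.bin", 100, 1700000000.0),        # unchanged: skipped
    ("/data/b.bin", 250, 1700009999.0),        # size and mtime changed
    ("/data/moved/c.bin", 300, 1700000200.0),  # new or moved path
]
print(files_needing_evaluation(todays_scan, previous_run))
# → ['/data/b.bin', '/data/moved/c.bin']
```

Only the paths in the returned list would then be handed to the expensive evaluation step; everything else keeps its result from the earlier run.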

The challenge, I think, is getting the end results of a flow fed back into the middle of the flow for the comparison phase. (This would in effect put a loop into the flow.) My understanding is that this is not allowed in a DSS flow. Is that correct?

Any thoughts about this?


Operating system used: Mac OS 10.15.7
