
Rebuild behaviour vs excluding part of flow from build all

MRvLuijpen
Neuron

My question is whether, and how, I can exclude the first part of the flow from a build-all (without, of course, explicitly adding every dataset to be built to the scenario).

I have the following situation. Because of company policy, all data source files are automatically removed every night. (Company policy is that all data should be located inside SQL servers and not in external files).

My project flow starts from three different source files, namely:

  1. Daily updated file (direct file upload)
  2. Weekly updated file (located in folder)
  3. Monthly updated file (located in folder)

My idea was to sync all three files to a SQL connection.

I want a scenario that runs daily and calculates the complete flow (dataset Result in the attached project flow).
How can I set this up to handle the following situations?

  1. If no files are available (during the weekend), the flow in zone Daily-Calculations should still be calculated.
  2. During the week, I upload the 'new' daily file; this should be synced to the daily_file_sql dataset, and afterwards the zone Daily-Calculations should be calculated.
  3. Once a week, I also upload the weekly file; this should be synced to the weekly_file_sql dataset, and afterwards the zone Daily-Calculations should be calculated.
  4. Once a month, all three files are uploaded and synced, and afterwards the zone Daily-Calculations should be calculated.

I experimented with the advanced rebuild behaviour setting, trying both "explicit" and "write-protected", but when the files are not present the build-all still fails.

 

I hope this is clear.

 

3 Replies
MRvLuijpen
Neuron
Author
One option, of course, is to split the flow into two different projects instead of two zones.
fchataigner2
Dataiker

Hi,

The SQL Server datasets right after the Sync recipes should be the ones with the rebuild behavior set to "explicit". Their rebuild should then be triggered by scenarios with "on dataset change" triggers that listen for changes in the input folders or uploaded-files datasets.

Or, as suggested in the other reply, the first zone can be moved to a separate project, with the SQL Server datasets after the Sync recipes exposed to the project containing the daily flow.
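
To make the first option concrete, here is a minimal Python sketch of the decision logic only: sync a SQL dataset when its source file has arrived, and build Result in every case. It is not DSS code. The names `daily_file_sql`, `weekly_file_sql` and `Result` come from the question; `monthly_file_sql`, `SYNC_TARGETS` and `plan_builds` are hypothetical names used for illustration.

```python
from typing import Dict, List

# Source file -> SQL dataset written by its Sync recipe.
# daily_file_sql and weekly_file_sql are named in the question;
# monthly_file_sql is an assumed name for the third one.
SYNC_TARGETS: Dict[str, str] = {
    "daily": "daily_file_sql",
    "weekly": "weekly_file_sql",
    "monthly": "monthly_file_sql",
}

# Final dataset of the Daily-Calculations zone, as named in the question.
FINAL_DATASET = "Result"


def plan_builds(available: Dict[str, bool]) -> List[str]:
    """Return the datasets to build, in order, for one daily run.

    A SQL dataset is rebuilt only when its source file is present;
    the final dataset is always rebuilt, even when no file arrived.
    """
    plan = [
        target
        for source, target in SYNC_TARGETS.items()
        if available.get(source, False)
    ]
    plan.append(FINAL_DATASET)
    return plan


if __name__ == "__main__":
    # Weekend: no files uploaded, only the daily calculations run.
    print(plan_builds({}))
    # Weekday with a weekly upload: two syncs, then the calculations.
    print(plan_builds({"daily": True, "weekly": True}))
```

In DSS itself the "file is present" check would be replaced by the "on dataset change" triggers described above, and with the SQL datasets set to "explicit" a recursive build of Result should stop at them rather than reaching back to the missing files. If you prefer a single custom Python scenario step, each name in the plan could probably be passed to `build_dataset` on a `dataiku.scenario.Scenario` object, but please check that against the scenario API documentation for your DSS version.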

MRvLuijpen
Neuron
Author

Hello @fchataigner2 ,

Thanks for your reply. I will follow up on your suggestion.
