How to automate multiple runs of a flow with a different source file each time
Darren_Gage
Hi
I have a flow that I want to run multiple times (manually triggered), and each run requires a new source file. I'd like to load the files into a folder (file_A, file_X, file_C, etc.) and then kick off the flow to run through each file in turn automatically, so run_1 uses file_A, and once it completes, run_2 starts with file_X, and so on.
What is the best way to achieve this? My thinking was to upload all the files to a folder and then pass in a list of file names (from a csv, an editable dataset, or pasted into a script); for each file name, the flow would run and look for that file in the folder, perhaps with each file name passed in turn as a variable. I just don't know how to achieve this, or whether there is a better way.
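Roughly, I'm picturing something like the sketch below in a scenario Python step. This is just the idea, not tested code; the file names, the variable name and the dataset name are all made up, and I'm not sure these are the right dataiku calls:

from dataiku.scenario import Scenario

scenario = Scenario()

# The list of file names could come from a csv, an editable dataset,
# or just be pasted in like this
file_names = ["file_A", "file_X", "file_C"]

for file_name in file_names:
    # Pass the current file name into the flow as a project variable,
    # e.g. read in a recipe via ${current_file} or dataiku.get_custom_variables()
    scenario.set_project_variables(current_file=file_name)
    # Rebuild the flow's output so this run picks up the current file
    scenario.build_dataset("my_output_dataset")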
Any advice (with code tips if possible) would be greatly appreciated.
Thank you
Answers
-
Hi Darren,
This looks like a good use case for the partitioning features of DSS.
Partitioning lets you run the same recipe in parallel, specifying which partitions you want to build.
In your case, you would have a folder containing the files, and from the folder's settings you can define a partitioning pattern. Then, if you create a recipe whose output is partitioned by the same dimension, you can run the recipe across the partitions in parallel.
You can check the documentation on this topic; it's pretty advanced!
https://doc.dataiku.com/dss/latest/partitions/index.html
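If you prefer to trigger the builds from code rather than from the UI, a scenario Python step along these lines should do it (a sketch only; the dataset name and partition ids are examples matching the files above):

from dataiku.scenario import Scenario

scenario = Scenario()

# One partition id per source file in the folder
partitions = ["file_A", "file_X", "file_C"]

# Build all the partitions of the output dataset; DSS can run them in parallel
scenario.build_dataset("my_partitioned_output", partitions=",".join(partitions))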
Matt
-
Hi Matt
Thanks for this detailed reply, I'll take a look at partitioning to see if it will help; it does look advanced for me at this stage. You mentioned "parallel" in this solution, but I believe I need to run these sequentially, as the output from the first run with file_A is then used as one of the inputs for the next run with file_B (I'm identifying changes between subsequent files). Apologies for not spelling this out in the question. Do you have any additional ideas? Thanks
-
So for that I would just import file_A and file_B as two datasets, do stuff on the first dataset, and maybe join it with the second one.
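If you end up with many files and want that sequential "compare each file to the previous one" behaviour, a Python recipe reading the files directly from the folder could also work. A rough sketch, assuming csv files and a managed folder named input_folder (both made-up names):

import dataiku
import pandas as pd

# Managed folder holding the source files
folder = dataiku.Folder("input_folder")
paths = sorted(folder.list_paths_in_partition())

previous_df = None
for path in paths:
    with folder.get_download_stream(path) as stream:
        current_df = pd.read_csv(stream)
    if previous_df is not None:
        # Rows present in the current file but not in the previous one
        changes = current_df.merge(previous_df, how="left", indicator=True)
        changes = changes[changes["_merge"] == "left_only"].drop(columns="_merge")
        print("{}: {} new rows vs previous file".format(path, len(changes)))
    previous_df = current_df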
Sorry, it's difficult to understand the use case from my side.
If you want to give me more details, you can contact me at matt@dataiku.com