I've created a flow that takes an input file from S3 based on scenario trigger parameters, runs the processing, and finally saves the processed data back to S3, uploading to different paths depending on those parameters.
I trigger this flow from different scenarios to build ALLOC_STEP9_copy, passing different file paths as arguments, and based on those arguments the flow selects the S3 input file.
The problem is that when I run it simultaneously for multiple input files, I get an error saying the ALLOC_STEP9 dataset is already in use.
Can you help me understand how we can achieve a parallel run?
You can't run your flow concurrently; it is not supported. To be able to do that, you can use Dataiku Applications.
Or you can modify your flow to have different execution branches for different file paths. Then, rather than triggering your flow with parameters, you can use a dataset change trigger (which can also point to a folder) and have the flow kick off automatically when new files arrive. A Python recipe could then run on the folder, grab the new files, and push them to different folders, as sketched in the snippet below. Below is also a flow mock-up of what I mean. I have not changed the names of the datasets, but obviously these would be different names based on the type of file.
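The dispatch recipe could look roughly like this. It's only a minimal sketch: the managed folder names and the filename rule are placeholders, not anything from your project.

```python
import dataiku

# Folder names are placeholders; adjust to your flow.
input_folder = dataiku.Folder("incoming_files")
type_a_folder = dataiku.Folder("files_type_a")
type_b_folder = dataiku.Folder("files_type_b")

for path in input_folder.list_paths_in_partition():
    # Route each file to a destination folder based on its name
    # (replace this rule with whatever identifies your file types).
    if "type_a" in path.lower():
        destination = type_a_folder
    else:
        destination = type_b_folder

    # Copy the file into the destination folder.
    with input_folder.get_download_stream(path) as stream:
        destination.upload_stream(path, stream)

    # Optionally remove it so the next trigger only sees new arrivals.
    input_folder.delete_path(path)
```

Each destination folder then feeds its own branch of the flow, so the branches can build without fighting over the same dataset.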
Have you investigated the use of partitions?
You can check the knowledge base to evaluate if this might do what you need: Tutorial | File-based partitioning - Dataiku Knowledge Base
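As a rough illustration of how partitions fit with scenarios: a scenario Python step can rebuild several partitions of a partitioned dataset in a single build. This is only a sketch; the partition identifiers are placeholders.

```python
from dataiku.scenario import Scenario

scenario = Scenario()

# Rebuild two file-based partitions of the same dataset in one build step;
# DSS computes them within a single job rather than as separate flow runs.
scenario.build_dataset(
    "ALLOC_STEP9",
    partitions="file_batch_A,file_batch_B",
)
```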
The design I proposed was just a sample; you can break down the flow in many different ways. For instance, you could break it down by season, so you would end up with two branches. Also, you seem to have ignored my reference to Dataiku Applications, which is the only supported way to run multiple flow executions concurrently. Even partitions, as suggested by Ignacio, will not allow you to run multiple flow executions concurrently, although you can recalculate multiple partitions at the same time.
Finally, I would say that the method I proposed, using an event-based trigger (a dataset change trigger on top of a folder), is the more modern way of approaching this problem. I would also like to know why you want to process multiple files concurrently. Is this an actual requirement, or just a side effect of your decision to execute the scenario via an API call? If the files don't all arrive at the same time, there is no point in building a flow that supports concurrent execution. And even if they do arrive at the same time, you may have no time constraint that prevents processing them serially, so why bother? As per my example, you can also get some parallel execution by breaking your flow into different branches.
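For what it's worth, if you do keep the API-triggered approach, the call would look something along these lines. This is just a sketch: the host, API key, project key, scenario id, and parameter name are placeholders, not taken from your project.

```python
import dataikuapi

# Connect to the DSS instance (host and API key are placeholders).
client = dataikuapi.DSSClient("https://dss.example.com:11200", "YOUR_API_KEY")
project = client.get_project("MY_PROJECT")
scenario = project.get_scenario("PROCESS_ALLOC_FILE")

# Pass the S3 input path as a custom parameter; inside the scenario it is
# available through the trigger params (dataiku.scenario.Scenario().get_trigger_params()).
scenario.run(params={"input_path": "s3_input/alloc_file_1.csv"})
```

But note that each such run still builds the same datasets, which is exactly why you hit the "dataset is already in use" error when two runs overlap.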