I've created a flow that takes an input file from S3 based on scenario trigger parameters, runs the processing, and finally saves the processed data back to S3, uploading to different paths depending on those parameters.
I trigger this flow from different scenarios to build ALLOC_STEP9_copy, passing different file paths as arguments, and based on those arguments the flow selects the S3 input file.
The problem is that when I run it simultaneously for multiple input files, I get an error saying the ALLOC_STEP9 dataset is already in use.
Can you help me understand how we can achieve a parallel run?
You can't run your flow concurrently; it is not supported. To be able to do that, you can use Dataiku Applications.
Or you can modify your flow to have different execution branches for different file paths. Then, rather than triggering your flow with parameters, you can use a dataset change trigger (which can also point to a folder) and have the flow kick off automatically when new files arrive. A Python recipe could then run on the folder, grab the new files, and push them to different folders, as sketched in the snippet below. Below is also a flow mock-up of what I mean. I have not changed the names of the datasets, but obviously these would be different names based on the type of file.
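The dispatch recipe could look roughly like this. It's only a minimal sketch: the managed folder names and the filename rule are placeholders, not anything from your project.

```python
import dataiku

# Folder names are placeholders; adjust to your flow.
input_folder = dataiku.Folder("incoming_files")
type_a_folder = dataiku.Folder("files_type_a")
type_b_folder = dataiku.Folder("files_type_b")

for path in input_folder.list_paths_in_partition():
    # Route each file to a destination folder based on its name
    # (replace this rule with whatever identifies your file types).
    if "type_a" in path.lower():
        destination = type_a_folder
    else:
        destination = type_b_folder

    # Copy the file into the destination folder.
    with input_folder.get_download_stream(path) as stream:
        destination.upload_stream(path, stream)

    # Optionally remove it so the next trigger only sees new arrivals.
    input_folder.delete_path(path)
```

Each destination folder then feeds its own branch of the flow, so the branches can build without fighting over the same dataset.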
Have you investigated the use of partitions?
You can check the knowledge base to evaluate if this might do what you need: Tutorial | File-based partitioning - Dataiku Knowledge Base
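As a rough illustration of how partitions fit with scenarios: a scenario Python step can rebuild several partitions of a partitioned dataset in a single build. This is only a sketch; the partition identifiers are placeholders.

```python
from dataiku.scenario import Scenario

scenario = Scenario()

# Rebuild two file-based partitions of the same dataset in one build step;
# DSS computes them within a single job rather than as separate flow runs.
scenario.build_dataset(
    "ALLOC_STEP9",
    partitions="file_batch_A,file_batch_B",
)
```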
The design I proposed was just a sample; you can break down the flow in many different ways. For instance, you could break it down by season, so you would end up with two branches. Also, you seem to have ignored my reference to Dataiku Applications, which is the only supported way to run multiple flow executions concurrently. Even partitions, as suggested by Ignacio, will not allow you to run multiple flow executions concurrently, although you can recalculate multiple partitions at the same time.
Finally, I would say that the method I proposed, using an event-based trigger (a dataset change trigger on top of a folder), is the more modern way of approaching this problem. I would also like to know why you want to process multiple files concurrently. Is this an actual requirement, or just a side effect of your decision to execute the scenario via an API call? If the files don't all arrive at the same time, there is no point in building a flow that supports concurrent execution. And even if they do arrive at the same time, you may have no time constraint that prevents processing them serially, so why bother? As per my example, you can also get some parallel execution by breaking your flow into different branches.
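For what it's worth, if you do keep the API-triggered approach, the call would look something along these lines. This is just a sketch: the host, API key, project key, scenario id, and parameter name are placeholders, not taken from your project.

```python
import dataikuapi

# Connect to the DSS instance (host and API key are placeholders).
client = dataikuapi.DSSClient("https://dss.example.com:11200", "YOUR_API_KEY")
project = client.get_project("MY_PROJECT")
scenario = project.get_scenario("PROCESS_ALLOC_FILE")

# Pass the S3 input path as a custom parameter; inside the scenario it is
# available through the trigger params (dataiku.scenario.Scenario().get_trigger_params()).
scenario.run(params={"input_path": "s3_input/alloc_file_1.csv"})
```

But note that each such run still builds the same datasets, which is exactly why you hit the "dataset is already in use" error when two runs overlap.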