How to run a Dataiku flow in parallel for multiple different parameters

hareesh
hareesh Dataiku DSS Core Designer, Registered Posts: 4

I've created a flow that takes an input file from S3 based on scenario trigger parameters, runs the processing, and finally saves the processed data back into S3; depending on the parameters, it uploads to a different path.

[Attached screenshot: Capture.PNG]

I'm triggering the above flow from different scenarios to build ALLOC_STEP9_copy, passing different file paths as arguments; based on those arguments the flow selects the S3 input file.

The problem is that when I run it simultaneously for multiple input files, I get an error saying the ALLOC_STEP9 dataset is already in use.

Can you help me understand how we can achieve a parallel run?

Answers

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,757 Neuron

    You can't run your flow concurrently; it is not supported. To be able to do that, you can use Dataiku Applications:

    https://knowledge.dataiku.com/latest/mlops-o16n/dataiku-applications/concept-dataiku-applications.html

    Or you can modify your flow to have different execution branches for different file paths. Then, rather than triggering your flow with parameters, you can use a dataset change trigger (which can also point to a folder) and have the flow kick off automatically when new files arrive. You could then have a Python recipe run on the folder, grab the new files, and push them to different folders. Below is a flow mock-up of what I mean. I have not changed the names of the datasets, but obviously these would be different names based on the type of file.

    [Attached screenshot: Capture.PNG]
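
    A minimal sketch of what such a "dispatch" Python recipe could look like, assuming an input managed folder and per-branch output folders (the folder names and the routing rule are illustrative, not taken from the flow above):

        import dataiku

        # Illustrative folder names; replace with the managed folders in your own flow
        incoming = dataiku.Folder("incoming_files")
        plant1_files = dataiku.Folder("plant1_files")
        other_files = dataiku.Folder("other_files")

        for path in incoming.list_paths_in_partition():
            # Route each new file to a branch-specific folder based on its path/name
            target = plant1_files if "plant1" in path else other_files
            with incoming.get_download_stream(path) as stream:
                target.upload_stream(path, stream)

    Each output folder then feeds its own branch of the flow, and the dataset change trigger fires the scenario whenever new files land in the incoming folder.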

  • hareesh
    hareesh Dataiku DSS Core Designer, Registered Posts: 4
    S3 structure
    bucket/plant1/season1/inp_file.csv
    bucket/plant1/season2/inp_file.csv
    bucket/plant2/season1/inp_file.csv
    bucket/plant2/season2/inp_file.csv

    From an external API I want to trigger the Dataiku flow through scenarios by passing the plant and season details, and in the first step of the scenario I'll update the local variable values with the plant and season.
    In the flow, based on the variable values, it needs to pick up the input file from S3 and run a Python recipe to generate a new processed CSV file with the desired name, out_file.csv, and load it into S3 in different places.
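
    A hedged sketch of what such an external trigger could look like with the Dataiku public Python client, setting the project variables directly before the run rather than in a scenario step; the host, API key, project key, scenario id and variable names are illustrative placeholders, not the actual project's:

        import dataikuapi

        client = dataikuapi.DSSClient("https://dss.example.com", "MY_API_KEY")
        project = client.get_project("ALLOC_PROJECT")

        # Set the plant/season that the flow should process
        variables = project.get_variables()
        variables["standard"]["plant"] = "plant1"
        variables["standard"]["season"] = "season1"
        project.set_variables(variables)

        # Run the scenario that rebuilds ALLOC_STEP9_copy and wait for it to finish
        project.get_scenario("BUILD_ALLOC_STEP9_COPY").run_and_wait()

    Inside the flow, a Python recipe can then read the variables back with dataiku.get_custom_variables() and build the S3 path (e.g. plant/season/inp_file.csv) from them.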

    We have 20 plants and each has 2 seasons; if we create a separate flow for each, that would be 40 flows.
    Is there any other possibility to achieve this concurrent execution with a single flow, or any alternative?
  • Ignacio_Toledo
    Ignacio_Toledo Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 411 Neuron

    Have you investigated the use of partitions?

    You can check the knowledge base to evaluate if this might do what you need: Tutorial | File-based partitioning - Dataiku Knowledge Base
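
    For reference, with the S3 layout described above, file-based partitioning could expose plant and season as two discrete dimensions via a path pattern along these lines (a sketch only; the exact pattern depends on how the dataset's path prefix is configured):

        %{plant}/%{season}/inp_file.csv

    Each (plant, season) pair then becomes a partition that can be rebuilt independently, and a scenario build step can request several partitions at once.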

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,757 Neuron

    The design I proposed was a sample. You can break down the flow in many different ways; for instance, you could break it down by season, so you'd end up with two branches. Also, you seem to have ignored my reference to Dataiku Applications. This is the only supported way to run multiple flow executions concurrently. Even partitions, as suggested by Ignacio, will not allow you to run multiple flow executions concurrently, although you can recalculate multiple partitions at the same time.

    Finally, I would say that the method I proposed using an event-based trigger (dataset changed, on top of a folder) is the more modern way of approaching this problem. I would also like to know why you want to process multiple files concurrently: is this an actual requirement, or just a side effect of your decision to execute the scenario via an API call? If the files don't all arrive at the same time, there is no point in building a flow that supports concurrent execution. And even if they do arrive at the same time, you may not have any time constraints forcing you to avoid processing them serially, so why bother. And, as per my example, you can get some parallel execution if you break your flow down into different branches.
