How can we run the same flow multiple times concurrently with different datasets?
We have created a flow which takes in different input datasets and updates many tables in Snowflake. There are several scenarios that trigger the recipes, and the scenarios are run via API endpoints called from a frontend website.
What we want to do is run the flow for multiple inputs. Each input will modify some global variables such as filename, filetype, etc., and we use these input-specific variables to update tables in Snowflake. Each set of variables should stay the same for its particular input.
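Roughly, the kind of backend call we make today looks like the sketch below (the host, API key, project key, scenario id and variable values are just placeholders, not our real names). Each request overwrites the shared project variables before running the scenario, which is why concurrent runs for different inputs would step on each other.

```python
# Sketch of the current approach (all names and values are placeholders).
import dataikuapi

client = dataikuapi.DSSClient("https://dss.example.com", "API_KEY")
project = client.get_project("MY_PROJECT")

# Set the per-input values as project (global) variables
variables = project.get_variables()
variables["standard"]["filename"] = "input_file_1.csv"
variables["standard"]["filetype"] = "csv"
project.set_variables(variables)

# Trigger the scenario that updates the Snowflake tables
scenario = project.get_scenario("UPDATE_SNOWFLAKE_TABLES")
scenario.run_and_wait()
```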
How can we do this in Dataiku? I explored Dataiku Applications, but all I could see was that each instance creates a flow similar to the original project's. I couldn't find the API endpoints, and I am not sure how to trigger the scenarios in the app instances. Also, the instances don't seem to be scalable.
Is this use case feasible in Dataiku?
Answers
Turribeach
We would need to understand your flow design and your requirements better. Scenarios can only run one instance of themselves at a time, and in general a flow is not meant to be executed concurrently: if you attempt to refresh the same dataset from two different scenarios at the same time you will get an error.

Having said that, there are ways around these limitations. For instance, you can develop a flow with multiple execution branches, all starting from an input managed folder. Instead of using project global variables to set things like file name and file type, you can fetch these dynamically from the managed folder. And instead of using an API call to trigger the scenarios, you can use a single dataset changed trigger on the managed folder. You can then have as many execution branches as you want/need to process files in parallel, separating files by type or any other logic and processing each file in a different execution branch. The options are endless.

This is a much better design since it allows for parallel execution/loading of different files. It also doesn't rely on an external API call to trigger the scenario but fires when a file arrives. Event-based mini-batches are a much better solution as they can process files as they arrive rather than waiting for a big batch run. Below is a sample flow with this idea.
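To make the dynamic-metadata idea a little more concrete, here is a minimal sketch of a Python recipe that lists the files in an input managed folder and derives the file name and type per file, instead of reading them from project variables. The folder name "incoming_files" and the output dataset "file_metadata" are illustrative assumptions, not part of the sample flow above.

```python
# Minimal sketch: derive per-file metadata from a managed folder instead of project variables.
# "incoming_files" and "file_metadata" are illustrative names only.
import os
import dataiku
import pandas as pd

folder = dataiku.Folder("incoming_files")        # input managed folder
paths = folder.list_paths_in_partition()         # e.g. ["/input_file_1.csv", "/report.xlsx"]

rows = []
for path in paths:
    filename = os.path.basename(path)
    filetype = os.path.splitext(filename)[1].lstrip(".").lower()
    rows.append({"path": path, "filename": filename, "filetype": filetype})

# Downstream branches can filter this dataset by filetype (or any other logic)
# and process each file independently, in parallel.
output = dataiku.Dataset("file_metadata")
output.write_with_schema(pd.DataFrame(rows))
```

A scenario with a dataset changed trigger on the managed folder would then run this recipe (and the branches downstream of it) whenever new files arrive, rather than waiting for an external API call.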