Parallel processing: running a single code recipe multiple times
Hi Everyone,
Is it possible in Dataiku DSS to run the same code multiple times in parallel, with the same datasets as inputs and outputs?
I am trying to execute a single DSS code recipe (Python and SQL) multiple times in parallel to process about 100 crore (1 billion) rows covering 7 time periods, 200 brands and 500 columns.
So I want to run the code recipe in parallel on any of the engines available in Dataiku.
Could anyone help me achieve this without overloading the engines and without the execution timing out?
If there is a way to do this, please suggest a solution.
Thank you in advance
Answers
-
tgb417 · Neuron · Posts: 1,601
Welcome to the Dataiku Community. We are so glad that you are joining us.
First of all, I am not an expert on this. However, from what I know about DSS, I would think you would want to look at running with what Dataiku calls a partitioned dataset, with one partition for each of the subsets you want to run your process against. I also suspect you may want to run with a parallel data processing environment, like Spark.
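To give a feel for what that could look like, here is a minimal sketch of the body of a partitioned Python recipe, assuming both the input and output datasets are partitioned on a discrete brand dimension; the dataset names and the BRAND dimension are placeholders, not anything from the original post:

```python
# Sketch of a partitioned Python recipe body, assuming input and output datasets
# are both partitioned on a discrete "BRAND" dimension. Dataset names and the
# dimension name are placeholders.
import dataiku

# DSS exposes the partition being built as a flow variable (DKU_DST_<dimension name>).
brand = dataiku.dku_flow_variables["DKU_DST_BRAND"]

# Reads and writes are scoped to the partition this activity is building.
input_ds = dataiku.Dataset("sales_by_brand")
df = input_ds.get_dataframe()

# ... per-brand processing goes here ...
df["brand_checked"] = brand

output_ds = dataiku.Dataset("sales_by_brand_prepared")
output_ds.write_with_schema(df)
```

With the partitioning in place, DSS can build several partitions as separate activities, each one running this recipe against a single brand's slice of the data.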
Others, please jump in on this conversation. I’d like to hear and learn more about how others would approach a project like this.
-
Turribeach · Neuron · Posts: 2,160
The fundamental concept in Dataiku is that the project flow can only run a single instance at a time. Think about how the outputs of a recipe work: Dataiku will generally truncate and rebuild output datasets. Partitions may allow some parallelisation in some datasets, but it's not feasible to have all datasets partitioned. And the output of a partitioned dataset may well be a non-partitioned dataset, so how do you handle that case?
One alternative for recipe isolation is the Application-as-recipe option. In this alternative you can execute the same code using different inputs and outputs, since it clones the recipe and runs in isolation for each set of inputs and outputs. But this option isn't designed for scalability, just for code reusability.
Neither of the above is a good parallel solution. Aside from using Spark like Tom suggests, your other option is a poor man's parallelisation of the flow. This can be done by splitting your data with a Split recipe so that you can have multiple branches of flow execution running at the same time. In this design you don't "break" flow rules, since each execution branch uses separate inputs and outputs. This may require slicing the data by some column; in your case you could slice by brand, with 20 brands per flow branch and 10 flow execution branches (see the sketch below). Of course this will add some overhead as you first split the data and then combine it back again after processing. And of course you will be subject to the concurrent activities limits, like the global limit and the per-job limit, and to the number of cores available in your system.
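To make the branch approach concrete, here is a minimal sketch that kicks off the ten branch builds concurrently using the Dataiku Python API client; the project key and the branch_output_* dataset names are placeholders, and actual parallelism is still capped by the concurrent activity limits and available cores:

```python
# Minimal sketch: build ten flow branches concurrently via the Dataiku API client.
# "MY_PROJECT_KEY" and the branch_output_* dataset names are placeholders for the
# final dataset of each flow branch created with the Split recipe.
from concurrent.futures import ThreadPoolExecutor

import dataiku

client = dataiku.api_client()
project = client.get_project("MY_PROJECT_KEY")

def build_branch(dataset_name):
    # Build only this branch's output (non-recursive forced build) and wait for it.
    job = project.new_job("NON_RECURSIVE_FORCED_BUILD")
    job.with_output(dataset_name)
    job.start_and_wait()

branch_outputs = ["branch_output_%02d" % i for i in range(1, 11)]

# Run the ten builds side by side; DSS will still queue any activities that exceed
# its concurrent activity limits.
with ThreadPoolExecutor(max_workers=len(branch_outputs)) as pool:
    list(pool.map(build_branch, branch_outputs))
```

Once all ten branches have finished, a downstream Stack recipe can recombine the branch outputs into a single dataset.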