Is it possible in Dataiku DSS to run the same code multiple times in parallel with the same datasets as inputs and outputs?
I am trying to execute a single DSS code recipe (Python and SQL) multiple times in parallel, operating on around 100 crore (1 billion) rows spanning 7 time periods, 200 brands, and 500 columns.
I want to run the core recipe in parallel on any of the engines available in Dataiku.
Could anyone help me achieve this without overloading the engines and without the execution timing out?
If there is a recommended approach, please suggest it.
Thank you in advance.
Welcome to the Dataiku Community. We are so glad that you are joining us.
First of all, I am not an expert on this. However, from what I know about DSS, I would look at running with what Dataiku calls a partitioned dataset, with one partition for each of the subsets you want to run your process against. I also suspect you may want a parallel data processing engine, like Spark.
Others, please jump in on this conversation. I’d like to hear and learn more about how others would approach a project like this.
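To make the partitioned-dataset idea concrete, here is a toy sketch in plain Python (not the DSS API): rows are grouped by a partition column, and each partition is then built independently by the same per-partition logic. This independence is what lets DSS schedule partition builds in parallel. The `brand` column and the `build_partition` function are hypothetical stand-ins for your own partitioning dimension and recipe logic.

```python
from collections import defaultdict

# Sample rows; "brand" plays the role of the partitioning column.
rows = [
    {"brand": "acme", "sales": 10},
    {"brand": "acme", "sales": 20},
    {"brand": "globex", "sales": 30},
]

def build_partition(partition_rows):
    # Stand-in for the per-partition recipe logic: here, total sales.
    return sum(r["sales"] for r in partition_rows)

# Group rows by partition value, then build each partition independently.
partitions = defaultdict(list)
for r in rows:
    partitions[r["brand"]].append(r)

results = {brand: build_partition(pr) for brand, pr in partitions.items()}
```

In DSS the grouping step is handled by the partitioning scheme on the dataset itself, and each call to the per-partition logic becomes a separate activity the scheduler can run concurrently.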
The fundamental concept in Dataiku is that the project flow can only run a single instance at a time. Think about how the outputs of a recipe work: Dataiku will generally truncate and rebuild output datasets. Partitions may allow some parallelisation for some datasets, but it's not feasible to have all datasets partitioned. And the output of a partitioned dataset may well be a non-partitioned dataset, so how do you handle that case?
One alternative for recipe isolation is the Application-as-recipe option. With this approach you can execute the same code using different inputs and outputs, since it clones the recipe and runs it in isolation for each set of inputs and outputs. But this option is not designed for scalability, just for code reusability.
Neither of the above is a good parallel solution. Aside from using Spark like Tom suggests, your other option is a poor man's parallelisation of the flow. This can be done by splitting your data with a Split recipe so that you can have multiple branches of flow execution running at the same time. In this design you don't "break" flow rules, since each execution branch uses separate inputs and outputs. This may require slicing the data by some column; in your case you could use brand, with 20 brands per flow branch and 10 flow execution branches. Of course this adds some overhead, as you first split the data and then combine it back again after processing. And of course you will be subject to the concurrent activity limits, like the global limit and the per-job limit, and to the number of cores available on your system.
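The split/process/combine design above can be sketched outside DSS with a thread pool, just to show the shape of it: 200 brands are sliced into 10 batches of 20, each batch is processed as if it were its own flow branch, and the branch outputs are stacked back together at the end. The `process_branch` function and the doubling of `value` are hypothetical stand-ins for your recipe logic; in DSS each branch would be a separate recipe with its own input and output datasets, and for CPU-heavy Python work you would reach for separate processes or Spark rather than threads.

```python
from concurrent.futures import ThreadPoolExecutor

def process_branch(brand_batch, rows):
    # Stand-in for one flow branch: keep only this batch's brands and
    # apply the (hypothetical) transformation.
    return [{"brand": r["brand"], "value": r["value"] * 2}
            for r in rows if r["brand"] in brand_batch]

def run_in_branches(rows, brands, n_branches=10):
    # Slice the brands into n_branches disjoint batches (the Split recipe step).
    size = len(brands) // n_branches
    batches = [set(brands[i * size:(i + 1) * size]) for i in range(n_branches)]
    # Run the branches concurrently, mirroring parallel flow execution.
    with ThreadPoolExecutor(max_workers=n_branches) as pool:
        results = pool.map(process_branch, batches, [rows] * n_branches)
    # Combine the branch outputs, as a Stack recipe would at the end of the flow.
    combined = []
    for part in results:
        combined.extend(part)
    return combined
```

Note that the batches are disjoint, so no row is processed twice and the combined output has exactly as many rows as the input, which is the property that keeps this design from breaking the single-writer rule on any one dataset.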