I am working on a scenario where I need to apply the same series of steps (Preparation, Join, Filter, etc.) to a large number of datasets. These datasets are sourced from different databases requiring separate credentials. What is the simplest way of addressing this use case? The approach I have come up with is:
1. Populate the input datasets into the flow via a SQL or Python recipe. Connecting to the appropriate dataset/connection is handled within the recipe based on an environment variable passed to it.
2. Set up a scenario that cycles through all the values required for the multiple datasets and passes them as environment variables to the code recipe.
Is there a less/no-code way to do this?
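The two steps above can be sketched in plain Python. This is only an illustration of the pattern, not Dataiku's API: `CONNECTIONS`, `load_rows`, and `run_recipe` are hypothetical names, and the final loop stands in for the scenario that would set the variable before each recipe run.

```python
# Hypothetical sketch of the variable-driven recipe described above.
# None of these names are Dataiku APIs; they illustrate the pattern only.

CONNECTIONS = {
    "sales_db": {"host": "sales.example.com", "table": "orders"},
    "hr_db": {"host": "hr.example.com", "table": "employees"},
}

def load_rows(conn_cfg):
    """Placeholder for the recipe's actual data pull over a connection."""
    # A real recipe would open the connection and run a query;
    # here we return a labelled stub so the flow is visible.
    return [{"source": conn_cfg["table"], "value": i} for i in range(3)]

def run_recipe(connection_name):
    """One invocation of the recipe, driven by a variable value."""
    cfg = CONNECTIONS[connection_name]
    rows = load_rows(cfg)
    # ... Preparation / Join / Filter steps would follow here ...
    return rows

# The scenario's job is this loop: set the variable, run the recipe,
# repeat for every connection.
results = {name: run_recipe(name) for name in CONNECTIONS}
```

The key point is that the recipe body never changes; only the variable selecting the connection does.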
Hi @yashpuranik ,
You may want to look at using Application as a recipe for this use case:
You would need to define the input datasets but reuse the rest of the steps in your flow on different datasets.
Let me know if that helps
I was aware of Application as a recipe, and was certainly planning to use it to streamline my flow. I would like to streamline the definition of the input datasets as well. Something like the following:
1. Set up a dataset (table) with the list of input connections I want to apply my recipe to.
2. Have an "iterator" recipe that will work on one value/connection at a time. It will load the input dataset, pass it on to the application as a recipe and generate the output.
That way, I don't need multiple sub flows, a single flow can manage the entire operation. Any way to do it outside a code recipe?
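The iterator idea above can be sketched as a small driver loop. This is illustrative only: `driver_table` stands in for the dataset listing the connections, and `apply_subflow` is a hypothetical stand-in for invoking the Application-as-recipe on one input.

```python
# Illustrative sketch of the "iterator" pattern: a driver table lists
# the connections, and the same packaged logic is applied to each one.

driver_table = [
    {"connection": "sales_db"},
    {"connection": "hr_db"},
    {"connection": "finance_db"},
]

def apply_subflow(connection):
    """Stand-in for running the Application-as-recipe on one input."""
    # The real call would build the sub-flow's output for this input.
    return f"output_{connection}"

# One loop, one flow: each row of the driver table yields one output.
outputs = [apply_subflow(row["connection"]) for row in driver_table]
```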
The only no-code way I can think of is to copy the sub-flow into the same project and manually change the input dataset from the flow to each of the other datasets. This would require no code.
You would then have multiple sub-flows in the same project. If you need to change output datasets, you can also use Change connection from the Other actions menu.
Gotcha. It could be useful for Dataiku to implement a visual recipe that abstracts a for loop for situations like these. Admittedly it is a very short for loop in Python, but a visual recipe would help expand the reach to non-programmer citizen data scientists.
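For reference, that short loop amounts to something like the sketch below. The `set_variable` and `build_output` functions are stubs standing in for the real scenario-step calls (set a project variable, then rebuild the output); the names are assumptions, not Dataiku's API.

```python
# Hedged sketch of the short scenario loop: for each connection,
# set the variable the recipe reads, then rebuild the output.
# set_variable/build_output are stubs, not Dataiku API calls.

connections = ["sales_db", "hr_db", "finance_db"]

log = []  # records the calls so the loop's behaviour is visible

def set_variable(name, value):
    """Stub for setting a project/scenario variable."""
    log.append(("set", name, value))

def build_output(dataset):
    """Stub for triggering a build of the output dataset."""
    log.append(("build", dataset))

for conn in connections:
    set_variable("input_connection", conn)  # parameterize the recipe
    build_output("final_output")            # rerun the sub-flow
```

A visual "for each" recipe would essentially expose this set-then-build pair as configurable steps.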