Applying the same operations to multiple datasets

yashpuranik
yashpuranik Partner, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2022, Neuron 2023 Posts: 69 Neuron

Hello,

I am working on a scenario where I need to apply the same series of steps (Preparation, Join, Filter, etc.) to a large number of datasets. These datasets are sourced from different databases that require separate credentials. What is the simplest way to address this use case? The approach I have come up with is:

1. Populate the input datasets into the flow via a SQL or Python recipe. The recipe connects to the appropriate dataset/connection based on an environment variable passed to it.

2. Set up a scenario that cycles through the values required for each dataset and passes them as environment variables to the code recipe.
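
The two steps above can be sketched roughly as follows. This is a minimal, DSS-independent illustration, not the actual Dataiku API: the `SOURCES` table, the `source_key` variable name, and the `set_variable` / `build_flow` callables are all hypothetical stand-ins. In DSS they would presumably wrap a variable update and a build step inside a custom scenario step.

```python
from typing import Callable, Dict, List

# Hypothetical lookup table: variable value -> connection and table to read.
# These names are placeholders, not real DSS connections.
SOURCES: Dict[str, Dict[str, str]] = {
    "sales_eu": {"connection": "pg_eu", "table": "public.sales"},
    "sales_us": {"connection": "mssql_us", "table": "dbo.sales"},
}


def resolve_source(source_key: str) -> Dict[str, str]:
    """Step 1: inside the recipe, map the variable value to a connection/table."""
    try:
        return SOURCES[source_key]
    except KeyError:
        raise ValueError(f"Unknown source key: {source_key!r}") from None


def run_for_all_sources(
    source_keys: List[str],
    set_variable: Callable[[str, str], None],
    build_flow: Callable[[], None],
) -> List[str]:
    """Step 2: for each source, set the variable, then rebuild the flow."""
    processed = []
    for key in source_keys:
        resolve_source(key)  # fail early on a typo, before touching the flow
        set_variable("source_key", key)  # assumed variable name
        build_flow()  # assumed to rebuild the shared steps for this source
        processed.append(key)
    return processed
```

Passing the side effects in as callables keeps the loop itself testable without a DSS instance; only the two callables would need to know about the platform.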

Is there a less/no-code way to do this?

Answers

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,211 Dataiker

    Hi @yashpuranik,

    You may want to look at using Application as a recipe for this use case:
    https://doc.dataiku.com/dss/latest/applications/application-as-recipe.html#application-as-recipe
    https://knowledge.dataiku.com/latest/courses/o16n/dataiku-applications/create-app-as-recipe.html

    You would still need to define the input datasets, but you could reuse the rest of the steps in your flow on different datasets.

    Let me know if that helps.

  • yashpuranik
    yashpuranik Partner, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2022, Neuron 2023 Posts: 69 Neuron

    Hi @AlexT,

    I was aware of Application as a recipe, and was certainly planning to use it to streamline my flow. I would like to streamline the definition of input datasets as well. Something like the following:

    1. Set up a dataset (table) with the list of input connections I want to apply my recipe to

    2. Have an "iterator" recipe that will work on one value/connection at a time. It will load the input dataset, pass it on to the application as a recipe and generate the output.

    That way, I don't need multiple sub-flows; a single flow can manage the entire operation. Is there any way to do this outside a code recipe?
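
    For comparison, the code version of such an "iterator" is roughly the sketch below. It is plain Python with no DSS calls: `control_rows` stands in for the dataset listing the connections, and `load_input` and `shared_steps` are hypothetical callables for loading one input and running the reusable steps (for example, the application-as-recipe).

```python
from typing import Callable, Dict, Iterable, List, Tuple

Row = Dict[str, str]


def iterate_connections(
    control_rows: Iterable[Row],
    load_input: Callable[[Row], List[dict]],
    shared_steps: Callable[[List[dict]], List[dict]],
) -> Tuple[Dict[str, List[dict]], Dict[str, str]]:
    """Apply the same steps to the input behind each control-table row.

    Returns (outputs keyed by connection name, errors keyed by connection name).
    """
    outputs: Dict[str, List[dict]] = {}
    errors: Dict[str, str] = {}
    for row in control_rows:
        name = row["connection"]  # assumed column name in the control table
        try:
            outputs[name] = shared_steps(load_input(row))
        except Exception as exc:
            # Record the failure and continue, so one bad source
            # does not block the remaining connections.
            errors[name] = str(exc)
    return outputs, errors
```

    Collecting errors per connection, rather than stopping at the first failure, is a design choice that suits sources with separate credentials, where a single expired login should not hold up the others.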

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,211 Dataiker

    Outside of a code recipe, the only no-code way I can think of is to copy the subflow within the same project and manually change the input dataset of each copy to one of the other datasets.

    You would then have multiple sub-flows in the same project. If you need to change the output datasets, you can also use "Change connection" from the Other actions menu.

  • yashpuranik
    yashpuranik Partner, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2022, Neuron 2023 Posts: 69 Neuron

    Gotcha. It could be useful for Dataiku to implement a visual recipe that abstracts a for loop for situations like these. Admittedly, it is a very short for loop in Python, but a visual recipe would help extend the reach to non-programmer citizen data scientists.
