How to automate multiple runs of a flow

MRvLuijpen
Neuron, Posts: 107

Hi Community,

Let me first describe the situation: we have set up a flow within DSS (mainly SQL scripts/queries running on a SQL Server) and need to run it for every month. The month indication is a combination of year and month (so not a date format). The total input dataset is really large (a few GB) and read-only.

The SQL scripts are set up to use a global variable (a ${variable_name} reference in the query) as the month indication and produce the output.

The steps inside the flow are:

- First, set the running month in the global variable;

- Then, run several of the SQL scripts.

Once the flow is complete, the global variable is set to a new month and the process is repeated.

How can we automate this? We were thinking about using scenarios, but how can we loop through the different months as global variables?

We also did some experiments with partitioning (discrete values), but this forced the scripts to run in the DSS engine, which performs much more slowly than the SQL Server.

Best Answer

  • ATsao
    ATsao Dataiker Alumni, Posts: 139
    Answer ✓

    Hi there,

    Yes, you should be able to accomplish this by leveraging a custom Python step in your scenario. Since your SQL scripts are already configured to use a global variable, you can leverage our Python APIs to programmatically retrieve the variable, update it, and then rerun the SQL scripts (by rebuilding the output dataset).

    As an example, let's say you defined two global variables: one containing the full list of months that you want to iterate through and the other being the actual month variable that's used in your recipes. Something like:

    [Screenshot: project variables defining list_of_months and Month]
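
    For reference, the project variables, as returned by get_variables() in the code below, would look roughly like this (a sketch; the month values are purely illustrative):

    {
        "standard": {
            "list_of_months": ["201901", "201902", "201903"],
            "Month": "201901"
        },
        "local": {}
    }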

    Then, the following code (used as a custom Python step in your scenario) should retrieve your project variables and rebuild the output dataset while iterating through the months provided.

    import dataiku
    from dataiku.scenario import Scenario
    
    # Instantiate client
    client = dataiku.api_client()
    this_proj = client.get_project('YOUR_PROJECT_KEY')
    
    # Create the main handle to interact with the scenario
    scenario = Scenario()
    
    # Retrieve the project variables ('standard' holds the global variables,
    # 'local' holds instance-local values, which override them)
    variables = this_proj.get_variables()
    
    months = variables['standard']['list_of_months']
    
    # Iterate through the months, setting the 'Month' variable and rebuilding
    # the output dataset for each one
    for m in months:
        variables['local']['Month'] = m
        this_proj.set_variables(variables)
        
        scenario.build_dataset('YOUR_OUTPUT_DATASET', build_mode="NON_RECURSIVE_FORCED_BUILD")
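
    If you would rather not maintain list_of_months by hand, you could also generate it inside the step. A minimal sketch, assuming the months are plain YYYYMM strings (the exact format is an assumption on my part):

    # Build YYYYMM strings for a range of years (assumed format, e.g. "201903")
    months = ["{}{:02d}".format(y, m) for y in range(2019, 2021) for m in range(1, 13)]

    Either way, the loop stays the same.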

    Please use this sample code as a reference, since you will need to adjust it to your specific use case, but the general concept should still apply.

    Also, I've gone ahead and attached a sample project that you can reference locally as well. The use case is slightly different, but it again illustrates how you can leverage scenarios to programmatically update global variables that are used in recipes and rebuild the corresponding output datasets in your Flow. In this sample project, we use these same concepts to create a scenario workflow that sends emails with attachments to a list of recipients, where each attachment contains data filtered for the target recipient. Just like in your situation, a global variable captures the list of recipients and is used in a recipe, and a custom Python step in a scenario iterates through the list and rebuilds the output dataset accordingly.

    Again, the concepts are similar, but I just wanted to provide some examples that you can reference and adapt for your own needs. Hopefully, this proves helpful to you!

    Best,

    Andrew

Answers

  • MRvLuijpen
    Neuron, Posts: 107

    Hello Andrew,

    Thank you for your reply. I will check this out and get back to you.

  • MRvLuijpen
    Neuron, Posts: 107

    Hi Andrew,

    I was able to get this working in my project. Instead of a fixed list defined in the global variables, I used the keys from the table.
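
    In case it helps others, here is a minimal sketch of that approach, querying the distinct month keys straight from SQL Server before the loop (the dataset, table, and column names below are placeholders, not the real ones):

    import dataiku
    from dataiku import SQLExecutor2

    # Run the DISTINCT query on the dataset's SQL connection, so the large
    # table never has to be loaded into DSS
    executor = SQLExecutor2(dataset="YOUR_INPUT_DATASET")
    df = executor.query_to_df(
        "SELECT DISTINCT year_month FROM your_table ORDER BY year_month"
    )
    months = df["year_month"].tolist()

    The resulting list then replaces the list_of_months variable in the loop from the accepted answer.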
  • PapaA
    Posts: 20

    Hi Andrew,

    Would it be possible to run these jobs in parallel for any kind of recipe?

    The idea is that, for different sets of parameters, we would like to train multiple models overnight, making use of one main file that reads the input data and writes to a folder.

    How could we parallelize the same Python recipe (main file) for different parameters or lists?
