One of the weaknesses of scenarios on DSS? Parallelization

Grixis6 Registered Posts: 15 ✭✭✭✭✭

Hello, I'm not sure which topic category my post belongs in.

However, I would like to share an observation for which I have not yet found an adequate solution, other than working around it by duplicating scenarios:

Let's say you have projects with many distinct subflows that have no dependencies between them. You want a scenario that builds the datasets in each subflow, so you create a scenario with a build step that lists the datasets to process. You then see that Dataiku builds the datasets listed in the step one by one.
Of course, I could create a separate scenario for each subflow and run them together, but that adds the complexity of configuring and managing a multitude of scenarios.

So, is there a way to build objects in parallel within the same step, or at least within the same scenario?

Answers

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,212 Dataiker
    edited July 17

    Hi @Grixis6,

    Dataiku parallelizes work in certain cases: SQL pipelines, Spark pipelines, and partitioned datasets.

    If you don't want to manage multiple scenarios, you can achieve what you are looking for with a Python step in the scenario.

    https://doc.dataiku.com/dss/latest/python-api/datasets-other.html#building-a-dataset

    import dataiku

    client = dataiku.api_client()
    project = client.get_default_project()

    # Walk the flow graph and keep the items that have no successors,
    # i.e. the right-most items of the flow (this assumes each one is a dataset).
    flow = project.get_flow()
    graph = flow.get_graph()
    edge_datasets = [d for d in graph.get_items_in_traversal_order() if not d["successors"]]

    for ds in edge_datasets:
        ds_handle = project.get_dataset(ds["ref"])
        # wait=False submits the job and returns immediately instead of blocking
        ds_handle.build(job_type="RECURSIVE_BUILD", wait=False)
        print("submitted build for", ds["ref"])

    This example does a recursive build of the right-most datasets in the project. You can adapt it to build datasets from other projects, or only specific datasets. The key is wait=False, which sends all of the build requests without waiting for each build to complete. In cases like this, the configured concurrency limits determine how many of these jobs run right away and how many are queued:

    https://doc.dataiku.com/dss/latest/flow/limits.html#limiting-concurrent-executions
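
One consequence of wait=False is that the Python step can finish before the builds do. If the scenario should only continue once every build has ended, the job handles can be kept and polled. The following is a minimal sketch, not an official recipe. It assumes that build(wait=False) returns a job handle and that the handle's get_status() returns a dict with the state under "baseStatus" and "state", which is my understanding of the public API client. The helper names submit_builds and wait_for_jobs and the dataset names in the usage comment are hypothetical.

```python
import time

# States treated as "finished" (assumed values of baseStatus.state).
TERMINAL_STATES = {"DONE", "FAILED", "ABORTED"}


def submit_builds(project, dataset_names, job_type="RECURSIVE_BUILD"):
    """Start one build job per dataset without waiting; return {name: job handle}."""
    jobs = {}
    for name in dataset_names:
        # wait=False returns as soon as the job has been submitted
        jobs[name] = project.get_dataset(name).build(job_type=job_type, wait=False)
    return jobs


def wait_for_jobs(jobs, poll_seconds=5.0):
    """Poll until every job reaches a terminal state; return {name: final state}."""
    final_states = {}
    pending = dict(jobs)
    while pending:
        for name, job in list(pending.items()):
            state = job.get_status().get("baseStatus", {}).get("state", "")
            if state in TERMINAL_STATES:
                final_states[name] = state
                del pending[name]
        if pending:
            time.sleep(poll_seconds)
    return final_states


# Possible use inside a scenario Python step (dataset names are placeholders):
#   import dataiku
#   project = dataiku.api_client().get_default_project()
#   jobs = submit_builds(project, ["subflow_a_output", "subflow_b_output"])
#   states = wait_for_jobs(jobs)
#   failed = [n for n, s in states.items() if s != "DONE"]
#   if failed:
#       raise Exception("Builds did not succeed: %s" % failed)
```

Raising an exception when any build ends in a state other than DONE makes the step fail, so that later steps in the scenario do not run on stale data.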

    I understand it would be more convenient if this could be defined directly in a scenario step, so I would suggest submitting a Product Idea: https://community.dataiku.com/t5/Product-Ideas/idb-p/Product_Ideas
