One of the weaknesses of scenarios in DSS? Parallelization
Hello, I'm not sure which topic my post falls under.
However, I would like to share an observation for which I still have not found an adequate solution, other than tinkering with workarounds by duplicating scenarios:
Let's say you have a project with a multitude of distinct subflows with no dependencies between them. You want to create a scenario that builds the datasets in each part, so you make a scenario with a build step including the list of datasets to process. And there you can see that the step builds the datasets one after another, sequentially, rather than in parallel.
Of course, I could create a separate scenario for each subflow and run a multi-scenario build, but that creates the complexity of managing and configuring a multitude of scenarios.
So, is there a way to parallelize the objects to build inside the same step, or even within the same scenario?
Answers
Alexandru (Dataiker)
Hi @Grixis6,
Dataiku does parallelize in certain cases: SQL pipelines, Spark pipelines, and partitioned datasets.
If you don't want to manage multiple scenarios, you can achieve what you are looking for with a Python step in the scenario:
https://doc.dataiku.com/dss/latest/python-api/datasets-other.html#building-a-dataset
import dataiku

client = dataiku.api_client()
project = client.get_default_project()
flow = project.get_flow()
graph = flow.get_graph()

# Keep only the "right-most" items in the flow: those with no successors
edge_datasets = [d for d in graph.get_items_in_traversal_order() if not d["successors"]]

for ds in edge_datasets:
    ds_handle = project.get_dataset(ds["ref"])
    # wait=False submits the build and returns immediately instead of blocking
    ds_handle.build(job_type="RECURSIVE_BUILD", wait=False)

print("done")
In this example, it will do a recursive build of the right-most datasets in the project. You can adapt this to build from other projects or specific datasets only. The key here is wait=False, which sends all of the build requests without waiting for each build to complete. In cases like this, the defined concurrency limits determine how many of these jobs run right away versus being queued:
https://doc.dataiku.com/dss/latest/flow/limits.html#limiting-concurrent-executions
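For instance, here is a minimal sketch of the "specific datasets only" adaptation. The dataset names are hypothetical placeholders, and it assumes that build(wait=False) returns a job handle that DSSJobWaiter (from dataikuapi.dss.job) can block on, in case you want the scenario step to wait until all the parallel jobs have finished:

import dataiku
from dataikuapi.dss.job import DSSJobWaiter

client = dataiku.api_client()
project = client.get_default_project()

# Hypothetical names: the terminal dataset of each independent subflow
subflow_outputs = ["subflow_a_output", "subflow_b_output"]

# Fire all the builds without blocking; each call returns a job handle
jobs = [project.get_dataset(name).build(job_type="RECURSIVE_BUILD", wait=False)
        for name in subflow_outputs]

# Optionally block the step until every parallel job has completed
for job in jobs:
    DSSJobWaiter(job).wait()

print("all subflow builds finished")

Waiting at the end is optional; if you skip it, the step finishes as soon as the jobs are submitted and the builds continue in the background.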
I understand it may be more convenient if this could be defined directly in a scenario step, so I would suggest you submit it to Product Ideas: https://community.dataiku.com/t5/Product-Ideas/idb-p/Product_Ideas