One of the weaknesses of scenarios in DSS? Parallelization
Hello, I'm not sure which topic my post falls under.
However, I would like to share an observation for which I still have not found an adequate solution, other than tinkering with workarounds by duplicating scenarios:
Let's say you have a project with a multitude of distinct subflows with no dependencies between them. You want to create a scenario that builds the datasets in each part, so you make a scenario with a build step including the list of datasets to process. And there you can see that the step builds the datasets one after another, sequentially, rather than in parallel.
Of course, I could create a separate scenario for each subflow and run a multi-scenario build, but that creates the complexity of managing and configuring a multitude of scenarios.
So, is there a way to parallelize the objects to build inside the same step, or even within the same scenario?
Answers
Alexandru (Dataiker)
Hi @Grixis6,
Dataiku does parallelize in certain cases: SQL pipelines, Spark pipelines, and partitioned datasets.
If you don't want to manage multiple scenarios, you can achieve what you are looking for with a Python step in the scenario:
https://doc.dataiku.com/dss/latest/python-api/datasets-other.html#building-a-dataset
import dataiku

client = dataiku.api_client()
project = client.get_default_project()
flow = project.get_flow()
graph = flow.get_graph()

# Keep only the "right-most" items in the flow: those with no successors
edge_datasets = [d for d in graph.get_items_in_traversal_order() if not d["successors"]]

for ds in edge_datasets:
    ds_handle = project.get_dataset(ds["ref"])
    # wait=False submits the build and returns immediately instead of blocking
    ds_handle.build(job_type="RECURSIVE_BUILD", wait=False)

print("done")
In this example, it will do a recursive build of the right-most datasets in the project. You can adapt this to build from other projects or specific datasets only. The key here is wait=False, which sends all of the build requests without waiting for each build to complete. In cases like this, the defined concurrency limits determine how many of these jobs run right away versus being queued:
https://doc.dataiku.com/dss/latest/flow/limits.html#limiting-concurrent-executions
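For instance, here is a minimal sketch of the "specific datasets only" adaptation. The dataset names are hypothetical placeholders, and it assumes that build(wait=False) returns a job handle that DSSJobWaiter (from dataikuapi.dss.job) can block on, in case you want the scenario step to wait until all the parallel jobs have finished:

import dataiku
from dataikuapi.dss.job import DSSJobWaiter

client = dataiku.api_client()
project = client.get_default_project()

# Hypothetical names: the terminal dataset of each independent subflow
subflow_outputs = ["subflow_a_output", "subflow_b_output"]

# Fire all the builds without blocking; each call returns a job handle
jobs = [project.get_dataset(name).build(job_type="RECURSIVE_BUILD", wait=False)
        for name in subflow_outputs]

# Optionally block the step until every parallel job has completed
for job in jobs:
    DSSJobWaiter(job).wait()

print("all subflow builds finished")

Waiting at the end is optional; if you skip it, the step finishes as soon as the jobs are submitted and the builds continue in the background.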
I understand it may be more convenient if this could be defined directly in a scenario step, so I would suggest you submit it to Product Ideas: https://community.dataiku.com/t5/Product-Ideas/idb-p/Product_Ideas