Hello, I'm not sure which topic category this post belongs in.
However, I would like to share an observation for which I still have not found an adequate solution, other than working around it by duplicating scenarios:
Let's say,
you have projects with a multitude of distinct subflows with no dependencies between them. You want to create a scenario that builds the datasets in each subflow, so you create a scenario with a build step that lists the datasets to process. You will then see that Dataiku builds the datasets listed in the step one by one.
Of course, I could create one scenario per subflow and trigger them all together, but that adds the complexity of configuring and managing a multitude of scenarios.
So,
Is there a way to build objects in parallel within the same step, or at least within the same scenario?
Hi @Grixis6 ,
Dataiku parallelizes builds in certain cases: SQL pipelines, Spark pipelines, and partitioned datasets.
If you don't want to manage multiple scenarios, you can achieve what you are looking for with a Python step in the scenario.
https://doc.dataiku.com/dss/latest/python-api/datasets-other.html#building-a-dataset
import dataiku

client = dataiku.api_client()
project = client.get_default_project()
flow = project.get_flow()
graph = flow.get_graph()

# Keep only the flow items that have no successors (the right-most items).
edge_datasets = [d for d in graph.get_items_in_traversal_order() if not d["successors"]]

for ds in edge_datasets:
    ds_handle = project.get_dataset(ds["ref"])
    # wait=False submits the build job and returns immediately.
    ds_handle.build(job_type="RECURSIVE_BUILD", wait=False)

# Reached once all build requests are submitted, not once the builds finish.
print("done")
This example does a recursive build of the right-most datasets in the project. You can adapt it to build from other projects or to build specific datasets only. The key here is wait=False, which sends all of the build requests without waiting for each build to complete. In cases like this, the defined concurrency limits determine how many of these jobs run right away and how many are queued:
https://doc.dataiku.com/dss/latest/flow/limits.html#limiting-concurrent-executions
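One consequence of wait=False is that the Python step returns as soon as the requests are sent, so the scenario would not notice a build that fails later. Below is a minimal sketch of a polling helper that waits for all submitted jobs. The names `wait_for_jobs`, `failed_jobs` and `TERMINAL_STATES` are hypothetical, and the sketch assumes each job handle exposes `get_status()` returning a dictionary with the state under `["baseState"]["state"]`, taking one of the values `DONE`, `FAILED` or `ABORTED` once the job is over. Please check this against your DSS version before relying on it.

```python
import time

# Assumption: states after which a job no longer changes.
TERMINAL_STATES = {"DONE", "FAILED", "ABORTED"}


def wait_for_jobs(jobs, poll_seconds=5.0, sleep=time.sleep):
    """Poll every job handle until all are terminal; return {label: final state}.

    `jobs` maps a label (for example a dataset name) to an object exposing
    get_status(), such as the handle returned by build(wait=False).
    """
    pending = dict(jobs)
    final_states = {}
    while pending:
        for label, job in list(pending.items()):
            # Assumed layout of the status dictionary: {"baseState": {"state": ...}}
            state = job.get_status().get("baseState", {}).get("state", "")
            if state in TERMINAL_STATES:
                final_states[label] = state
                del pending[label]
        if pending:
            # Some jobs are still queued or running: pause before polling again.
            sleep(poll_seconds)
    return final_states


def failed_jobs(final_states):
    """Return the sorted labels of jobs that did not end in the DONE state."""
    return sorted(label for label, state in final_states.items() if state != "DONE")
```

In the scenario step, you could collect the handles in a dictionary, for example `jobs[ds["ref"]] = ds_handle.build(job_type="RECURSIVE_BUILD", wait=False)` inside the loop, then call `wait_for_jobs(jobs)` and raise an exception if `failed_jobs(...)` is not empty, so that the step fails when any build fails.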
I understand it would be more convenient if this could be defined directly in a scenario step, so I would suggest you submit it as a Product Idea: https://community.dataiku.com/t5/Product-Ideas/idb-p/Product_Ideas