How to retrieve input datasets for a specific dataset using the Python API?
Hi everyone,
I'm trying to use the Dataiku Python API to identify which input datasets were used to create a specific dataset within a project.
For example, in the project "PRISME_INTEGRATION_TABLES", I want to retrieve the direct input datasets that were used to generate the dataset "PRS_Decision_Complement".
I attempted to use project.get_flow().get_graph() but ran into a TypeError: 'DSSProjectFlowGraph' object is not subscriptable, and I'm unsure how to properly access the recipes or the flow connections.
Is there a recommended way to extract this information?
Any guidance or sample code would be greatly appreciated!
Thanks in advance for your help!
Best Answer
Hi,
The following code example lists the input datasets that were used by the recipe(s) that build a given output dataset:
import dataiku

client = dataiku.api_client()
project = client.get_project("PROJECTKEY")
recipes = project.list_recipes()

# Name of the output dataset whose direct inputs you want to find
target = "OUTPUT_DATASET"
input_datasets = set()

for recipe_item in recipes:
    recipe = project.get_recipe(recipe_item["name"])
    settings = recipe.get_settings()

    # Collect the names of everything this recipe writes to
    outputs = []
    for out_role, out_objs in settings.get_recipe_outputs().items():
        outputs += [obj['ref'].split('.')[-1] for obj in out_objs['items']]

    # If the recipe produces the target, record all of its inputs
    if target in outputs:
        for in_role, in_objs in settings.get_recipe_inputs().items():
            input_datasets.update([obj['ref'].split('.')[-1] for obj in in_objs['items']])

print(list(input_datasets))

Let me know if that helps!
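If you want to check the matching logic without connecting to a DSS instance, here is a minimal sketch of the same idea as a plain function. The helper name `direct_inputs` and the sample data are hypothetical, made up for illustration. The only thing taken from the code above is the assumed dict shape `{role: {"items": [{"ref": ...}]}}`.

```python
def direct_inputs(recipes, target):
    """Return sorted names of datasets feeding any recipe that outputs `target`.

    `recipes` is an iterable of (inputs, outputs) pairs, each a dict shaped
    like {role: {"items": [{"ref": "PROJECTKEY.dataset"}, ...]}}.
    """
    def names(roles):
        # Drop an optional "PROJECTKEY." prefix from each ref
        return {
            item["ref"].split(".")[-1]
            for role in roles.values()
            for item in role.get("items", [])
        }

    found = set()
    for inputs, outputs in recipes:
        if target in names(outputs):
            found |= names(inputs)
    return sorted(found)


# Hypothetical sample data, not taken from a real project
sample = [
    ({"main": {"items": [{"ref": "raw_a"}, {"ref": "OTHER.raw_b"}]}},
     {"main": {"items": [{"ref": "PRS_Decision_Complement"}]}}),
    ({"main": {"items": [{"ref": "PRS_Decision_Complement"}]}},
     {"main": {"items": [{"ref": "downstream"}]}}),
]

print(direct_inputs(sample, "PRS_Decision_Complement"))  # ['raw_a', 'raw_b']
```

On a real project you would build the pairs from `settings.get_recipe_inputs()` and `settings.get_recipe_outputs()` for each recipe. Note that stripping the project key means two datasets with the same name in different projects are treated as one, so keep the full ref if you use shared datasets.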