How to retrieve input datasets for a specific dataset using the Python API?

ELACHAR Registered Posts: 4 ✭✭

Hi everyone,

I'm trying to use the Dataiku Python API to identify which input datasets were used to create a specific dataset within a project.

For example, in the project "PRISME_INTEGRATION_TABLES", I want to retrieve the direct input datasets that were used to generate the dataset "PRS_Decision_Complement".

I attempted to use project.get_flow().get_graph() but ran into a TypeError: 'DSSProjectFlowGraph' object is not subscriptable, and I'm unsure how to properly access the recipes or the flow connections.

Is there a recommended way to extract this information?
Any guidance or sample code would be greatly appreciated!

Thanks in advance for your help!

Best Answer

  • Lautaro Registered Posts: 3 ✭✭
    Answer ✓

    Hi,

    The following code example loops over the project's recipes, finds every recipe that produces a given output dataset, and lists the input datasets of those recipes:

    import dataiku
    
    client = dataiku.api_client()
    project = client.get_project("PROJECTKEY")
    recipes = project.list_recipes()
    
    # Name of the dataset whose direct inputs you want (i.e. the recipe OUTPUT)
    target = "OUTPUTDATASET"
    input_datasets = set()
    
    for recipe_item in recipes:
        recipe = project.get_recipe(recipe_item["name"])
        settings = recipe.get_settings()
        # Collect this recipe's output names, dropping any "PROJECTKEY." prefix
        outputs = []
        for out_role, out_objs in settings.get_recipe_outputs().items():
            outputs += [obj['ref'].split('.')[-1] for obj in out_objs['items']]
        # If the recipe builds the target, record all of its inputs
        if target in outputs:
            for in_role, in_objs in settings.get_recipe_inputs().items():
                input_datasets.update([obj['ref'].split('.')[-1] for obj in in_objs['items']])
    print(list(input_datasets))
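
If it helps, the matching logic in the loop above can be pulled into a small helper and checked without a DSS instance. This is only a sketch: it assumes each recipe's inputs and outputs have the shape the code above reads (a dict of roles, each holding an 'items' list whose entries carry a 'ref'). The names short_name, direct_inputs and sample_recipes are hypothetical, and the sample data is made up for illustration.

```python
def short_name(ref):
    """Drop an optional 'PROJECTKEY.' prefix from a dataset reference."""
    return ref.split(".")[-1]


def direct_inputs(recipes, target):
    """Return sorted names of datasets feeding any recipe that outputs `target`.

    `recipes` is an iterable of (inputs, outputs) pairs, each a dict of
    role -> {"items": [{"ref": ...}, ...]} (assumed shape, see note above).
    """
    found = set()
    for inputs, outputs in recipes:
        produced = {
            short_name(item["ref"])
            for role in outputs.values()
            for item in role["items"]
        }
        if target in produced:
            found.update(
                short_name(item["ref"])
                for role in inputs.values()
                for item in role["items"]
            )
    return sorted(found)


# Hypothetical sample data: raw_a + raw_b -> joined -> final
sample_recipes = [
    ({"main": {"items": [{"ref": "raw_a"}, {"ref": "OTHER.raw_b"}]}},
     {"main": {"items": [{"ref": "joined"}]}}),
    ({"main": {"items": [{"ref": "joined"}]}},
     {"main": {"items": [{"ref": "final"}]}}),
]

print(direct_inputs(sample_recipes, "joined"))  # ['raw_a', 'raw_b']
```

With a live project, you could build the pairs from settings.get_recipe_inputs() and settings.get_recipe_outputs() for each recipe, as in the loop above. Note that stripping the project prefix means datasets shared from another project are reported by name only.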
    

    Let me know if that helps!
