How to find unused shared-in datasets of a project with the Python API?

Haoran
Haoran Registered Posts: 8 ✭✭✭

Thanks in advance for your time.

I have a project, and I want to use the Python API to find out which datasets are shared in from other projects (black icons).

However, those shared-in datasets can be separated into 2 groups:
① unused: just shown in the zone for reference
② used: used by a recipe for analysis


I have already found a way to identify the used shared-in datasets, from a previous Q&A:

import dataiku

client = dataiku.api_client()
project = client.get_default_project()
flow = project.get_flow()
# Shared-in datasets are named "PROJECTKEY.dataset_name", hence the '.' test
shared_used_datasets = [dataset for dataset in flow.get_graph().get_source_datasets() if '.' in dataset.dataset_name]
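
The '.' test above relies on shared-in datasets carrying a full name of the form PROJECTKEY.dataset_name. As a minimal sketch under that assumption, the names can also be grouped by the project they come from. The helper name `group_by_source_project` and the sample names are hypothetical and not part of the Dataiku API; the function only works on plain strings.

```python
from collections import defaultdict


def group_by_source_project(full_names):
    """Group 'PROJECTKEY.dataset' names by their source project key.

    Names without a '.' are treated as local datasets and skipped.
    """
    grouped = defaultdict(list)
    for full_name in full_names:
        # Split on the first '.', assuming project keys never contain one
        project_key, sep, local_name = full_name.partition(".")
        if not sep:
            continue  # local dataset, not shared in
        grouped[project_key].append(local_name)
    return dict(grouped)


# Hypothetical names, for illustration only
example = group_by_source_project(
    ["SALES.orders", "SALES.customers", "HR.staff", "local_ds"]
)
print(example)
```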

However, how could I find those unused datasets? Do you have any ideas?

Thanks!

Operating system used: win11

Best Answer

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,384 Dataiker
    Answer ✓

    Hi,
    What are you hoping to do with the unused shared datasets once you find them?
    You can only remove them by un-sharing them from the original project in the UI; there is no public API to unshare or unexpose items. Via the API you can only add (not remove) exposed/shared objects: https://developer.dataiku.com/latest/api-reference/python/projects.html#dataikuapi.dss.project.DSSProjectSettings.add_exposed_object

    That being said, if you simply want to know which shared datasets fall into each of the 2 categories, you can do something like the code below.
    Note that this can be expensive on large instances with thousands of projects, because you need to loop through all existing projects to find the objects they expose to this project.

    import dataiku

    client = dataiku.api_client()
    project = client.get_default_project()
    flow = project.get_flow()
    
    # Shared-in datasets that feed a recipe: flow sources whose name is "PROJECTKEY.name"
    used_datasets = [
        ds.dataset_name
        for ds in flow.get_graph().get_source_datasets()
        if '.' in ds.dataset_name
    ]
    
    print("Used datasets:", used_datasets)
    
    # Every dataset that any project on the instance exposes to this project
    filter_by_project = project.project_key
    standalone_datasets_shared_to_project = [
        f"{project_key}.{obj['localName']}"
        for project_key in client.list_project_keys()
        for obj in client.get_project(project_key)
                         .get_settings()
                         .get_raw()['exposedObjects']['objects']
        if obj['type'] == 'DATASET'
        for rule in obj['rules']
        if rule['targetProject'] == filter_by_project
    ]
    
    print("Shared datasets to project:", standalone_datasets_shared_to_project)
    # Split the shared datasets by whether they appear among the used ones
    shared_used = [
        ds for ds in standalone_datasets_shared_to_project
        if ds in used_datasets
    ]
    shared_not_used = [
        ds for ds in standalone_datasets_shared_to_project
        if ds not in used_datasets
    ]
    
    print("Shared AND used:", shared_used)
    print("Shared but NOT used:", shared_not_used)
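
The classification step above can also be checked without a DSS instance. The following is a minimal sketch, assuming each project's `exposedObjects` settings have the shape the snippet above reads (`objects`, each with `type`, `localName`, and `rules` carrying a `targetProject`). The function name `classify_shared_datasets` and the sample data are hypothetical; on a real instance the dict would be built from `get_settings().get_raw()['exposedObjects']` for each project key.

```python
def classify_shared_datasets(exposed_by_project, target_project, used_names):
    """Split datasets shared to target_project into (used, unused) lists.

    exposed_by_project maps a source project key to that project's raw
    'exposedObjects' dict (shape assumed from the snippet above).
    """
    used = set(used_names)  # set gives constant-time membership tests
    shared = []
    for source_key, exposed in exposed_by_project.items():
        for obj in exposed.get("objects", []):
            if obj.get("type") != "DATASET":
                continue  # ignore exposed models, folders, etc.
            rules = obj.get("rules", [])
            if any(r.get("targetProject") == target_project for r in rules):
                shared.append(f"{source_key}.{obj['localName']}")
    shared_used = [name for name in shared if name in used]
    shared_unused = [name for name in shared if name not in used]
    return shared_used, shared_unused


# Hypothetical sample data, for illustration only
sample = {
    "SALES": {"objects": [
        {"type": "DATASET", "localName": "orders",
         "rules": [{"targetProject": "REPORTING"}]},
        {"type": "DATASET", "localName": "customers",
         "rules": [{"targetProject": "REPORTING"}]},
        {"type": "SAVED_MODEL", "localName": "m1",
         "rules": [{"targetProject": "REPORTING"}]},
    ]},
    "HR": {"objects": [
        {"type": "DATASET", "localName": "staff",
         "rules": [{"targetProject": "OTHER"}]},
    ]},
}

used, unused = classify_shared_datasets(sample, "REPORTING", ["SALES.orders"])
print("Shared AND used:", used)
print("Shared but NOT used:", unused)
```

Keeping this logic in a plain function makes it easy to run the expensive per-project settings lookup once, cache the result, and re-classify later as the flow changes.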
    