How to find unused shared-in datasets of a project with the Python API?
Thanks in advance for your time.
I have a project and I want to find out, using the Python API, which datasets are shared in from other projects (black icons).
However, those shared-in datasets can be separated into 2 groups:
① unused: just shown in the Flow zone for checking
② used: consumed by a recipe for analysis
I have already found a way to identify the used shared-in datasets through a previous Q&A:
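import dataiku

client = dataiku.api_client()
project = client.get_default_project()
flow = project.get_flow()

# Source datasets whose name contains a '.' (PROJECTKEY.dataset_name) come from other projects
shared_used_datasets = [
    dataset
    for dataset in flow.get_graph().get_source_datasets()
    if '.' in dataset.dataset_name
]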
However, how could I find those unused datasets? Do you have any ideas?
Thanks!
Operating system used: win11
Best Answer
Alexandru (Dataiker)
Hi,
What are you hoping to do with the unused shared datasets once you find them?
You can only remove them by un-sharing them from the original project in the UI; there is no public API to unshare or unexpose items. Via the API you can only add (not remove) exposed/shared objects.
That being said, if you simply want to know which shared datasets fall into each of the 2 categories, you can do something like this:
import dataiku

# Note: this can be expensive on large instances with thousands of projects,
# because it loops through every existing project to find the objects exposed to this one.
client = dataiku.api_client()
project = client.get_default_project()
flow = project.get_flow()

# Shared-in datasets that appear as Flow sources, named "PROJECTKEY.dataset_name"
used_datasets = [
    ds.dataset_name
    for ds in flow.get_graph().get_source_datasets()
    if '.' in ds.dataset_name
]
print("Used datasets:", used_datasets)

filter_by_project = project.project_key

# Every dataset that any project exposes to the current project
standalone_datasets_shared_to_project = [
    f"{project_key}.{obj['localName']}"
    for project_key in client.list_project_keys()
    for obj in client.get_project(project_key)
                     .get_settings()
                     .get_raw()['exposedObjects']['objects']
    if obj['type'] == 'DATASET'
    for rule in obj['rules']
    if rule['targetProject'] == filter_by_project
]
print("Shared datasets to project:", standalone_datasets_shared_to_project)

shared_used = [ds for ds in standalone_datasets_shared_to_project if ds in used_datasets]
shared_not_used = [ds for ds in standalone_datasets_shared_to_project if ds not in used_datasets]

print("Shared AND used:", shared_used)
print("Shared but NOT used:", shared_not_used)
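If it helps to check the classification logic without touching a live instance, the same steps can be sketched as plain functions over dictionaries. This is only an illustration: the helper names `datasets_exposed_to` and `split_by_usage` and the sample project keys are made up for this example, and the sample data merely assumes the raw settings shape used in the snippet above (`exposedObjects` → `objects` → `rules` → `targetProject`).

```python
def datasets_exposed_to(raw_settings_by_project, target_project_key):
    """Collect 'SOURCEKEY.dataset' names that source projects expose to the target.

    raw_settings_by_project is a hypothetical dict mapping a project key to the
    dict returned by get_settings().get_raw() for that project.
    """
    shared = set()
    for source_key, raw in raw_settings_by_project.items():
        # Use .get() so projects that expose nothing do not raise KeyError
        objects = raw.get("exposedObjects", {}).get("objects", [])
        for obj in objects:
            if obj.get("type") != "DATASET":
                continue
            rules = obj.get("rules", [])
            if any(rule.get("targetProject") == target_project_key for rule in rules):
                shared.add(f"{source_key}.{obj['localName']}")
    return shared


def split_by_usage(shared, used):
    """Return (shared and used, shared but not used) as sorted lists."""
    shared, used = set(shared), set(used)
    return sorted(shared & used), sorted(shared - used)


# Hypothetical sample data standing in for real project settings
sample_settings = {
    "SALES": {"exposedObjects": {"objects": [
        {"type": "DATASET", "localName": "orders",
         "rules": [{"targetProject": "REPORTING"}]},
        {"type": "DATASET", "localName": "customers",
         "rules": [{"targetProject": "REPORTING"}]},
        {"type": "SAVED_MODEL", "localName": "m1",
         "rules": [{"targetProject": "REPORTING"}]},
    ]}},
    "HR": {"exposedObjects": {"objects": [
        {"type": "DATASET", "localName": "staff",
         "rules": [{"targetProject": "OTHER"}]},
    ]}},
    "EMPTY": {},
}

shared = datasets_exposed_to(sample_settings, "REPORTING")
shared_used, shared_not_used = split_by_usage(shared, ["SALES.orders"])

print("Shared AND used:", shared_used)
print("Shared but NOT used:", shared_not_used)
```

On a real instance you would build the input dict from `client.list_project_keys()` and each project's raw settings, and pass in the `used_datasets` list from the snippet above. Using sets keeps the membership checks cheap when many datasets are shared.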
