
Dataset unique identifier

Solved!
Patrik

I am trying to create a list of all Datasets used across all Projects.

Is there a unique identifier (key) for a dataset?

Can you provide context behind internal relations between Project, Connection, and Dataset? I have spotted: dataset_project_key, dataset_name, dataset.full_name, dataset_smartname

Also, how does the dataset sharing between projects work? Do I get duplicates if I list all datasets in all projects? How do I handle these duplicates?


Operating system used: Linux

ZachM
Dataiker

Hi Patrik,

You can use the full_name attribute as the unique identifier. The full name is the project key followed by the dataset name (i.e., PROJECT_KEY.DATASET_NAME), which makes it globally unique.

The dataset name itself (without the project key) is unique only within a single project. Most of the other attributes (DSSDataset.name, DSSDataset.id, etc.) are set to this project-local name.
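As a minimal sketch of that naming scheme, the two helpers below build a full name and split it back into its parts. The function names are hypothetical (they are not part of the Dataiku API), and the split assumes that a project key never contains a dot.

```python
from typing import Tuple


def make_full_name(project_key: str, dataset_name: str) -> str:
    """Join a project key and a dataset name as PROJECT_KEY.DATASET_NAME."""
    return f"{project_key}.{dataset_name}"


def split_full_name(full_name: str) -> Tuple[str, str]:
    """Split a full name at the first dot into (project_key, dataset_name).

    Assumption: the project key itself contains no dot, so the first dot
    is always the separator.
    """
    project_key, sep, dataset_name = full_name.partition(".")
    if not sep or not project_key or not dataset_name:
        raise ValueError(f"not a full dataset name: {full_name!r}")
    return project_key, dataset_name
```

For example, with a made-up project key SALES and dataset name orders, make_full_name("SALES", "orders") gives "SALES.orders", and split_full_name reverses it.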


Each dataset is contained within a single project. When you share a dataset with another project, you are only giving that project a read-only view of the data. You can only modify a dataset through the origin project that contains it.

When you list datasets using the Python API (for example, with DSSProject.list_datasets()), the result doesn't include datasets shared from other projects, so you don't have to worry about duplicates.


Additionally, here's some sample code that iterates through all datasets across all projects, and adds their full name to a set:

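import dataiku

# Assumes this runs inside DSS (notebook or recipe). From outside DSS,
# create the client with dataikuapi.DSSClient(host, api_key) instead.
client = dataiku.api_client()

datasets = set()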

for project_key in client.list_project_keys():
    project = client.get_project(project_key)

    for dataset in project.list_datasets(as_type='objects'):
        # Generate the full_name manually since the DSSDataset class doesn't
        # have an attribute for it
        full_name = f'{project_key}.{dataset.name}'
        print(full_name)

        assert full_name not in datasets, f'already in datasets: {full_name}'
        datasets.add(full_name)
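Once you have a set of full names like the one built above, you may want to see it per project. The following is a rough sketch in plain Python, with no DSS calls. The function name and the example project keys are made up for illustration, and it again assumes that project keys contain no dot.

```python
from collections import defaultdict


def group_by_project(full_names):
    """Group PROJECT_KEY.DATASET_NAME strings into {project_key: [names]}."""
    grouped = defaultdict(list)
    for full_name in sorted(full_names):
        # Split at the first dot only, so the rest stays the dataset name
        project_key, _, dataset_name = full_name.partition(".")
        grouped[project_key].append(dataset_name)
    return dict(grouped)


# Hypothetical full names, standing in for the `datasets` set
example = {"SALES.orders", "SALES.customers", "HR.employees"}
print(group_by_project(example))
# {'HR': ['employees'], 'SALES': ['customers', 'orders']}
```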

Best,

Zach M
