
Dataset unique identifier

Solved!
Patrik
Level 2

I am trying to create a list of all Datasets used across all Projects.

Is there a unique identifier (key) for a dataset?

Can you provide context behind internal relations between Project, Connection, and Dataset? I have spotted: dataset_project_key, dataset_name, dataset.full_name, dataset_smartname

Also, how does the dataset sharing between projects work? Do I get duplicates if I list all datasets in all projects? How do I handle these duplicates?


Operating system used: Linux

ZachM
Dataiker

Hi Patrik,

You can use the dataset's full name (full_name) as the unique identifier. The full name is the project key followed by the dataset name (i.e., PROJECT_KEY.DATASET_NAME), which makes it globally unique.

The dataset name on its own (without the project key) is only unique within a single project. Most of the other attributes hold this value (DSSDataset.name, DSSDataset.id, etc.).
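To make the PROJECT_KEY.DATASET_NAME convention concrete, here is a minimal sketch of two helpers that build and split a full name. The function names make_full_name and split_full_name are hypothetical (they are not part of the Dataiku API), and the split assumes that project keys never contain a dot, so the first dot is taken as the separator.

```python
from typing import Tuple


def make_full_name(project_key: str, dataset_name: str) -> str:
    """Join a project key and a dataset name into PROJECT_KEY.DATASET_NAME."""
    return f"{project_key}.{dataset_name}"


def split_full_name(full_name: str) -> Tuple[str, str]:
    """Split PROJECT_KEY.DATASET_NAME into (project_key, dataset_name).

    Assumption: project keys contain no dots, so the first dot separates
    the project key from the dataset name.
    """
    project_key, sep, dataset_name = full_name.partition(".")
    if not sep or not project_key or not dataset_name:
        raise ValueError(f"not a full dataset name: {full_name!r}")
    return project_key, dataset_name
```

Keeping the project key and the dataset name as separate fields alongside the full name can make it easier to group or filter the inventory by project later.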


Each dataset belongs to exactly one project. When you share a dataset with another project, you are only giving that project a read-only view of the data. You can only modify the dataset through the origin project that contains it.

When you list datasets using the Python API (for example, by using DSSProject.list_datasets()), it doesn't include shared datasets, so you don't have to worry about it returning duplicates.


Additionally, here's some sample code that iterates through all datasets across all projects, and adds their full name to a set:

import dataiku

# Inside DSS (notebook or recipe), this returns a client for the current instance
client = dataiku.api_client()

datasets = set()

for project_key in client.list_project_keys():
    project = client.get_project(project_key)

    for dataset in project.list_datasets(as_type='objects'):
        # Build the full name manually, since the DSSDataset class doesn't
        # have an attribute for it
        full_name = f'{project_key}.{dataset.name}'
        print(full_name)

        # Full names should be globally unique, so a repeat indicates a problem
        assert full_name not in datasets, f'already in datasets: {full_name}'
        datasets.add(full_name)
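
If you want a reusable inventory rather than printed output, the same loop can be wrapped in a function that returns one record per dataset. This is only a sketch: collect_dataset_index is a hypothetical helper name, and it assumes the client object exposes list_project_keys(), get_project() and list_datasets(as_type='objects') exactly as used in the sample above.

```python
from typing import Dict, List


def collect_dataset_index(client) -> List[Dict[str, str]]:
    """Return one record per dataset across all projects, sorted by full name.

    `client` is assumed to behave like the API client used in the sample
    above; this function does not import or create it.
    """
    index: Dict[str, Dict[str, str]] = {}

    for project_key in client.list_project_keys():
        project = client.get_project(project_key)

        for dataset in project.list_datasets(as_type="objects"):
            full_name = f"{project_key}.{dataset.name}"

            # A repeated full name would mean the uniqueness assumption is broken
            if full_name in index:
                raise ValueError(f"already in index: {full_name}")

            index[full_name] = {
                "project_key": project_key,
                "dataset_name": dataset.name,
                "full_name": full_name,
            }

    return [index[key] for key in sorted(index)]
```

Inside DSS you would pass in the client from dataiku.api_client(), as in the sample above. The returned list of dictionaries can then be loaded into a DataFrame or written to a dataset.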

 

 

Best,

Zach M

Patrik
Level 2
Author

Shouldn't the unique identifier be the "id" instead of the "name"?

ZachM
Dataiker

dataset.name and dataset.id are the same value. The name and ID are aliases of each other.