Dataset unique identifier

Patrik · July 2022

I am trying to create a list of all Datasets used across all Projects.

Is there a unique identifier (key) for a dataset?

Can you provide context behind internal relations between Project, Connection, and Dataset? I have spotted: dataset_project_key, dataset_name, dataset.full_name, dataset_smartname

Also, how does the dataset sharing between projects work? Do I get duplicates if I list all datasets in all projects? How do I handle these duplicates?

Operating system used: Linux

Zach · July 2022

Hi Patrik,

You can use the full_name attribute as the unique identifier. The full name is the project key followed by the dataset name (i.e., PROJECT_KEY.DATASET_NAME), which makes it globally unique.
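To make the PROJECT_KEY.DATASET_NAME convention concrete, here is a minimal sketch of building and splitting such an identifier. The helper names make_full_name and split_full_name are hypothetical (they are not part of the Dataiku API), and the split assumes the project key itself never contains a dot.

```python
from typing import Tuple


def make_full_name(project_key: str, dataset_name: str) -> str:
    """Join a project key and a dataset name into PROJECT_KEY.DATASET_NAME."""
    return f"{project_key}.{dataset_name}"


def split_full_name(full_name: str) -> Tuple[str, str]:
    """Split PROJECT_KEY.DATASET_NAME at the first dot.

    Assumption: the project key contains no dot, so the first dot is the
    separator between the two parts.
    """
    project_key, sep, dataset_name = full_name.partition(".")
    if not sep or not project_key or not dataset_name:
        raise ValueError(f"not a full dataset name: {full_name!r}")
    return project_key, dataset_name


full_name = make_full_name("SALES", "orders")
print(full_name)
print(split_full_name(full_name))
```

Keeping the two parts recoverable is handy if you later need to go back from the identifier to the owning project.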

The dataset name itself (without the project key) is unique within a single project. This is what most of the other attributes are set to (DSSDataset.name, DSSDataset.id, etc.).

Each dataset is contained within a single project. Sharing a dataset with another project just gives that project a read-only view of the data. You can only modify a dataset through the origin project that contains it.

When you list datasets using the Python API (for example, by using DSSProject.list_datasets()), it doesn't include shared datasets, so you don't have to worry about it returning duplicates.

Additionally, here's some sample code that iterates through all datasets across all projects, and adds their full name to a set:

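import dataiku

# Assumes this runs inside DSS (e.g. in a notebook); outside DSS, build the
# client with dataikuapi.DSSClient(host, api_key) instead
client = dataiku.api_client()

datasets = set()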

for project_key in client.list_project_keys():
    project = client.get_project(project_key)

    for dataset in project.list_datasets(as_type='objects'):
        # Generate the full_name manually since the DSSDataset class doesn't
        # have an attribute for it
        full_name = f'{project_key}.{dataset.name}'
        print(full_name)

        assert full_name not in datasets, f'already in datasets: {full_name}'
        datasets.add(full_name)
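As a variation, the same loop can be wrapped in a function that takes the client as an argument, which makes it possible to try the logic without a live DSS instance. This is only a sketch: collect_dataset_full_names is a hypothetical helper name, and FakeClient, FakeProject, and FakeDataset are stand-ins written for this example that imitate just the three calls used above.

```python
def collect_dataset_full_names(client) -> set:
    """Return the set of PROJECT_KEY.DATASET_NAME strings across all projects."""
    full_names = set()
    for project_key in client.list_project_keys():
        project = client.get_project(project_key)
        for dataset in project.list_datasets(as_type='objects'):
            full_names.add(f'{project_key}.{dataset.name}')
    return full_names


# Hypothetical stand-ins so the function can run without DSS.
class FakeDataset:
    def __init__(self, name):
        self.name = name


class FakeProject:
    def __init__(self, dataset_names):
        self._dataset_names = dataset_names

    def list_datasets(self, as_type='listitems'):
        return [FakeDataset(name) for name in self._dataset_names]


class FakeClient:
    def __init__(self, projects):
        self._projects = projects

    def list_project_keys(self):
        return list(self._projects)

    def get_project(self, project_key):
        return FakeProject(self._projects[project_key])


# Two projects that both contain a dataset called "orders".
fake = FakeClient({'SALES': ['orders', 'customers'], 'HR': ['orders']})
names = collect_dataset_full_names(fake)
print(sorted(names))
```

Because the project key is part of each entry, the two "orders" datasets stay distinct in the resulting set. With a real client, you would pass the object returned by dataiku.api_client() instead of the fake one.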

Best,

Zach M

Patrik · August 2022

Shouldn't the unique identifier be the "id" instead of the "name"?

Zach · August 2022

dataset.name and dataset.id are the same value. The name and ID are aliases of each other.
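To illustrate what "aliases" means here, the toy class below exposes two read-only properties backed by the same stored value. ToyDataset is a hypothetical class written for this example; it is not the real DSSDataset and only mimics the idea.

```python
class ToyDataset:
    """Toy illustration only: two properties that return the same value."""

    def __init__(self, project_key: str, dataset_name: str):
        self.project_key = project_key
        self.dataset_name = dataset_name

    @property
    def name(self) -> str:
        return self.dataset_name

    @property
    def id(self) -> str:
        # Same underlying value as `name`
        return self.dataset_name


ds = ToyDataset("SALES", "orders")
print(ds.name == ds.id)
```

Either property can therefore be combined with the project key to form the globally unique identifier.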
