Dataset unique identifier

Patrik
Patrik Registered Posts: 5 ✭✭✭✭

I am trying to create a list of all Datasets used across all Projects.

Is there a unique identifier (key) for a dataset?

Can you explain how Project, Connection, and Dataset relate to each other internally? I have spotted these attributes: dataset_project_key, dataset_name, dataset.full_name, dataset_smartname

Also, how does dataset sharing between projects work? Do I get duplicates if I list all datasets in all projects? If so, how do I handle them?


Operating system used: Linux

Best Answer

  • Zach
    Zach Dataiker, Dataiku DSS Core Designer, Dataiku DSS Adv Designer, Registered Posts: 153 Dataiker
    edited July 17 Answer ✓

    Hi Patrik,

    You can use the full_name attribute as the unique identifier. The full name is the project key followed by a dot and the dataset name (i.e., PROJECT_KEY.DATASET_NAME), which makes it unique across all projects.

    The dataset name itself (without the project key) is only unique within a single project. Most of the other attributes are set to this name (DSSDataset.name, DSSDataset.id, etc.).
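
    As a minimal sketch of the naming scheme described above, the helpers below build a full name from its two parts and split it back again. The function names make_full_name and split_full_name are hypothetical (they are not part of the Dataiku API), and the split assumes that a project key never contains a dot.

```python
from typing import Tuple


def make_full_name(project_key: str, dataset_name: str) -> str:
    """Join a project key and a dataset name as PROJECT_KEY.DATASET_NAME."""
    return f"{project_key}.{dataset_name}"


def split_full_name(full_name: str) -> Tuple[str, str]:
    """Split a full name at its first dot (assumes no dot in the project key)."""
    project_key, sep, dataset_name = full_name.partition(".")
    if not sep or not project_key or not dataset_name:
        raise ValueError(f"not a full dataset name: {full_name!r}")
    return project_key, dataset_name
```

    For example, split_full_name(make_full_name("SALES", "orders")) gives back the pair ("SALES", "orders"), where SALES and orders are made-up names.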


    Each dataset belongs to exactly one project. When you share a dataset with another project, that project only gets a read-only view of the data. You can only modify a dataset through the project that contains it.

    When you list datasets using the Python API (for example, by using DSSProject.list_datasets()), it doesn't include shared datasets, so you don't have to worry about it returning duplicates.
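
    If you ever build a listing from a source that does mix in shared datasets, keying on the full name is one way to drop the duplicates. The sketch below is illustrative only: it assumes a shared dataset shows up in the consuming project as a reference already qualified with the owning project's key, while local datasets show up as bare names. The to_full_name helper and the sample data are hypothetical.

```python
def to_full_name(current_project_key: str, ref: str) -> str:
    """Qualify a bare dataset name with the project it was seen in."""
    # Assumption: a dot means the reference already carries its owning project key.
    return ref if "." in ref else f"{current_project_key}.{ref}"


# Hypothetical listing: REPORTING uses the "orders" dataset shared by SALES.
references = {
    "SALES": ["orders", "customers"],
    "REPORTING": ["SALES.orders", "kpis"],
}

# A set comprehension collapses both references to SALES.orders into one entry.
unique = sorted(
    {to_full_name(project_key, ref)
     for project_key, refs in references.items()
     for ref in refs}
)
print(unique)
```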


    Additionally, here's some sample code that iterates through all datasets across all projects, and adds their full name to a set:

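    import dataiku

    # Assumes the code runs inside DSS (for example, in a notebook),
    # where api_client() needs no arguments
    client = dataiku.api_client()

    # Full names (PROJECT_KEY.DATASET_NAME) of every dataset seen so far
    datasets = set()

    for project_key in client.list_project_keys():
        project = client.get_project(project_key)

        for dataset in project.list_datasets(as_type='objects'):
            # Generate the full_name manually since the DSSDataset class doesn't
            # have an attribute for it
            full_name = f'{project_key}.{dataset.name}'
            print(full_name)

            # Shared datasets are not listed, so no full name should appear twice
            assert full_name not in datasets, f'already in datasets: {full_name}'
            datasets.add(full_name)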

    Best,

    Zach M
