API to get dataset size in DSS

joshi123 Dataiku DSS Core Designer, Registered Posts: 3 ✭✭✭

I am trying to get the size of every dataset in my project using Python code in DSS, but I have not been able to extract this information. Can anyone help?

Answers

  • HarizoR
    HarizoR Dataiker, Alpha Tester, Registered Posts: 138 Dataiker
    edited July 17

    Hi joshi123,

    If by "size" you mean the number of rows and columns in your datasets, you can get both by retrieving metrics through the Dataset API. The example below builds a list of dictionaries, one per dataset, each holding the dataset's name and a (number_of_rows, number_of_columns) tuple:

    import dataikuapi

    # Connect to the DSS instance and open the project
    client = dataikuapi.DSSClient(host=YOUR_HOST, api_key=YOUR_API_KEY)
    project = client.get_project(YOUR_PROJECT_KEY)

    def last_val(metric):
        # Most recent value of a metric, or 0 if it has never been computed
        return metric["lastValues"][0]["value"] if metric["lastValues"] else 0

    dataset_sizes = []
    for d in project.list_datasets():
        dataset_handle = project.get_dataset(d.name)
        # (!) Recomputing metrics can be costly for large datasets
        dataset_handle.compute_metrics()
        metrics = dataset_handle.get_last_metric_values()
        dataset_sizes.append({"name": d.name,
                              "size": (last_val(metrics.get_metric_by_id("records:COUNT_RECORDS")),
                                       last_val(metrics.get_metric_by_id("basic:COUNT_COLUMNS")))})
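
    To check the last-value lookup without connecting to a DSS instance, here is a minimal standalone sketch. It assumes a metric is a dictionary of the shape the example above relies on ({"lastValues": [{"value": ...}]}). The names last_value and size_entry and all the sample data are hypothetical and only serve as illustration; they are not part of the dataikuapi package.

```python
def last_value(metric, default=0):
    """Return the most recent value stored in a metric dict.

    Assumes the shape {"lastValues": [{"value": ...}, ...]}.
    Falls back to `default` when the metric is missing or has no values.
    """
    if not metric:
        return default
    values = metric.get("lastValues") or []
    return values[0]["value"] if values else default


def size_entry(name, records_metric, columns_metric):
    """Build one {"name": ..., "size": (rows, columns)} item."""
    return {"name": name,
            "size": (last_value(records_metric), last_value(columns_metric))}


# Hypothetical sample data, not real DSS output.
sample_records = {"lastValues": [{"value": 1500}]}
sample_columns = {"lastValues": [{"value": 12}]}
never_computed = {"lastValues": []}

entries = [
    size_entry("orders", sample_records, sample_columns),
    size_entry("empty_ds", never_computed, never_computed),
]
for entry in entries:
    print(entry["name"], entry["size"])
```

    A dataset whose metrics were never computed ends up with a size of (0, 0) here, so a zero in the result may mean "not computed yet" rather than "empty dataset".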

    Best,

    Harizo
