How to get row count & dataset size using project.get_dataset() API?

nmadhu20
nmadhu20 Neuron, Registered, Neuron 2022, Neuron 2023 Posts: 35 Neuron

Hi Team,

We have a requirement to log the updated dataset size and row count for all datasets across different projects. We tried two approaches, both of which take a huge amount of computing time.

It would be really helpful if you can let us know if there is a better, optimized way to perform this given we will be spanning 100 projects.

  1. client = dataiku.api_client()
    project = client.get_project(project_name)
    dataset = project.get_dataset(dataset_name)
    size_metrics = dataset.compute_metrics(metric_ids=['basic:SIZE'])
    row_metrics = dataset.compute_metrics(metric_ids=['records:COUNT_RECORDS'])
  2. dt = dataiku.Dataset(dataset_name)
    df = dt.get_dataframe()

    rows = df.shape[0]  # counting via sum(1 for i in dt.iter_rows()) is much slower
    size_dt = df.memory_usage(deep=True)
    total_size = size_dt.sum() * 0.001

    # in KB, but there is a huge difference between this size and the one shown under 'Status' for a dataset
  3. We also tried running the code in a Kubernetes container, but it made no difference to the execution time.
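A note on approach 2, sketched below with plain pandas (no DSS dependency; the example frame just stands in for `dt.get_dataframe()`): `df.memory_usage(deep=True)` reports bytes for the in-memory pandas representation, which usually will not match the on-disk size shown in the Status tab.

```python
import pandas as pd

# Small example frame standing in for dt.get_dataframe()
df = pd.DataFrame({"id": range(1000),
                   "name": ["row_%d" % i for i in range(1000)]})

row_count = df.shape[0]                             # row count without iterating
bytes_in_memory = df.memory_usage(deep=True).sum()  # includes string payloads and the index
size_kb = bytes_in_memory * 0.001                   # decimal KB, as in the snippet above

print(row_count, size_kb)
```

Object (string) columns in particular are stored very differently in memory than in a compressed or columnar format on disk, so a large gap between the two numbers is normal.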

Best Answer

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,218 Dataiker
    Answer ✓

    Hi @nmadhu20
    ,

    Computing metrics can be a heavy operation depending on the dataset size and type. There is no general way to speed up compute_metrics().

    Your observation that Kubernetes made no difference is expected, since running the calling code in a container does not change how the metrics themselves are computed.

    One possible suggestion would be to compute these metrics automatically after the datasets are built. This spreads the computation out, since it runs whenever a dataset is rebuilt.

    That way you avoid calling compute_metrics() and instead use the last values computed when the dataset was built. You can enable this from the Status tab, or via the dataset settings in the API, for each dataset you need.
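To illustrate the "read the last computed values" approach: dataset.get_last_metric_values() returns the values stored at build time, so nothing is recomputed. The payload shape below is an assumption for illustration only (check get_raw() on your own instance); the helper is plain Python so it runs without DSS.

```python
def last_metric_value(raw_metrics, metric_id):
    """Pick the most recent value for one metric id out of a raw
    last-metric-values payload (assumed shape, see note above)."""
    for entry in raw_metrics.get("metrics", []):
        if entry.get("metric", {}).get("id") == metric_id:
            values = entry.get("lastValues", [])
            if values:
                return values[-1].get("value")
    return None

# In DSS, this dict would come from something like:
#   raw = dataset.get_last_metric_values().get_raw()
sample = {
    "metrics": [
        {"metric": {"id": "records:COUNT_RECORDS"},
         "lastValues": [{"value": "12345"}]},
        {"metric": {"id": "basic:SIZE"},
         "lastValues": [{"value": "67890"}]},
    ]
}

print(last_metric_value(sample, "records:COUNT_RECORDS"))  # -> 12345 (as a string)
```

Since this only reads stored values, looping it over 100 projects should be far cheaper than recomputing metrics each time.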

    [Attachment: Screenshot 2022-05-10 at 10.52.29.png]

    You mentioned the difference between this size and the one shown under 'Status' for a dataset.

    Can you elaborate and share an example? This may be expected, as the in-memory size in pandas is generally not the same as the size on disk, for example.

Answers

  • AntonB
    AntonB Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS Core Concepts, Dataiku DSS Adv Designer, Registered, Neuron 2022 Posts: 7 ✭✭✭✭

    Hi @AlexT
    , thanks, that sounds like a solution we will use as well. There is only one catch that I am trying to figure out: how to enable the "auto compute after build" metrics setting for all datasets across multiple projects. I am browsing through this site but have not pinpointed it yet: Python APIs — Dataiku DSS 10.0 documentation

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,218 Dataiker
    edited July 17

    Hi @AntonB
    ,

    You can change "auto compute after build" via the API using something like the following (I tested this on DSS 10.0.5).
    This enables compute-on-build for the records probe only (it is at index [1] in the list of probes). For different probes, you need to adjust the [1] index.

    import dataiku

    client = dataiku.api_client()
    current_project = client.get_default_project()

    all_datasets = current_project.list_datasets(as_type="object")
    for dataset in all_datasets:
        settings = dataset.get_settings()
        settings_raw = settings.get_raw()
        # probes[1] is the 'records' probe; adjust the index for other probe types
        settings_raw['metrics']['probes'][1]['computeOnBuildMode'] = 'PARTITION'
        settings.save()

    Hope this helps!
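On AntonB's question about multiple projects: one way is to wrap the loop above in an outer loop over project keys. The DSS calls are shown as comments so the helper itself runs anywhere; the settings-dict shape mirrors the single-project snippet, and the probe index is the same assumption as above.

```python
def enable_compute_on_build(settings_raw, probe_index=1, mode="PARTITION"):
    """Flip computeOnBuildMode on one probe of a raw dataset-settings
    dict, as in the single-project snippet above."""
    settings_raw["metrics"]["probes"][probe_index]["computeOnBuildMode"] = mode
    return settings_raw

# With DSS available, the cross-project loop would look roughly like:
#   client = dataiku.api_client()
#   for project_key in client.list_project_keys():
#       project = client.get_project(project_key)
#       for dataset in project.list_datasets(as_type="object"):
#           settings = dataset.get_settings()
#           enable_compute_on_build(settings.get_raw())
#           settings.save()

# Standalone check on a minimal settings dict:
raw = {"metrics": {"probes": [{"computeOnBuildMode": "NO"},
                              {"computeOnBuildMode": "NO"}]}}
enable_compute_on_build(raw)
print(raw["metrics"]["probes"][1]["computeOnBuildMode"])  # -> PARTITION
```

Be careful running this against all 100 projects at once: it rewrites dataset settings, so trying it on a test project first is advisable.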
