How to get row count & dataset size using project.get_dataset() API?

Solved!
nmadhu20
Neuron
Hi Team,

We have a requirement to log the updated dataset size and row count for all datasets across different projects. We tried two approaches, both of which take a huge amount of computing time.

It would be really helpful if you could let us know whether there is a better, more optimized way to do this, given that we will be spanning 100 projects.

  1. client = dataiku.api_client()
    project = client.get_project(project_name)
    dataset = project.get_dataset(dataset_name)
    m = dataset.compute_metrics(metric_ids=['basic:SIZE'])
    row = dataset.compute_metrics(metric_ids=['records:COUNT_RECORDS'])
  2. dt = dataiku.Dataset(dataset_name)
    df = dt.get_dataframe()

    rows = sum(1 for i in dt.iter_rows())
    #rows = df.shape[0]
    size_dt = df.memory_usage(deep=True)
    total_size = size_dt.sum()*0.001

    #in KB, but there is a huge difference between this size and the one shown under 'Status' for a dataset
  3. We also tried running the code in a Kubernetes container, but it made no difference to the execution time.
1 Solution
AlexT
Dataiker

Hi @nmadhu20,

Computing metrics can be a heavy operation, depending on the dataset size and type. In general, there is no real way to speed up compute_metrics().

Seeing no difference in speed when running on K8s is expected.

One possible suggestion would be to compute these metrics right after the datasets are built. This spreads the computation out, since it happens whenever a dataset is built.

This avoids calling compute_metrics(); instead, you use the last values that were computed when the dataset was built. You can enable this from the Status tab, or through the dataset settings in the API, for each dataset you need.
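To illustrate reading the last computed values instead of recomputing them, here is a minimal sketch. It assumes the metrics basic:SIZE and records:COUNT_RECORDS have already been computed at least once (for example on build). The helper name collect_last_metrics and the shape of the returned rows are hypothetical, and how the API behaves for a metric that has never been computed may differ between DSS versions, so the code guards against it.

```python
def collect_last_metrics(client, project_keys,
                         metric_ids=("basic:SIZE", "records:COUNT_RECORDS")):
    """Read the last stored metric values for every dataset, without recomputing.

    `client` is expected to behave like the object returned by
    dataiku.api_client(); the function name and the output layout are
    illustrative only.
    """
    rows = []
    for project_key in project_keys:
        project = client.get_project(project_key)
        for dataset in project.list_datasets(as_type="objects"):
            # Fetches stored values only; no metric computation is triggered.
            last = dataset.get_last_metric_values()
            available = set(last.get_all_ids())
            row = {"project": project_key, "dataset": dataset.dataset_name}
            for metric_id in metric_ids:
                value = None
                if metric_id in available:
                    try:
                        value = last.get_global_value(metric_id)
                    except Exception:
                        # Assumption: a metric may be listed but hold no
                        # global value yet; report it as missing.
                        value = None
                row[metric_id] = value
            rows.append(row)
    return rows
```

Inside DSS this could be called as collect_last_metrics(dataiku.api_client(), ["PROJECT_A", "PROJECT_B"]), where the project keys are placeholders. A value of None means the metric has not been computed yet for that dataset.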

You mentioned a difference between the size you computed and the one shown under 'Status' for a dataset.

Can you elaborate and share an example? This may be expected, since the size of a dataset in memory in pandas may not be the same as its size on disk, for example.

 

3 Replies
AntonB
Neuron

Hi @AlexT, thanks, that sounds like a solution we will use as well. There is only one catch that I am trying to figure out: how to enable "auto compute after build" for the selected metrics on all datasets across multiple projects. I have been browsing this site but have not pinpointed it yet: Python APIs — Dataiku DSS 10.0 documentation

AlexT
Dataiker

Hi @AntonB,

You can change "auto compute after build" via the API using something like the code below (I tested this on DSS 10.0.5).
This enables auto compute after build only for the probe of type records (it is at index [1] in the list of probes). For other probes, you need to adjust the index [1] in the code.

 

import dataiku

client = dataiku.api_client()
current_project = client.get_default_project()

# Loop over every dataset of the current project
all_datasets = current_project.list_datasets(as_type="object")
for dataset in all_datasets:
    settings = dataset.get_settings()
    settings_raw = settings.get_raw()
    # probes[1] is the records probe here; adjust the index for other probes
    settings_raw['metrics']['probes'][1]['computeOnBuildMode'] = 'PARTITION'
    settings.save()
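Since the question was about multiple projects, here is a hedged variant of the same idea that loops over a list of project keys and picks the probe by its type instead of by its position in the list. It assumes each probe in the raw settings carries a "type" field with the value "records" for the records probe; the helper name enable_compute_on_build and its parameters are hypothetical, so please check the raw settings on your own instance first.

```python
def enable_compute_on_build(client, project_keys,
                            probe_type="records", mode="PARTITION"):
    """Turn on compute-on-build for one probe type across several projects.

    `client` is expected to behave like dataiku.api_client(). The "type"
    key on each probe is an assumption about the raw settings layout.
    Returns the (project_key, dataset_name) pairs that were modified.
    """
    changed = []
    for project_key in project_keys:
        project = client.get_project(project_key)
        for dataset in project.list_datasets(as_type="objects"):
            settings = dataset.get_settings()
            raw = settings.get_raw()
            probes = raw.get("metrics", {}).get("probes", [])
            touched = False
            for probe in probes:
                # Match by type rather than relying on the probe being at [1].
                if (probe.get("type") == probe_type
                        and probe.get("computeOnBuildMode") != mode):
                    probe["computeOnBuildMode"] = mode
                    touched = True
            if touched:
                # Save only datasets whose settings actually changed.
                settings.save()
                changed.append((project_key, dataset.dataset_name))
    return changed
```

For example, enable_compute_on_build(dataiku.api_client(), ["PROJECT_A", "PROJECT_B"]) would cover two projects, where the keys are placeholders. Matching on the probe type avoids a surprise on datasets whose probe list is in a different order or has fewer than two entries.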

 

Hope this helps!