
How to get row count & dataset size using project.get_dataset() API?

Solved!
nmadhu20
Neuron

Hi Team,

We have a requirement to log the current dataset size and row count for all datasets across different projects. We tried two approaches, both of which take a huge amount of computing time.

It would be really helpful if you could let us know whether there is a better, more optimized way to do this, given that we will be spanning 100 projects.

  1. import dataiku
    client = dataiku.api_client()
    project = client.get_project(project_name)
    dataset = project.get_dataset(dataset_name)
    # each call triggers a full metric computation on the dataset
    m = dataset.compute_metrics(metric_ids=['basic:SIZE'])
    row = dataset.compute_metrics(metric_ids=['records:COUNT_RECORDS'])
  2. dt = dataiku.Dataset(dataset_name)
    df = dt.get_dataframe()

    rows = sum(1 for i in dt.iter_rows())
    #rows = df.shape[0]
    size_dt = df.memory_usage(deep=True)
    total_size = size_dt.sum()*0.001

    #in KB, but there is a huge difference between this size and the one shown under 'status' for a dataset
  3. We also tried running the code in a Kubernetes container, but it showed no difference in the execution time.
AlexT
Dataiker

Hi @nmadhu20,

Computing metrics can be a heavy operation, depending on the dataset size and type. In general, there is no real way to speed up compute_metrics().

Your observation that running the code in K8s made no difference to the speed is expected.

One suggestion would be to compute these metrics automatically each time a dataset is built. This spreads the computation across builds instead of running it all at once.

This avoids calling compute_metrics(); instead, you read the last values that were computed when the dataset was built. You can enable this from the Status tab, or through the dataset settings in the API, for each dataset you need.
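As a rough sketch of reading the last computed values rather than recomputing them: the helper below loops over projects and datasets and calls get_last_metric_values() on each dataset. The function name collect_last_metrics and the shape of the returned records are illustrative only, not part of the Dataiku API. It assumes the metrics have already been computed at least once, and it records None otherwise.

```python
def collect_last_metrics(client, project_keys,
                         metric_ids=("basic:SIZE", "records:COUNT_RECORDS")):
    """Read the last computed metric values without triggering a recomputation.

    `client` is expected to behave like the object returned by
    dataiku.api_client(). This helper is hypothetical, not a Dataiku function.
    """
    records = []
    for project_key in project_keys:
        project = client.get_project(project_key)
        for item in project.list_datasets():
            name = item["name"]
            last = project.get_dataset(name).get_last_metric_values()
            record = {"project": project_key, "dataset": name}
            for metric_id in metric_ids:
                try:
                    record[metric_id] = last.get_global_value(metric_id)
                except Exception:
                    # Assumption: a lookup failure means the metric was
                    # never computed for this dataset.
                    record[metric_id] = None
            records.append(record)
    return records


# Inside DSS this could be called roughly as follows (not executed here):
#   import dataiku
#   client = dataiku.api_client()
#   results = collect_last_metrics(client, client.list_project_keys())
```

Since this only reads stored values, the numbers are as fresh as the last build or the last explicit metric computation, so it works best together with the compute-on-build setting described above.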


You mentioned a difference between this size and the one shown under 'status' for a dataset.

Can you elaborate and share an example? This may be expected, since the size in memory in pandas may not be the same as the size on disk, for example.
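To illustrate why the two numbers can diverge, here is a small standalone sketch (no DSS involved) comparing the pandas in-memory footprint of a DataFrame with the size of the same data serialized as CSV. The column names and row count are made up for the example, and CSV is only a stand-in for whatever storage format the dataset actually uses.

```python
import io

import pandas as pd

# Toy data: one integer column and one string column (hypothetical names).
df = pd.DataFrame({
    "id": range(1000),
    "label": [f"row-{i}" for i in range(1000)],
})

# In-memory footprint, as computed in the question (deep=True counts string contents).
mem_bytes = int(df.memory_usage(deep=True).sum())

# Size of the same data written out as CSV text.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
disk_bytes = len(buffer.getvalue().encode("utf-8"))

print(f"in memory: {mem_bytes} bytes, as CSV: {disk_bytes} bytes")
```

The in-memory figure includes per-value overhead that a file on disk does not carry, and on-disk formats may also be compressed, so a gap between memory_usage() and the size shown in the Status tab is not surprising by itself.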

 
