How to get row count & dataset size using project.get_dataset() API?
Hi Team,
We have a requirement to log the updated dataset size and row count for all datasets across different projects. We tried two approaches, both of which take a huge amount of computing time.
It would be really helpful if you can let us know if there is a better, optimized way to perform this given we will be spanning 100 projects.
- import dataiku
client = dataiku.api_client()
project = client.get_project(project_name)
dataset = project.get_dataset(dataset_name)
m = dataset.compute_metrics(metric_ids=['basic:SIZE'])
row = dataset.compute_metrics(metric_ids=['records:COUNT_RECORDS'])
- dt = dataiku.Dataset(dataset_name)
df = dt.get_dataframe()
rows = sum(1 for i in dt.iter_rows())
#in KB, but there is a huge difference between this size and the one shown under 'status' for a dataset
#rows = df.shape[0]
size_dt = df.memory_usage(deep=True)
total_size = size_dt.sum()*0.001
- We also tried running the code in a Kubernetes container, but it made no difference to the execution time.
Best Answer
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,226 Dataiker
Hi @nmadhu20,
Computing metrics can be a heavy operation depending on the dataset size and type. There is no way to really speed up compute_metrics() in general.
Your observation that running in K8s makes no difference to the speed is expected.
One possible suggestion would be to compute these metrics automatically after the datasets are built. This would spread the computation across the times when each dataset is built.
This avoids calling compute_metrics(); instead you use the last values computed when the dataset was built. You can enable this from the Status tab, or through the dataset settings in the API, for the datasets you need.
You mentioned a difference between this size and the one shown under 'status' for a dataset. Can you elaborate and share an example? This may be expected, since the size in memory in pandas may not be the same as the size on disk, for example.
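To illustrate the idea of reusing the last computed values instead of recomputing them, here is a minimal sketch. It assumes the public API's get_last_metric_values() on a dataset and get_global_value() on the returned object behave as documented for your DSS version; the helper name collect_last_metrics and the shape of the returned rows are my own, not part of Dataiku. A metric that has never been computed is reported as None.

```python
def collect_last_metrics(client, project_keys,
                         metric_ids=("basic:SIZE", "records:COUNT_RECORDS")):
    """Read the last computed metric values without triggering a new computation.

    `client` is expected to behave like dataiku.api_client().
    collect_last_metrics is a hypothetical helper, not a Dataiku function.
    """
    results = []
    for project_key in project_keys:
        project = client.get_project(project_key)
        for item in project.list_datasets():
            dataset_name = item["name"]
            dataset = project.get_dataset(dataset_name)
            # Only reads stored values; nothing is recomputed here.
            last_values = dataset.get_last_metric_values()
            row = {"project": project_key, "dataset": dataset_name}
            for metric_id in metric_ids:
                try:
                    row[metric_id] = last_values.get_global_value(metric_id)
                except Exception:
                    # Assumption: a metric that was never computed raises here.
                    row[metric_id] = None
            results.append(row)
    return results

# Usage inside DSS (not run here):
#   import dataiku
#   client = dataiku.api_client()
#   rows = collect_last_metrics(client, client.list_project_keys())
```

Since this only reads stored values, the cost per dataset should be a small API call rather than a scan of the data, but the values are only as fresh as the last build or metric computation.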
Answers
-
AntonB Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS Core Concepts, Dataiku DSS Adv Designer, Registered, Neuron 2022 Posts: 7 ✭✭✭✭
Hi @AlexT,
Thanks, that sounds like a solution we will use as well. There is only one catch that I am trying to figure out: how to enable auto-compute after build for the selected metrics on all datasets across multiple projects. I am browsing through this site but have not pinpointed it yet: Python APIs — Dataiku DSS 10.0 documentation
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,226 Dataiker
Hi @AntonB,
You can change auto-compute after build via the API using something like the code below (I tested this on DSS 10.0.5).
This enables auto-compute after build only for the probe of type records (it is at index [1] in the list of probes). For other probes, you need to adjust the index [1] in the code.
import dataiku
import json
client = dataiku.api_client()
current_project = client.get_default_project()
all_datasets = current_project.list_datasets(as_type="object")
for dataset in all_datasets:
    settings = dataset.get_settings()
    settings_raw = settings.get_raw()
    settings_raw['metrics']['probes'][1]['computeOnBuildMode'] = 'PARTITION'
    settings.save()
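For the multi-project case asked about above, the same idea could be wrapped in a loop over project keys. This is only a sketch under a few assumptions: that each probe in the raw settings carries a "type" field (so the probe can be matched by type rather than by its position in the list), that the "metrics" and "probes" keys are present as in the snippet above, and that 'PARTITION' is the mode you want. The helper name enable_compute_on_build is hypothetical, and this variant has not been tested on a DSS instance.

```python
def enable_compute_on_build(client, project_keys,
                            probe_type="records", mode="PARTITION"):
    """Set computeOnBuildMode on matching probes for every dataset.

    `client` is expected to behave like dataiku.api_client().
    Returns the (project_key, dataset_name) pairs whose settings were saved.
    enable_compute_on_build is a hypothetical helper, not a Dataiku function.
    """
    changed = []
    for project_key in project_keys:
        project = client.get_project(project_key)
        for item in project.list_datasets():
            dataset_name = item["name"]
            settings = project.get_dataset(dataset_name).get_settings()
            raw = settings.get_raw()
            probes = raw.get("metrics", {}).get("probes", [])
            touched = False
            for probe in probes:
                # Assumption: probes expose their kind under the "type" key.
                if probe.get("type") == probe_type and probe.get("computeOnBuildMode") != mode:
                    probe["computeOnBuildMode"] = mode
                    touched = True
            if touched:
                # Save only when something actually changed.
                settings.save()
                changed.append((project_key, dataset_name))
    return changed

# Usage inside DSS (not run here):
#   import dataiku
#   client = dataiku.api_client()
#   enable_compute_on_build(client, client.list_project_keys())
```

Matching on the probe type avoids depending on the probe order, and skipping the save when nothing changed keeps the number of settings writes down when run across many projects.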
Hope this helps!