Cannot compute metrics on S3 datasets automatically

rmnvncnt · ‎04-05-2018

I'm currently trying to keep track of metrics on a partitioned S3 dataset. I've enabled the "Autocompute after build" option, but the metrics are not recomputed when the dataset changes. I've also tried to compute the metrics through the API (here I'm interested in the record count):


import dataiku

client = dataiku.api_client()
current_project = client.get_project('project')
sales_s3 = current_project.get_dataset('dataset')
sales_s3.compute_metrics('records:COUNT_RECORDS')

But I get the following error:


DataikuException: java.lang.IllegalArgumentException: For `records:COUNT_RECORDS': Invalid partition identifier, has 1 dimensions, expected 2

Is there a workaround?

Here is what the partition screen looks like:

Alex_Combessie · ‎04-10-2018

Hi,

Ah! This is an interesting topic. There was one small thing missing in your code 🙂

When working on partitioned datasets, the compute_metrics method expects to know which partitions to work on. Hence the correct syntax is:


sales_s3.compute_metrics(partition='2018-03-10', 
                        metric_ids=['records:COUNT_RECORDS'])

Note the use of [] around metric_ids. It has to be a list, which means you can compute several metrics in one go for a given partition. To get the current list of partitions, you should use:


sales_s3.list_partitions()
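
The error in the question ("has 1 dimensions, expected 2") suggests the dataset is partitioned on two dimensions, in which case a single date would not be a complete identifier. Here is a minimal sketch of building such an identifier, assuming the DSS convention of joining one value per dimension with "|". The helper name partition_id and the second value "FR" are hypothetical; the identifiers returned by list_partitions() on your dataset are the reference.

```python
# Hypothetical helper, not part of the dataiku API.
# Assumption: DSS partition identifiers join one value per dimension with "|".

def partition_id(*dimension_values):
    """Join one value per partition dimension into a single identifier string."""
    if not dimension_values:
        raise ValueError("at least one dimension value is required")
    return "|".join(str(value) for value in dimension_values)


# Example with a date dimension and a made-up second dimension value.
example = partition_id("2018-03-10", "FR")
print(example)
```

The resulting string would then be passed as the partition argument of compute_metrics.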

To compute the metric for the whole dataset, simply pass "ALL" as the partition:


sales_s3.compute_metrics(partition='ALL', 
                        metric_ids=['records:COUNT_RECORDS'])

Pro-tip: when prototyping code inside a Jupyter notebook, the shortcut Shift+Tab opens a tooltip box with the documentation of the classes and methods you are using. There are many useful tricks in Jupyter; have a look at https://www.cheatography.com/weidadeyue/cheat-sheets/jupyter-notebook/pdf_bw/.

Cheers,

Alex

rmnvncnt · ‎04-10-2018

Great! Thanks Alex! Does that mean that passing "ALL" as partition argument computes the metrics globally and for each partition independently?

Alex_Combessie · ‎04-10-2018

"ALL" means computing the metrics for the whole dataset. It will not compute metrics for each partition. For that, you need to list the partitions and compute the metric explicitly for each one.
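
A minimal sketch of that per-partition loop, assuming only the two calls already used in this thread (list_partitions() and compute_metrics(partition=..., metric_ids=...)). The function name compute_metrics_per_partition is hypothetical, and the shape of each returned value is whatever compute_metrics gives back.

```python
def compute_metrics_per_partition(dataset, metric_ids):
    """Compute the given metrics on each partition, one call per partition.

    `dataset` is expected to expose list_partitions() and
    compute_metrics(partition=..., metric_ids=...), as used in this thread.
    Returns a dict mapping each partition identifier to the value
    returned by compute_metrics for that partition.
    """
    results = {}
    for partition in dataset.list_partitions():
        results[partition] = dataset.compute_metrics(
            partition=partition, metric_ids=metric_ids
        )
    return results


# Usage inside DSS (not run here), with sales_s3 defined as earlier in the thread:
# per_partition = compute_metrics_per_partition(sales_s3, ['records:COUNT_RECORDS'])
# whole_dataset = sales_s3.compute_metrics(partition='ALL',
#                                          metric_ids=['records:COUNT_RECORDS'])
```

The calls run sequentially, one partition at a time; the separate "ALL" call is still needed for the dataset-wide value.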

rmnvncnt · ‎04-11-2018

Alright, thanks a lot!

tanguy · ‎04-25-2023

Hi,

I am trying to do exactly the same thing, except that I want to compute metrics (actually a record count) for several partitions simultaneously. I did not manage to do this (I get the same error as the OP), so I had to compute the metrics sequentially in a for loop.

See the screenshot below for illustration:

Am I doing things wrong or is it currently not possible to compute metrics for several partitions simultaneously?

Labels

Cloud storage

Metrics & checks
