Create a check to compare record count between two datasets

Solved!
HeteshPatel
Level 2

Hi, I am currently creating a dashboard that highlights basic checks such as row counts and column counts. I would also like to add a check that compares a metric across two datasets, e.g. the row count of one dataset against the row count of another.

The reason for this check is to confirm that the one-to-one cardinality is preserved after all the transformations and joins.

I am not sure how to reference another metric or another dataset from within the checks page.

1 Solution
fchataigner2
Dataiker

Hi,

You can import `dataiku` in a Python check and use it to read the other dataset's metrics for the comparison. For example:

import dataiku


def process(last_values, dataset, partition_id):
    # Record count of the dataset this check is attached to (0 if the metric is missing from last_values)
    this_dataset_record_count_metric = last_values.get('records:COUNT_RECORDS')
    this_dataset_record_count = int(this_dataset_record_count_metric.get_value()) if this_dataset_record_count_metric is not None else 0

    # Last computed record count of the dataset to compare against
    other_dataset = dataiku.Dataset("train_set")
    other_dataset_record_count = other_dataset.get_last_metric_values().get_global_value('records:COUNT_RECORDS')

    # The check fails when the two counts differ
    if this_dataset_record_count != other_dataset_record_count:
        return 'ERROR', 'record counts: %s <-> %s' % (this_dataset_record_count, other_dataset_record_count)
    else:
        return 'OK', 'record counts: %s' % this_dataset_record_count
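
If it helps to try the comparison outside of DSS, the decision logic can be pulled into a small pure function. This is only a sketch: `compare_record_counts` is a hypothetical helper name, not part of the Dataiku API, and treating a missing count as an error (rather than defaulting to 0 as above) is an assumption you may want to change.

```python
def compare_record_counts(this_count, other_count):
    """Return a (status, message) tuple in the same shape as the check above.

    Hypothetical helper, not part of the Dataiku API.
    Assumption: a missing count (None) is reported as an error.
    """
    if this_count is None or other_count is None:
        return 'ERROR', 'missing record count: %s <-> %s' % (this_count, other_count)

    # Metric values may arrive as strings, so normalise both sides to int
    this_count = int(this_count)
    other_count = int(other_count)

    if this_count != other_count:
        return 'ERROR', 'record counts: %s <-> %s' % (this_count, other_count)
    return 'OK', 'record counts: %s' % this_count


if __name__ == "__main__":
    print(compare_record_counts(100, "100"))
    print(compare_record_counts(100, 98))
    print(compare_record_counts(None, 98))
```

Inside the check, `process` would then only fetch the two values and return the result of this function.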

HeteshPatel
Level 2
Author

Perfect, that works great! Thanks @fchataigner2