Create a check to compare record count between two datasets

Hetesh
Hetesh Registered Posts: 13 ✭✭✭✭

Hi, I am currently building a dashboard that highlights basic checks on row counts and column counts. I would also like to add a check that compares a metric across two datasets, e.g. the row count of one dataset against the row count of another.

The reason for this check is to make sure a one-to-one cardinality is preserved after all the transformations and joins.

I am not sure how to reference another metric or another dataset from the checks page.

Best Answer

  • fchataigner2
    fchataigner2 Dataiker Posts: 355 Dataiker
    edited July 17 Answer ✓

    Hi,

    You can import `dataiku` in a Python check and read the other dataset's last computed metrics for the comparison. For example:

    import dataiku

    def process(last_values, dataset, partition_id):
        # Record count of the dataset this check runs on (falls back to 0 if the metric is not available)
        this_dataset_record_count_metric = last_values.get('records:COUNT_RECORDS')
        this_dataset_record_count = int(this_dataset_record_count_metric.get_value()) if this_dataset_record_count_metric is not None else 0

        # Last computed record count of the dataset to compare against
        other_dataset = dataiku.Dataset("train_set")
        other_dataset_record_count = int(other_dataset.get_last_metric_values().get_global_value('records:COUNT_RECORDS'))

        # Fail the check when the two counts differ
        if this_dataset_record_count != other_dataset_record_count:
            return 'ERROR', 'record counts: %s <-> %s' % (this_dataset_record_count, other_dataset_record_count)
        else:
            return 'OK', 'record counts: %s' % this_dataset_record_count
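
If you want to reuse the comparison in several checks, or test it outside DSS, the comparison itself can be pulled out into a plain function. This is only a minimal sketch: `compare_record_counts` is a hypothetical helper name, not part of the Dataiku API. It also assumes you would rather treat a missing metric as an error than fall back to 0, which is a different choice from the check above.

```python
def compare_record_counts(this_count, other_count):
    """Return a (status, message) tuple in the same shape as a Python check.

    Hypothetical helper for illustration; not part of the Dataiku API.
    """
    # Assumption: a missing metric should fail the check instead of being read as 0
    if this_count is None or other_count is None:
        return 'ERROR', 'record count missing: %s <-> %s' % (this_count, other_count)

    # Metric values may arrive as strings, so normalise both sides to int
    this_count, other_count = int(this_count), int(other_count)

    if this_count != other_count:
        return 'ERROR', 'record counts: %s <-> %s' % (this_count, other_count)
    return 'OK', 'record counts: %s' % this_count
```

Inside `process` you would then return `compare_record_counts(...)` with the two values read from the metrics, passing `None` for a metric that has not been computed.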
