Recommended Way for Comparing Metrics between Multiple Flow Datasets in a Check

adamnieto
adamnieto Neuron 2020, Neuron, Registered, Neuron 2021, Neuron 2022, Neuron 2023 Posts: 87 Neuron

Hello,

I was wondering if there is any recommended way to create a check for a dataset that can reference metrics calculated from another dataset in the flow (most likely an ancestor dataset).

For instance, I would like to check whether the parent dataset and the child dataset differ in the min and max values of a particular column.

Thank you for your help!

Best Answer

  • VinceDS
    VinceDS Dataiker, Alpha Tester, Dataiku DSS Core Designer Posts: 45 Dataiker
    edited July 17 Answer ✓

    Hi,

    To achieve this, I believe you will need to create a custom check using Python. You can retrieve metric values for various datasets in your project using the dataikuapi package. Below is some starter code (for use in a notebook or recipe, for example) that shows how to use the API to compare metric values between a source dataset and a target dataset:

    import dataiku

    ## Context
    client = dataiku.api_client()
    proj = client.get_project(dataiku.default_project_key())
    source_dataset = 'tweets_scored_aggregate'
    target_dataset = 'tweets_scored_aggregate_prepared'

    ## Metric IDs
    record_count = 'records:COUNT_RECORDS'
    unique_values = 'col_stats:COUNT_DISTINCT:text'  # column-level example, not used below

    ## Retrieve the last value of the metric on each dataset and compute the difference
    source_metrics = proj.get_dataset(source_dataset).get_last_metric_values()
    source_record_count = source_metrics.get_metric_by_id(record_count)['lastValues'][0]['value']
    print(source_record_count)

    target_metrics = proj.get_dataset(target_dataset).get_last_metric_values()
    target_record_count = target_metrics.get_metric_by_id(record_count)['lastValues'][0]['value']
    print(target_record_count)

    # Metric values may come back as strings, so cast before subtracting
    gap = float(source_record_count) - float(target_record_count)
    print(gap)

    Then, in your custom check, you can set the business rule for the check outcome based on the 'gap' value.
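
As a minimal sketch of such a business rule, the gap can be mapped to a check outcome using thresholds. The function name `outcome_from_gap` and both thresholds are hypothetical and only for illustration. The `process(last_values, dataset, partition_id)` wrapper in the comment follows the template DSS generates for custom Python checks; it is left commented out so the snippet also runs outside DSS.

```python
def outcome_from_gap(source_value, target_value, warn_at=0.0, error_at=100.0):
    """Map the absolute gap between two metric values to a check outcome.

    warn_at and error_at are illustrative thresholds, not DSS defaults.
    """
    # Metric values may arrive as strings, so cast before comparing.
    gap = abs(float(source_value) - float(target_value))
    if gap > error_at:
        return 'ERROR', 'gap of %s exceeds %s' % (gap, error_at)
    if gap > warn_at:
        return 'WARNING', 'gap of %s exceeds %s' % (gap, warn_at)
    return 'OK', 'gap of %s is within tolerance' % gap


# Inside a DSS custom Python check, the rule could be wired in roughly like this,
# with source_record_count and target_record_count retrieved as in the code above:
#
# def process(last_values, dataset, partition_id):
#     return outcome_from_gap(source_record_count, target_record_count)

print(outcome_from_gap(1000, 1000))
print(outcome_from_gap(1000, 990))
print(outcome_from_gap(1000, 500))
```

The same function would work for min and max values of a column by passing those metric values instead of the record counts.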

    Hope this helps

Answers

  • adamnieto
    adamnieto Neuron 2020, Neuron, Registered, Neuron 2021, Neuron 2022, Neuron 2023 Posts: 87 Neuron

    Thank you for your help!

  • danf101
    danf101 Partner, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 4 Partner

    Hi,

    I don't know where to submit a feature request, but I feel this is a common scenario (i.e., checking the validity of new data against an initial dataset). It would be great if there were a dropdown option to check whether a value falls within ±2 standard deviations of precomputed statistics, rather than only comparing against constant values.
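
The rule described here could be sketched in plain Python along these lines, assuming the earlier metric values are already available as a plain list. `sigma_check` and `history` are hypothetical names, not part of DSS, and the returned tuple simply mirrors the outcome-plus-message shape of a custom Python check.

```python
import statistics


def sigma_check(new_value, history, k=2.0):
    """Flag new_value when it lies outside mean +/- k standard deviations of history."""
    mean = statistics.mean(history)
    std = statistics.stdev(history)  # sample standard deviation; needs at least two values
    lower, upper = mean - k * std, mean + k * std
    if lower <= new_value <= upper:
        return 'OK', '%s is within [%.2f, %.2f]' % (new_value, lower, upper)
    return 'WARNING', '%s is outside [%.2f, %.2f]' % (new_value, lower, upper)


history = [100, 102, 98, 101, 99]  # made-up values for illustration
print(sigma_check(101, history))
print(sigma_check(110, history))
```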

  • CoreyS
    CoreyS Dataiker Alumni, Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Registered Posts: 1,150 ✭✭✭✭✭✭✭✭✭

    Hi @danf101,
    Please feel free to use the Product Ideas board, which is there to let you share and exchange ideas on how to improve Dataiku. Here is a resource to help you get started: Suggest an idea

    I hope this helps!
