Custom check to determine if a column's data is unique (has no duplicates)

Registered Posts: 7 ✭✭✭✭

I would like to run a check that fails if a column "col1" in my dataset has duplicate values. In the Metrics tab I am computing the "Distinct value count" on col1 and "Records Counts" on the table. How do I write a custom Python check that compares the two, so that the check passes only when the "Distinct value count" on col1 equals "Records Counts" (meaning col1 is unique)?

Best Answer

  • Alpha Tester, Dataiker Alumni Posts: 539 ✭✭✭✭✭✭✭✭✭
    edited July 2024 Answer ✓

    Hi,

    Here is an example of such a Python check:


    # Define here a function that returns the outcome of the check.
    def process(last_values, dataset, partition_id):
        # last_values is a dict of the last values of the metrics,
        # with the values as a dataiku.metrics.MetricDataPoint.
        # dataset is a dataiku.Dataset object
        # Total number of records in the dataset
        count_records = last_values["records:COUNT_RECORDS"].get_value()
        # Number of distinct values in the column to test
        count_distinct = last_values["col_stats:COUNT_DISTINCT:<PUT_YOUR_COLUMN_NAME_HERE>"].get_value()
        # The column is unique only if both counts match
        if count_records == count_distinct:
            return 'OK', "no duplicate"
        else:
            return "ERROR", "duplicates"

    [EDIT] I had forgotten to call the get_value() method on last_values["..."]
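
To try the logic outside DSS before pasting it into the check editor, a minimal local sketch could look like the following. `FakeMetricPoint` and `fake_values` are hypothetical helpers written only for this dry run (they are not part of the Dataiku API); the sketch assumes the column is named `col1`, as in the question, and that `get_value()` returns comparable values for both metrics.

```python
class FakeMetricPoint:
    """Hypothetical stand-in for dataiku.metrics.MetricDataPoint (local testing only)."""

    def __init__(self, value):
        self._value = value

    def get_value(self):
        return self._value


COLUMN = "col1"  # column name taken from the question


def process(last_values, dataset, partition_id):
    # Same comparison as in the answer above
    count_records = last_values["records:COUNT_RECORDS"].get_value()
    count_distinct = last_values["col_stats:COUNT_DISTINCT:" + COLUMN].get_value()
    if count_records == count_distinct:
        return "OK", "no duplicate"
    return "ERROR", "duplicates"


def fake_values(records, distinct):
    # Build a dict shaped like last_values, keyed by metric id
    return {
        "records:COUNT_RECORDS": FakeMetricPoint(records),
        "col_stats:COUNT_DISTINCT:" + COLUMN: FakeMetricPoint(distinct),
    }


# dataset and partition_id are unused by this check, so None is enough here
unique_result = process(fake_values(100, 100), None, None)
duplicate_result = process(fake_values(100, 97), None, None)
print(unique_result)
print(duplicate_result)
```

Only the `process` function would go into the DSS check; the rest is scaffolding for the dry run.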

Answers

  • Registered Posts: 2

    Thanks Alex, worked like a charm. If I am not mistaken, the distinct count metric for that column also has to be computed on each run for this to work.
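
Following up on that point, here is a hedged sketch of a variant that reports a missing metric explicitly instead of failing with a `KeyError`. It assumes `last_values` behaves like a plain dict (as the comment in the answer describes), that the column is `col1`, and that `"WARNING"` is an acceptable outcome alongside `"OK"` and `"ERROR"`; the constants `RECORDS_KEY` and `DISTINCT_KEY` are names introduced here for illustration.

```python
RECORDS_KEY = "records:COUNT_RECORDS"
DISTINCT_KEY = "col_stats:COUNT_DISTINCT:col1"  # assumes the column is named col1


def process(last_values, dataset, partition_id):
    # Report which required metrics have not been computed yet
    missing = [key for key in (RECORDS_KEY, DISTINCT_KEY) if key not in last_values]
    if missing:
        return "WARNING", "metric(s) not computed: " + ", ".join(missing)

    count_records = last_values[RECORDS_KEY].get_value()
    count_distinct = last_values[DISTINCT_KEY].get_value()
    if count_records == count_distinct:
        return "OK", "no duplicate"
    return "ERROR", "duplicates"
```

Whether a missing metric should be a warning or an error is a design choice; returning `"ERROR"` instead would make the check fail until both metrics are computed.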
