Flagging Duplicate Rows

unkxowx
unkxowx Registered Posts: 19

Scenario:

I have a dataset named "XYZ". It has 5 columns: ID, A, B, C, D. I have uploaded this dataset to dataiku. Now i want to create a new column named "DUPLICATE_FLAG" that will have value of 1 if a duplicate record exists or 0 if does not exist. Rows will be compared and values will be added if everything except ID matches for two or more rows.
Is this possible in the Prepare Recipe? I tried python (not very proficient in that) but couldn't figure out.

Best Answer

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,166 Neuron
    edited July 17 Answer ✓

    This should do it. Just insert your input and output dataset names:

    import dataiku
    import pandas as pd, numpy as np
    from dataiku import pandasutils as pdu
    
    # Read recipe inputs
    duplicates = dataiku.Dataset("your_input_dataset_id")
    duplicates_df = duplicates.get_dataframe()
    
    duplicates_df['Duplicate'] = np.where(duplicates_df.duplicated(keep=False), 'Yes','No')
    duplicates_output_df = duplicates_df 
    
    # Write recipe outputs
    duplicates_output = dataiku.Dataset("your_output_dataset_id")
    duplicates_output.write_with_schema(duplicates_output_df)

Answers

  • unkxowx
    unkxowx Registered Posts: 19

    Hey, thank you for this prompt response. Quick questions:
    1) I am assuming this goes into a python recipe and nothing like that can be done in a prepare recipe?
    2) In cases where I want to check duplicates on just a certain column combination, how will that work? For e.g. in SQL if I have 5 columns: ID, A, B, C, D, i can do something like:

    select
    A, B, C, count(*) as cnt
    from table
    group by A, B, C
    having cnt > 1

    In this case, I get the number of combinations A, B, C have with varying values of D (another case of duplication), and now of course we are doing a flag on these.

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,166 Neuron

    You can do this easily with a Group By recipe, then select all the columns you want to compare as the Group Keys and leave the defaults. This will give you an output with all the duplicate records by your grouped columns and their count. Then join this output to the original dataset (the one you used as input to the Group By recipe) to add the count column to the original dataset. Join by all the columns in the Group By output dataset. With the original dataset on the left side of the join you should use the Left Join join type.

    Capture.PNG

  • unkxowx
    unkxowx Registered Posts: 19

    ah right, totally forgot about the group by recipe, but this makes a lot of sense. I believe this would cut down the need for python duplicates, right? As instead of flagging, there'll be a "count"

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,166 Neuron

    Yes, but it depends what you want to show. Up to you.

Setup Info
    Tags
      Help me…