Flagging Duplicate Rows

unkxowx
Level 2

Scenario:

I have a dataset named "XYZ" with 5 columns: ID, A, B, C, D, which I have uploaded to Dataiku. I want to create a new column named "DUPLICATE_FLAG" with the value 1 if the row is a duplicate and 0 otherwise. Two or more rows count as duplicates if they match on every column except ID.
 
Is this possible in the Prepare recipe? I tried Python (I'm not very proficient in it) but couldn't figure it out.
 
 
Turribeach

This should do it. Just insert your input and output dataset names:

import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu

# Read recipe inputs
duplicates = dataiku.Dataset("your_input_dataset_id")
duplicates_df = duplicates.get_dataframe()

# Flag rows that match another row on every column except ID, as 1/0
compare_cols = [c for c in duplicates_df.columns if c != 'ID']
duplicates_df['DUPLICATE_FLAG'] = np.where(duplicates_df.duplicated(subset=compare_cols, keep=False), 1, 0)
duplicates_output_df = duplicates_df

# Write recipe outputs
duplicates_output = dataiku.Dataset("your_output_dataset_id")
duplicates_output.write_with_schema(duplicates_output_df)

 


5 Replies
unkxowx
Level 2
Author

Hey, thank you for the prompt response. Quick questions:
1) I am assuming this goes into a Python recipe, and nothing like that can be done in a Prepare recipe?
2) In cases where I want to check duplicates on just a certain column combination, how would that work? For example, in SQL with 5 columns ID, A, B, C, D, I can do something like:

select
    A, B, C, count(*) as cnt
from table
group by A, B, C
having count(*) > 1

In this case, I get the (A, B, C) combinations that appear more than once, possibly with varying values of D (another case of duplication), and those are the rows I want to flag.
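The SQL above maps directly onto pandas' `duplicated` with a `subset` argument. A minimal sketch on toy data (the column names come from the question; the values are made up for illustration):

```python
import pandas as pd

# Toy frame standing in for the XYZ dataset
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "A": ["x", "x", "y", "z"],
    "B": [1, 1, 2, 2],
    "C": ["p", "p", "q", "q"],
    "D": [10, 20, 30, 30],
})

# Equivalent of GROUP BY A, B, C HAVING COUNT(*) > 1:
# duplicated(subset=..., keep=False) marks every row whose (A, B, C)
# combination appears more than once, regardless of ID or D.
df["DUPLICATE_FLAG"] = df.duplicated(subset=["A", "B", "C"], keep=False).astype(int)
print(df)
```

Here rows 1 and 2 share (x, 1, p) despite different D values, so they are flagged 1; the other rows are flagged 0.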

Turribeach

You can do this easily with a Group By recipe: select the columns you want to compare as the group keys and leave the defaults. The output will have one row per grouped combination together with its count. Then join that output back to the original dataset (the one you used as input to the Group By recipe) to add the count column, joining on the group key columns. With the original dataset on the left side of the join, use the Left Join join type.
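The Group By + Join flow described above can be sketched in pandas terms (a rough equivalent, not the recipes themselves; dataset and column names are assumptions):

```python
import pandas as pd

# Stand-in for the original dataset
xyz = pd.DataFrame({
    "ID": [1, 2, 3],
    "A": ["x", "x", "y"],
    "B": [1, 1, 2],
    "C": ["p", "p", "q"],
    "D": [10, 20, 30],
})

# Group By recipe: one row per (A, B, C) combination with its count
counts = (
    xyz.groupby(["A", "B", "C"], as_index=False)
    .size()
    .rename(columns={"size": "count"})
)

# Join recipe: left-join the counts back onto the original dataset
joined = xyz.merge(counts, on=["A", "B", "C"], how="left")
print(joined)
```

Every original row survives the left join, and rows whose (A, B, C) combination occurs more than once carry a count greater than 1.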


unkxowx
Level 2
Author

Ah right, I totally forgot about the Group By recipe, but this makes a lot of sense. I believe this would cut out the need for the Python duplicate flagging, right? Instead of a flag, there'll be a "count".

Turribeach

Yes, but it depends what you want to show. Up to you.
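If the 1/0 flag is still wanted downstream, it can be derived from the count in one line (a sketch, assuming the joined output carries a `count` column as in the Group By + Join flow above):

```python
import pandas as pd

# Hypothetical output of the Group By + Join flow
joined = pd.DataFrame({"ID": [1, 2, 3], "count": [2, 2, 1]})

# A count greater than 1 means the row's key combination is duplicated
joined["DUPLICATE_FLAG"] = (joined["count"] > 1).astype(int)
```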
