Flagging Duplicate Rows

Solved!
unkxowx
Level 2
Flagging Duplicate Rows

Scenario:

I have a dataset named "XYZ". It has 5 columns: ID, A, B, C, D. I have uploaded this dataset to dataiku. Now i want to create a new column named "DUPLICATE_FLAG" that will have value of 1 if a duplicate record exists or 0 if does not exist. Rows will be compared and values will be added if everything except ID matches for two or more rows.
 
Is this possible in the Prepare Recipe? I tried python (not very proficient in that) but couldn't figure out.
 
 
0 Kudos
1 Solution
Turribeach

This should do it. Just insert your input and output dataset names:

import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu

# Read recipe inputs
duplicates = dataiku.Dataset("your_input_dataset_id")
duplicates_df = duplicates.get_dataframe()

duplicates_df['Duplicate'] = np.where(duplicates_df.duplicated(keep=False), 'Yes','No')
duplicates_output_df = duplicates_df 

# Write recipe outputs
duplicates_output = dataiku.Dataset("your_output_dataset_id")
duplicates_output.write_with_schema(duplicates_output_df)

 

View solution in original post

0 Kudos
5 Replies
Turribeach

This should do it. Just insert your input and output dataset names:

import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu

# Read recipe inputs
duplicates = dataiku.Dataset("your_input_dataset_id")
duplicates_df = duplicates.get_dataframe()

duplicates_df['Duplicate'] = np.where(duplicates_df.duplicated(keep=False), 'Yes','No')
duplicates_output_df = duplicates_df 

# Write recipe outputs
duplicates_output = dataiku.Dataset("your_output_dataset_id")
duplicates_output.write_with_schema(duplicates_output_df)

 

0 Kudos
unkxowx
Level 2
Author

Hey, thank you for this prompt response. Quick questions:
1) I am assuming this goes into a python recipe and nothing like that can be done in a prepare recipe? 
2) In cases where I want to check duplicates on just a certain column combination, how will that work? For e.g. in SQL if I have 5 columns: ID, A, B, C, D, i can do something like:

select
A, B, C, count(*) as cnt
from table
group by A, B, C
having cnt > 1

In this case, I get the number of combinations A, B, C have with varying values of D (another case of duplication), and now of course we are doing a flag on these.

0 Kudos

You can do this easily with a Group By recipe, then select all the columns you want to compare as the Group Keys and leave the defaults. This will give you an output with all the duplicate records by your grouped columns and their count. Then join this output to the original dataset (the one you used as input to the Group By recipe) to add the count column to the original dataset. Join by all the columns in the Group By output dataset. With the original dataset on the left side of the join you should use the Left Join join type. 

 

Capture.PNG

0 Kudos
unkxowx
Level 2
Author

 ah right, totally forgot about the group by recipe, but this makes a lot of sense. I believe this would cut down the need for python duplicates, right? As instead of flagging, there'll be a "count"

0 Kudos

Yes, but it depends what you want to show. Up to you.

0 Kudos