Flagging Duplicate Rows
Scenario:
Best Answer
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,128 Neuron
This should do it. Just insert your input and output dataset names:
import dataiku import pandas as pd, numpy as np from dataiku import pandasutils as pdu # Read recipe inputs duplicates = dataiku.Dataset("your_input_dataset_id") duplicates_df = duplicates.get_dataframe() duplicates_df['Duplicate'] = np.where(duplicates_df.duplicated(keep=False), 'Yes','No') duplicates_output_df = duplicates_df # Write recipe outputs duplicates_output = dataiku.Dataset("your_output_dataset_id") duplicates_output.write_with_schema(duplicates_output_df)
Answers
-
Hey, thank you for this prompt response. Quick questions:
1) I am assuming this goes into a python recipe and nothing like that can be done in a prepare recipe?
2) In cases where I want to check duplicates on just a certain column combination, how will that work? For e.g. in SQL if I have 5 columns: ID, A, B, C, D, i can do something like:
select
A, B, C, count(*) as cnt
from table
group by A, B, C
having cnt > 1
In this case, I get the number of combinations A, B, C have with varying values of D (another case of duplication), and now of course we are doing a flag on these. -
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,128 Neuron
You can do this easily with a Group By recipe, then select all the columns you want to compare as the Group Keys and leave the defaults. This will give you an output with all the duplicate records by your grouped columns and their count. Then join this output to the original dataset (the one you used as input to the Group By recipe) to add the count column to the original dataset. Join by all the columns in the Group By output dataset. With the original dataset on the left side of the join you should use the Left Join join type.
-
ah right, totally forgot about the group by recipe, but this makes a lot of sense. I believe this would cut down the need for python duplicates, right? As instead of flagging, there'll be a "count"
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,128 Neuron
Yes, but it depends what you want to show. Up to you.