Sign up to take part
Registered users can ask their own questions, contribute to discussions, and be part of the Community!
Registered users can ask their own questions, contribute to discussions, and be part of the Community!
Scenario:
This should do it. Just insert your input and output dataset names:
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu
# Read recipe inputs
duplicates = dataiku.Dataset("your_input_dataset_id")
duplicates_df = duplicates.get_dataframe()
duplicates_df['Duplicate'] = np.where(duplicates_df.duplicated(keep=False), 'Yes','No')
duplicates_output_df = duplicates_df
# Write recipe outputs
duplicates_output = dataiku.Dataset("your_output_dataset_id")
duplicates_output.write_with_schema(duplicates_output_df)
This should do it. Just insert your input and output dataset names:
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu
# Read recipe inputs
duplicates = dataiku.Dataset("your_input_dataset_id")
duplicates_df = duplicates.get_dataframe()
duplicates_df['Duplicate'] = np.where(duplicates_df.duplicated(keep=False), 'Yes','No')
duplicates_output_df = duplicates_df
# Write recipe outputs
duplicates_output = dataiku.Dataset("your_output_dataset_id")
duplicates_output.write_with_schema(duplicates_output_df)
Hey, thank you for this prompt response. Quick questions:
1) I am assuming this goes into a python recipe and nothing like that can be done in a prepare recipe?
2) In cases where I want to check duplicates on just a certain column combination, how will that work? For e.g. in SQL if I have 5 columns: ID, A, B, C, D, i can do something like:
select
A, B, C, count(*) as cnt
from table
group by A, B, C
having cnt > 1
In this case, I get the number of combinations A, B, C have with varying values of D (another case of duplication), and now of course we are doing a flag on these.
You can do this easily with a Group By recipe, then select all the columns you want to compare as the Group Keys and leave the defaults. This will give you an output with all the duplicate records by your grouped columns and their count. Then join this output to the original dataset (the one you used as input to the Group By recipe) to add the count column to the original dataset. Join by all the columns in the Group By output dataset. With the original dataset on the left side of the join you should use the Left Join join type.
ah right, totally forgot about the group by recipe, but this makes a lot of sense. I believe this would cut down the need for python duplicates, right? As instead of flagging, there'll be a "count"
Yes, but it depends what you want to show. Up to you.