Remove duplicate

dave · ‎02-23-2021

Hi,

I have gone through a few of the posts on removing duplicates, but none of them gives a clear answer.

Could you please show how I can use a column as a condition, so that when a value in that column repeats, the entire row with the repeated value is left out of the output?

K.Rgds,

Kalpesh

arnaudde · ‎02-23-2021

Hello,
Could you provide an example of your input column and your expected output, so that we can better understand what you want to do?

Best,
Arnaud

dave · ‎02-23-2021

@arnaudde ,

Here is a very simple example of what I would like to do with a visual recipe or the data preparation functionality:

A        B    C
123XHY   456  C001
456BCNH  123  T003
789YBQ   801  X009

I want to remove the rows that contain 123XHY.

Hope this helps.

Rgds,

Dave

arnaudde · ‎02-23-2021

Hello Dave,
If you want to remove duplicates based on only one column (i.e. column A in your example), there is no visual recipe solution, as described in this post. I would use a Python recipe with code like the sample below, as Matt suggested. You only need to replace "input_dataset", "my_key_column" and "output_dataset" with your own names.

# -*- coding: utf-8 -*-
import dataiku
import pandas as pd

# Recipe input: load the input dataset as a pandas DataFrame
df = dataiku.Dataset("input_dataset").get_dataframe()

# Keep the first row for each value of the key column and drop later repeats
df = df.drop_duplicates(subset=["my_key_column"])
# or, to compare all columns when looking for duplicates:
# df = df.drop_duplicates()

# Recipe output: write the deduplicated rows and set the output schema
dataiku.Dataset("output_dataset").write_with_schema(df)

If you want to remove duplicates based on all columns, you can either use the Distinct visual recipe or the commented-out line in the code sample above.
Best,
Arnaud
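
For readers who want to try the pandas call outside a recipe, here is a minimal sketch of how `drop_duplicates` treats a repeated key. The sample data is hypothetical: it is modeled on the A/B/C example earlier in the thread, with one extra row added so that 123XHY actually repeats in column A. It may also help to know that `keep="first"` (the default) retains one row per value, while `keep=False` drops every row whose value repeats, which could be closer to what was asked.

```python
import pandas as pd

# Hypothetical sample data modeled on the A/B/C example in this thread,
# with one extra row added so that the value 123XHY repeats in column A.
df = pd.DataFrame(
    {
        "A": ["123XHY", "456BCNH", "123XHY", "789YBQ"],
        "B": [456, 123, 999, 801],
        "C": ["C001", "T003", "Z000", "X009"],
    }
)

# keep="first" (the default): keep the first row for each value of A.
first_only = df.drop_duplicates(subset=["A"], keep="first")

# keep=False: drop every row whose value of A occurs more than once.
no_repeats = df.drop_duplicates(subset=["A"], keep=False)

print(first_only)
print(no_repeats)
```

In a Dataiku Python recipe, the same `keep` argument could be passed to the `drop_duplicates` call in the sample above.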
