Remove duplicate

Solved!
dave
Level 2
Hi,

I have gone through a few of the posts on removing duplicates, but none of them gives a clear answer.

Could you please show how I can use a column as a condition, so that when a value in that column repeats, the repeated value is no longer counted and its entire row is left out of the output?

 

K.Rgds,

Kalpesh

1 Solution
arnaudde
Dataiker

Hello Dave,
If you want to remove duplicates based on only one column (i.e. column A in your example), there is no visual recipe solution, as described in this post. I would use a Python recipe with code like this, as Matt suggested; you only need to replace "input_dataset", "my_key_column" and "output_dataset" in the sample.

# -*- coding: utf-8 -*-
import dataiku
import pandas as pd

# Recipe inputs: load the input dataset as a pandas DataFrame
df = dataiku.Dataset("input_dataset").get_dataframe()

# Keep the first row for each distinct value of the key column
df = df.drop_duplicates(subset=["my_key_column"])
# or
# df = df.drop_duplicates()
# to use all columns to compare for duplicates

# Recipe outputs: write the deduplicated rows together with their schema
dataiku.Dataset("output_dataset").write_with_schema(df)

 

If you want to remove duplicates based on all columns, you can either use the Distinct visual recipe or the commented line in the above code sample.
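
To make the behaviour concrete, here is a minimal sketch in plain pandas (no Dataiku needed) using the rows from the example table in this thread. The variable names first_kept and none_kept are only illustrative. It assumes you may want one of two things: keep one row per value of column A, or drop every row whose value in A is repeated. The keep argument of drop_duplicates switches between the two.

```python
import pandas as pd

# Sample rows copied from the example table in this thread.
df = pd.DataFrame(
    {
        "A": ["123XHY", "123XHY", "123XHY", "456BCNH", "789YBQ"],
        "B": [456, 456, 456, 123, 801],
        "C": ["C001", "C001", "C001", "T003", "X009"],
    }
)

# Default keep="first": one row survives for each distinct value of A.
first_kept = df.drop_duplicates(subset=["A"])

# keep=False: every row whose value of A occurs more than once is dropped.
none_kept = df.drop_duplicates(subset=["A"], keep=False)

print(first_kept["A"].tolist())  # ['123XHY', '456BCNH', '789YBQ']
print(none_kept["A"].tolist())   # ['456BCNH', '789YBQ']
```

In the recipe above, the default behaviour applies, so a single 123XHY row stays in the output. If no 123XHY row should remain at all, adding keep=False to the drop_duplicates call is one way to get that.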
Best,
Arnaud

3 Replies
arnaudde
Dataiker

Hello,
Could you provide an example of your input column and your expected output so that we can better understand what you want to do?

Best,
Arnaud

dave
Level 2
Author

@arnaudde ,

Here is a very simple example of what I am looking to do, using a visual recipe or the data preparation functionality:

A          B      C
123XHY     456    C001
123XHY     456    C001
123XHY     456    C001
456BCNH    123    T003
789YBQ     801    X009

I want to remove the rows that have the repeated value 123XHY.

Hope this helps.

 

Rgds,

Dave
