Remove duplicate

dave · February 2021

Hi,

I have gone through a few of the posts on removing duplicates, but none of them gives a clear answer.

Can you please show how I can use a column as the condition, so that if its value repeats, the repeated value stops being counted and its entire row is left out of the output?

K.Rgds,

Kalpesh

arnaudde · February 2021

Hello Dave,
If you want to remove duplicates based on only one column (i.e. column A in your example), there is no visual recipe solution, as described in this post. I would use a Python recipe with code like this, as Matt suggested; you only need to replace "input_dataset", "my_key_colum" and "output_dataset" in the sample.

# -*- coding: utf-8 -*-
import dataiku
import pandas as pd

# Recipe inputs
df = dataiku.Dataset("input_dataset").get_dataframe()

df.drop_duplicates(subset=["my_key_colum"], inplace=True)
# or
# df.drop_duplicates(inplace=True)
# to use all columns to compare for duplicates

# Recipe outputs
dataiku.Dataset("output_dataset").write_with_schema(df)

If you want to remove duplicates based on all columns, you can either use the Distinct visual recipe or the commented line in the above code sample.
Best,
Arnaud
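
For readers who want to try the single-column deduplication outside Dataiku first, here is a minimal sketch in plain pandas. The column names `key` and `value` and the sample rows are made up for illustration and do not come from the thread. It also shows the `keep` argument of `drop_duplicates`, which controls which of the repeated rows survives.

```python
import pandas as pd

# Hypothetical sample data: "key" and "value" are placeholder names.
df = pd.DataFrame({
    "key": ["a", "b", "a", "c", "b"],
    "value": [1, 2, 3, 4, 5],
})

# Default behaviour: keep the first row seen for each key.
first_kept = df.drop_duplicates(subset=["key"], keep="first")

# Keep the last row seen for each key instead.
last_kept = df.drop_duplicates(subset=["key"], keep="last")

# Drop every row whose key occurs more than once.
unique_only = df.drop_duplicates(subset=["key"], keep=False)

print(first_kept)
print(last_kept)
print(unique_only)
```

Unlike the recipe above, this sketch leaves `df` untouched and returns new DataFrames, which makes it easier to compare the three variants side by side.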

arnaudde · February 2021

Hello,
Could you provide an example of your input column and your expected output so that we can better understand what you want to do?

Best,
Arnaud

dave · February 2021

@arnaudde,

Below is a very simple example of what I am looking to create with a visual recipe or the data prep functionality.

A        B    C
123XHY   456  C001
456BCNH  123  T003
789YBQ   801  X009

I want to remove the rows that contain 123XHY.

Hope this helps.

Rgds,

Dave
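
If the goal in the example above is to drop every row whose column A equals 123XHY (rather than to deduplicate), a plain pandas filter may be enough. This is only a sketch under that assumption; it rebuilds the three sample rows from the post by hand and is not a Dataiku visual recipe.

```python
import pandas as pd

# Sample rows copied from the example table in the post.
df = pd.DataFrame({
    "A": ["123XHY", "456BCNH", "789YBQ"],
    "B": [456, 123, 801],
    "C": ["C001", "T003", "X009"],
})

# Keep only the rows where column A is not the unwanted value,
# then renumber the index so it starts at 0 again.
filtered = df[df["A"] != "123XHY"].reset_index(drop=True)

print(filtered)
```

Inside a Python recipe, the same filter could be applied to the DataFrame returned by `get_dataframe()` before writing the output.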

Karthikeyanvenk · May 29

How can I remove duplicate rows based on a single column without Python code?

Turribeach · May 29

Hi, this is a solved thread from 2021. Please start a new thread with your question.
