Remove duplicate
Hi,
I have gone through a few of the posts on removing duplicates, but none of them gives a clear answer.
Could you please show how I can use a column as a condition so that, when its value repeats, the repeated value and its entire row are no longer counted in the output?
K.Rgds,
Kalpesh
Best Answer
-
Hello Dave,
If you want to remove duplicates based on only one column (i.e. column A in your example), there is no visual recipe solution, as described in this post. I would use a Python recipe with code like this, as Matt suggested; you only need to replace "input_dataset", "my_key_column" and "output_dataset" in the sample.

    # -*- coding: utf-8 -*-
    import dataiku
    import pandas as pd

    # Recipe inputs
    df = dataiku.Dataset("input_dataset").get_dataframe()

    # Keep the first row for each value of the key column
    df.drop_duplicates(subset=["my_key_column"], inplace=True)
    # or, to use all columns to compare for duplicates:
    # df.drop_duplicates(inplace=True)

    # Recipe outputs
    out = dataiku.Dataset("output_dataset")
    out.write_with_schema(df)

If you want to remove duplicates based on all columns you can either use the Distinct visual recipe or the commented line in the above code sample.
Best,
Arnaud
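
To make the behaviour of `drop_duplicates` concrete, here is a minimal sketch in plain pandas (no Dataiku API, so it can be run anywhere) using the sample rows posted in this thread. The variable names `first_only` and `no_repeats` are only illustrative. It shows two possible readings of the question: keeping one row per value of column A, or dropping every row whose A value is repeated.

```python
import pandas as pd

# Sample rows copied from the example posted in this thread.
df = pd.DataFrame(
    {
        "A": ["123XHY", "123XHY", "123XHY", "456BCNH", "789YBQ"],
        "B": [456, 456, 456, 123, 801],
        "C": ["C001", "C001", "C001", "T003", "X009"],
    }
)

# Reading 1: keep the first row for each value of A, drop the later repeats.
first_only = df.drop_duplicates(subset=["A"], keep="first").reset_index(drop=True)

# Reading 2: drop every row whose A value occurs more than once,
# so no 123XHY row survives at all.
no_repeats = df.drop_duplicates(subset=["A"], keep=False).reset_index(drop=True)

print(first_only)
print(no_repeats)
```

With this input, the first reading leaves three rows (one per distinct value of A) and the second leaves only the 456BCNH and 789YBQ rows. In a Dataiku Python recipe the same call would sit between reading the input dataset and writing the output dataset.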
Answers
-
Hello,
Could you provide an example of your input column and your expected output so that we can better understand what you want to do?
Best,
Arnaud
-
Here is a very simple example of what I am looking to do with a visual recipe or data prep functionality:
A        B    C
123XHY   456  C001
123XHY   456  C001
123XHY   456  C001
456BCNH  123  T003
789YBQ   801  X009
I want to remove the repeated rows that have 123XHY in column A.
Hope this helps.
Rgds,
Dave
-
How can I remove duplicate rows based on a single column without Python code?
-
Turribeach
Hi, this is a solved thread from 2021. Please start a new thread with your question.