Hi,
I have gone through a few of the posts on removing duplicates, but none of them gives a clear answer.
Could you please show me how to use a column as the condition, so that when its value repeats, the repeated rows are no longer counted in the output?
K.Rgds,
Kalpesh
Hello,
Could you provide an example of your input column and your expected output so that we can better understand what you want to do?
Best,
Arnaud
Here is a very simple example of what I am looking to do with a visual recipe or the data preparation functionality:
A B C
123XHY 456 C001
123XHY 456 C001
123XHY 456 C001
456BCNH 123 T003
789YBQ 801 X009
I want to remove the rows in which 123XHY is repeated.
Hope this helps.
Rgds,
Dave
Hello Dave,
If you want to remove duplicates based on only one column (i.e. column A in your example), there is no visual recipe solution, as described in this post. I would use a Python recipe with code like the following, as Matt suggested; you only need to replace "input_dataset", "my_key_column" and "output_dataset" in the sample.
# -*- coding: utf-8 -*-
import dataiku
import pandas as pd
# Recipe inputs
df = dataiku.Dataset("input_dataset").get_dataframe()
# Keep the first row for each value of the key column and drop the rest
df = df.drop_duplicates(subset=["my_key_column"])
# or, to compare all columns when looking for duplicates:
# df = df.drop_duplicates()
# Recipe outputs
dataiku.Dataset("output_dataset").write_with_schema(df)
If you want to remove duplicates based on all columns, you can either use the Distinct visual recipe or the commented line in the code sample above.
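To make the behaviour concrete, here is a minimal sketch that runs outside Dataiku, using plain pandas on the sample rows from the example earlier in the thread. The variable names (first_kept, none_kept) are only illustrative. It compares two readings of the request: keeping one row per value of column A, or dropping every row whose value of A is repeated.

```python
import pandas as pd

# Sample rows copied from the example earlier in the thread
df = pd.DataFrame(
    {
        "A": ["123XHY", "123XHY", "123XHY", "456BCNH", "789YBQ"],
        "B": [456, 456, 456, 123, 801],
        "C": ["C001", "C001", "C001", "T003", "X009"],
    }
)

# keep="first" (the pandas default): one row per value of A survives
first_kept = df.drop_duplicates(subset=["A"], keep="first")

# keep=False: every row whose value of A occurs more than once is dropped
none_kept = df.drop_duplicates(subset=["A"], keep=False)

print(first_kept["A"].tolist())  # ['123XHY', '456BCNH', '789YBQ']
print(none_kept["A"].tolist())   # ['456BCNH', '789YBQ']
```

If the goal is to get rid of 123XHY entirely rather than keep a single copy, keep=False is probably the option you want; otherwise the default is enough.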
Best,
Arnaud