Remove duplicate

dave Registered Posts: 17 ✭✭✭✭

Hi,

I have gone through a few of the posts on removing duplicates, but none of them gives a clear answer.

Could you please show how I can use a column as a condition, so that when a value in that column repeats, the entire row with the repeated value is left out of the output?

Kind regards,

Kalpesh

Best Answer

  • arnaudde
    arnaudde Dataiker Posts: 52 Dataiker
    edited July 17 Answer ✓

    Hello Dave,
    If you want to remove duplicates based on only one column (i.e. column A in your example), there is no visual recipe solution, as described in this post. I would use a Python recipe with code like this, as Matt suggested; you only need to replace "input_dataset", "my_key_colum" and "output_dataset" in the sample.

    # -*- coding: utf-8 -*-
    import dataiku
    import pandas as pd
    
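    # Recipe inputs: load the input dataset into a pandas DataFrame
    df = dataiku.Dataset("input_dataset").get_dataframe()
    
    # Keep only the first row for each distinct value of the key column
    df = df.drop_duplicates(subset=["my_key_colum"])
    # or, to use all columns to compare for duplicates:
    # df = df.drop_duplicates()
    
    # Recipe outputs: write the deduplicated rows and set the output schema
    # (write_with_schema returns nothing, so there is no result to keep)
    dataiku.Dataset("output_dataset").write_with_schema(df)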

    If you want to remove duplicates based on all columns, you can either use the Distinct visual recipe or the commented line in the above code sample.
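
To see the difference between the two modes outside of a recipe, here is a minimal pandas-only sketch. The sample data and the column names "A" and "B" are made up for illustration; "A" stands in for the key column. It also shows the `keep` parameter of `drop_duplicates`, which was not part of the sample above and controls which of the repeated rows survives.

```python
import pandas as pd

# Hypothetical sample data; "A" plays the role of the key column.
df = pd.DataFrame({
    "A": ["x", "x", "y", "z", "z"],
    "B": [1, 2, 3, 4, 5],
})

# Duplicates judged on column A only: the first row seen for each value is kept.
by_key = df.drop_duplicates(subset=["A"])

# keep="last" retains the last row seen for each value of A instead.
by_key_last = df.drop_duplicates(subset=["A"], keep="last")

# Without subset, rows must match in every column to count as duplicates,
# so nothing is dropped from this sample.
all_cols = df.drop_duplicates()

print(by_key)
print(by_key_last)
print(all_cols)
```

If the row that should survive depends on another column (for example the most recent one), sorting the dataframe before calling `drop_duplicates` is one way to control that.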
    Best,
    Arnaud
