Remove duplicate rows in one column

How can I remove duplicated rows in one column ?

Answers

  • jereze
    jereze Alpha Tester, Dataiker Alumni Posts: 190 ✭✭✭✭✭✭✭✭

    There is currently no way to do that in a visual preparation recipe* (because a visual recipe more or less works row by row, and it cannot work on a full column, as it is designed for big data).

    It's possible to do so in a visual GROUP recipe: click “Show mass actions”, select all columns, click “use as grouping keys”. If the csv is very big, I suggest synchronizing to a SQL DB first.

    You can also do so in coding recipes:

    • In a Python recipe, you can use the Pandas function (see example below) drop_duplicates()
    • In a R recipe, you have several alternative (duplicated(), dplyr, ..): read here
    • In a SQL recipe, I would use a a group by with min or max, or window function with partition by key and keep the first row.

    * There is actually one way to do it in a visual preparation recipe, with a custom Python function, but that will not work all the time (if the recipe is multi-threaded), so I would not recommend this trick:

    I hope that helps,

    Jeremy

  • Mattsco
    Mattsco Dataiker, Registered Posts: 125 Dataiker
    edited July 17

    Hi,



    In a python recipe I would do:


    # -*- coding: utf-8 -*-
    import dataiku
    import pandas as pd

    # Recipe inputs
    df = dataiku.Dataset("input_dataset").get_dataframe()

    df.drop_duplicates(subset=["my_key_colum"], inplace=True)
    # or
    # df.drop_duplicates(inplace=True)
    # to use all columns to compare for duplicates

    # Recipe outputs
    out = dataiku.Dataset("output_dataset").write_with_schema(df)

    Matt

  • chemical
    chemical Registered Posts: 5
    edited July 17

    Hello, sorry bringing the thread up, I'd need support because I'm trying deduplicating my output with Python but I can't understand how to edit the code. I need to deduplicate TDD_FT_CUSTOMER_JOURNEY_joined_filtered_joined because - as you can see from the screen below - I have for the same ID (first column) duplicate values. Indeed the column CD_BOOKING also has the same repeating values.

    dataiku_ded.PNG

    This is the Pyton code that I haven't touched yet.

    # -*- coding: utf-8 -*-
    import dataiku
    import pandas as pd, numpy as np
    from dataiku import pandasutils as pdu
    
    # Read recipe inputs
    TDD_FT_CUSTOMER_JOURNEY_joined_filtered_joined = dataiku.Dataset("TDD_FT_CUSTOMER_JOURNEY_joined_filtered_joined")
    TDD_FT_CUSTOMER_JOURNEY_joined_filtered_joined_df = TDD_FT_CUSTOMER_JOURNEY_joined_filtered_joined.get_dataframe()
    
    
    # Compute recipe outputs from inputs
    # TODO: Replace this part by your actual code that computes the output, as a Pandas dataframe
    # NB: DSS also supports other kinds of APIs for reading and writing data. Please see doc.
    
    phone_deduplicated_df = TDD_FT_CUSTOMER_JOURNEY_joined_filtered_joined_df # For this sample code, simply copy input to output
    
    
    # Write recipe outputs
    phone_deduplicated = dataiku.Dataset("phone_deduplicated")
    phone_deduplicated.write_with_schema(phone_deduplicated_df)

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,090 Neuron

    Hi Chemical, I suggest you start a new thread since your problem is different. Also you say you want to do it in SQL but you posted Python code.

  • chemical
    chemical Registered Posts: 5

    Ciao Turribeach, I've slightly adjusted my post in order to be better understandable now, I think my question fits enough with what was asked in this thread.

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,090 Neuron

    Hi, because you are not the original poster of the question you can't mark the question as answered or select the correct response. So I suggest you repost as a new thread.

Setup Info
    Tags
      Help me…