Remove duplicate rows in one column

Léa · ‎03-09-2016

How can I remove duplicated rows in one column ?

jereze · ‎03-10-2016

There is currently no way to do that in a visual preparation recipe* (because a visual recipe more or less works row by row, and it cannot work on a full column, as it is designed for big data).

It's possible to do so in a visual GROUP recipe: click “Show mass actions”, select all columns, click “use as grouping keys”. If the csv is very big, I suggest synchronizing to a SQL DB first.

You can also do so in coding recipes:

In a Python recipe, you can use the Pandas function (see example below) drop_duplicates()

In a R recipe, you have several alternative (duplicated(), dplyr, ..): read here

In a SQL recipe, I would use a a group by with min or max, or window function with partition by key and keep the first row.

* There is actually one way to do it in a visual preparation recipe, with a custom Python function, but that will not work all the time (if the recipe is multi-threaded), so I would not recommend this trick:

I hope that helps,

Jeremy

Jeremy, Product Manager at Dataiku

Mattsco · ‎03-10-2016

Hi,

In a python recipe I would do:


# -*- coding: utf-8 -*-
import dataiku
import pandas as pd

# Recipe inputs
df = dataiku.Dataset("input_dataset").get_dataframe()

df.drop_duplicates(subset=["my_key_colum"], inplace=True)
# or
# df.drop_duplicates(inplace=True)
# to use all columns to compare for duplicates

# Recipe outputs
out = dataiku.Dataset("output_dataset").write_with_schema(df)

Matt

Mattsco

chemical · ‎07-07-2023

Hello, sorry bringing the thread up, I'd need support because I'm trying deduplicating my output with Python but I can't understand how to edit the code. I need to deduplicate TDD_FT_CUSTOMER_JOURNEY_joined_filtered_joined because - as you can see from the screen below - I have for the same ID (first column) duplicate values. Indeed the column CD_BOOKING also has the same repeating values.

This is the Pyton code that I haven't touched yet.

# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu

# Read recipe inputs
TDD_FT_CUSTOMER_JOURNEY_joined_filtered_joined = dataiku.Dataset("TDD_FT_CUSTOMER_JOURNEY_joined_filtered_joined")
TDD_FT_CUSTOMER_JOURNEY_joined_filtered_joined_df = TDD_FT_CUSTOMER_JOURNEY_joined_filtered_joined.get_dataframe()


# Compute recipe outputs from inputs
# TODO: Replace this part by your actual code that computes the output, as a Pandas dataframe
# NB: DSS also supports other kinds of APIs for reading and writing data. Please see doc.

phone_deduplicated_df = TDD_FT_CUSTOMER_JOURNEY_joined_filtered_joined_df # For this sample code, simply copy input to output


# Write recipe outputs
phone_deduplicated = dataiku.Dataset("phone_deduplicated")
phone_deduplicated.write_with_schema(phone_deduplicated_df)

Turribeach · ‎07-07-2023

Hi Chemical, I suggest you start a new thread since your problem is different. Also you say you want to do it in SQL but you posted Python code.

chemical · ‎07-08-2023

Ciao Turribeach, I've slightly adjusted my post in order to be better understandable now, I think my question fits enough with what was asked in this thread.

Turribeach · ‎07-08-2023

Hi, because you are not the original poster of the question you can't mark the question as answered or select the correct response. So I suggest you repost as a new thread.

Remove duplicate rows in one column

Remove duplicate rows in one column

Labels

Preparation

Visual recipes

Sign up to take part

Remove duplicate rows in one column

Remove duplicate rows in one column

Labels

Preparation

Visual recipes