Remove duplicate rows in one column
Answers
-
There is currently no way to do that in a visual preparation recipe* (because a visual recipe more or less works row by row, and it cannot work on a full column, as it is designed for big data).
It's possible to do so in a visual GROUP recipe: click “Show mass actions”, select all columns, click “use as grouping keys”. If the csv is very big, I suggest synchronizing to a SQL DB first.
You can also do so in coding recipes:
- In a Python recipe, you can use the Pandas function (see example below) drop_duplicates()
- In a R recipe, you have several alternative (duplicated(), dplyr, ..): read here
- In a SQL recipe, I would use a a group by with min or max, or window function with partition by key and keep the first row.
* There is actually one way to do it in a visual preparation recipe, with a custom Python function, but that will not work all the time (if the recipe is multi-threaded), so I would not recommend this trick:
I hope that helps,
Jeremy -
Hi,
In a python recipe I would do:
# -*- coding: utf-8 -*-
import dataiku
import pandas as pd
# Recipe inputs
df = dataiku.Dataset("input_dataset").get_dataframe()
df.drop_duplicates(subset=["my_key_colum"], inplace=True)
# or
# df.drop_duplicates(inplace=True)
# to use all columns to compare for duplicates
# Recipe outputs
out = dataiku.Dataset("output_dataset").write_with_schema(df)Matt
-
Hello, sorry bringing the thread up, I'd need support because I'm trying deduplicating my output with Python but I can't understand how to edit the code. I need to deduplicate TDD_FT_CUSTOMER_JOURNEY_joined_filtered_joined because - as you can see from the screen below - I have for the same ID (first column) duplicate values. Indeed the column CD_BOOKING also has the same repeating values.
This is the Pyton code that I haven't touched yet.
# -*- coding: utf-8 -*- import dataiku import pandas as pd, numpy as np from dataiku import pandasutils as pdu # Read recipe inputs TDD_FT_CUSTOMER_JOURNEY_joined_filtered_joined = dataiku.Dataset("TDD_FT_CUSTOMER_JOURNEY_joined_filtered_joined") TDD_FT_CUSTOMER_JOURNEY_joined_filtered_joined_df = TDD_FT_CUSTOMER_JOURNEY_joined_filtered_joined.get_dataframe() # Compute recipe outputs from inputs # TODO: Replace this part by your actual code that computes the output, as a Pandas dataframe # NB: DSS also supports other kinds of APIs for reading and writing data. Please see doc. phone_deduplicated_df = TDD_FT_CUSTOMER_JOURNEY_joined_filtered_joined_df # For this sample code, simply copy input to output # Write recipe outputs phone_deduplicated = dataiku.Dataset("phone_deduplicated") phone_deduplicated.write_with_schema(phone_deduplicated_df)
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,090 Neuron
Hi Chemical, I suggest you start a new thread since your problem is different. Also you say you want to do it in SQL but you posted Python code.
-
Ciao Turribeach, I've slightly adjusted my post in order to be better understandable now, I think my question fits enough with what was asked in this thread.
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,090 Neuron
Hi, because you are not the original poster of the question you can't mark the question as answered or select the correct response. So I suggest you repost as a new thread.