
Remove duplicate rows in one column

Léa
How can I remove duplicate rows in one column?
jereze
Dataiker

There is currently no way to do that in a visual preparation recipe* (a visual preparation recipe works more or less row by row and cannot operate on a full column, because it is designed for big data).



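It is possible in a visual Group recipe: click “Show mass actions”, select all columns, then click “use as grouping keys”. This removes rows that are identical across all columns; to deduplicate on a single column, use only that column as the grouping key and choose an aggregate for the others. If the CSV is very big, I suggest syncing it to a SQL database first.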



You can also do so in coding recipes:




  • In a Python recipe, you can use the Pandas function drop_duplicates() (see the example below).

  • In an R recipe, you have several alternatives (duplicated(), dplyr, ...): read here

  • In a SQL recipe, I would use a GROUP BY with MIN or MAX, or a window function with PARTITION BY on the key, keeping the first row.
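
To make the SQL option more concrete, here is a minimal sketch of both query shapes, run against an in-memory SQLite database from Python so it can be tried outside DSS. The table name input_dataset and the columns my_key_column and amount are made up for illustration, and the exact syntax may need adjusting for your SQL database.

```python
import sqlite3

# Toy table standing in for the recipe input (names are illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE input_dataset (my_key_column TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO input_dataset VALUES (?, ?)",
    [("a", 10), ("a", 7), ("b", 3), ("c", 5), ("c", 9)],
)

# Option 1: GROUP BY the key and aggregate the other column with MIN.
group_by_sql = """
    SELECT my_key_column, MIN(amount) AS amount
    FROM input_dataset
    GROUP BY my_key_column
    ORDER BY my_key_column
"""

# Option 2: number the rows inside each key partition and keep the first one.
# Window functions need SQLite 3.25 or later.
window_sql = """
    SELECT my_key_column, amount
    FROM (
        SELECT my_key_column, amount,
               ROW_NUMBER() OVER (
                   PARTITION BY my_key_column ORDER BY amount
               ) AS rn
        FROM input_dataset
    ) AS ranked
    WHERE rn = 1
    ORDER BY my_key_column
"""

group_by_rows = conn.execute(group_by_sql).fetchall()
window_rows = conn.execute(window_sql).fetchall()
conn.close()

print(group_by_rows)
print(window_rows)
```

With a single extra column the two queries return the same rows. The window version is usually the better fit when there are several other columns, because it keeps one whole row per key instead of aggregating each column separately.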



 



 



* There is actually one way to do it in a visual preparation recipe, with a custom Python function, but that will not work all the time (if the recipe is multi-threaded), so I would not recommend this trick:
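
As a rough illustration only (plain Python, not the DSS custom function API, and not necessarily the exact trick meant above), a row-by-row deduplication has to remember the keys it has already seen. The sketch below assumes that each worker keeps its own memory, which is one plausible reason such a trick breaks when rows are split across threads. The names make_dedup_filter, rows, and city are hypothetical.

```python
def make_dedup_filter(key):
    """Return a row-level predicate that remembers the key values it has seen."""
    seen = set()

    def keep(row):
        value = row[key]
        if value in seen:
            return False  # already seen by this worker: drop the row
        seen.add(value)
        return True

    return keep


rows = [
    {"id": 1, "city": "Paris"},
    {"id": 2, "city": "Lyon"},
    {"id": 3, "city": "Paris"},
    {"id": 4, "city": "Lyon"},
]

# One worker sees every row, so duplicates on "city" are removed.
keep = make_dedup_filter("city")
single_worker = [r for r in rows if keep(r)]

# Two workers, each with its own state and its own slice of the rows:
# the same city survives once per worker.
worker_a = make_dedup_filter("city")
worker_b = make_dedup_filter("city")
two_workers = [r for r in rows[:2] if worker_a(r)] + [r for r in rows[2:] if worker_b(r)]

print(len(single_worker), len(two_workers))
```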





 



I hope that helps,

Jeremy

Jeremy, Product Manager at Dataiku
Mattsco
Dataiker

Hi,



In a Python recipe, I would do:




# -*- coding: utf-8 -*-
import dataiku
import pandas as pd

# Recipe inputs: load the input dataset as a pandas DataFrame
df = dataiku.Dataset("input_dataset").get_dataframe()

# Keep the first row for each distinct value of the key column
df.drop_duplicates(subset=["my_key_column"], inplace=True)
# or, to compare all columns when looking for duplicates:
# df.drop_duplicates(inplace=True)

# Recipe outputs: write the deduplicated rows and their schema
output = dataiku.Dataset("output_dataset")
output.write_with_schema(df)
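
If you want to check which rows drop_duplicates() keeps before running the recipe, here is a small standalone sketch on toy data (no DSS needed). The column names and values are invented for illustration; the keep parameter controls which occurrence survives.

```python
import pandas as pd

# Toy data: "a" and "c" each appear twice in the key column.
df = pd.DataFrame({
    "my_key_column": ["a", "a", "b", "c", "c"],
    "amount": [10, 7, 3, 5, 9],
})

# keep="first" (the default) keeps the first row of each key.
first_kept = df.drop_duplicates(subset=["my_key_column"], keep="first")

# keep="last" keeps the last row of each key instead.
last_kept = df.drop_duplicates(subset=["my_key_column"], keep="last")

# keep=False drops every row whose key appears more than once.
none_kept = df.drop_duplicates(subset=["my_key_column"], keep=False)

print(first_kept)
print(last_kept)
print(none_kept)
```

Because "first" and "last" depend on row order, sorting the DataFrame before the call is one way to control which duplicate is kept.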


Matt



 
