
Remove duplicate rows in one column

Léa
Level 1
How can I remove rows that have duplicate values in one column?
jereze
Community Manager

There is currently no way to do that in a visual preparation recipe*, because a visual preparation recipe works more or less row by row: it is designed for big data and cannot operate on a full column at once.



It's possible to do so in a visual Group recipe: click "Show mass actions", select all columns, then click "use as grouping keys". If the CSV is very big, I suggest syncing it to a SQL database first.
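
As a rough illustration of what grouping on every column does, here is a small pandas sketch. It is not how Dataiku implements the Group recipe, and the sample data and column names are made up for the example.

```python
import pandas as pd

# Hypothetical sample data: the first two rows are exact duplicates
df = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "city": ["Paris", "Paris", "Lyon", "Nice"],
})

# Using every column as a grouping key yields one row per distinct
# combination of values, plus a count of how many rows were collapsed
grouped = df.groupby(list(df.columns)).size().reset_index(name="count")
print(grouped)
```

Note that id 2 still appears twice, because the two rows differ in "city". To keep a single row per value of one column, that column alone would have to be the grouping key, with an aggregate (for example first, min or max) chosen for each remaining column.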



You can also do so in coding recipes:




  • In a Python recipe, you can use the Pandas function drop_duplicates() (see example below)

  • In an R recipe, you have several alternatives (duplicated(), dplyr, ...)

  • In a SQL recipe, I would use a GROUP BY with MIN or MAX, or a window function partitioned by the key, keeping the first row of each partition.
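
To make the SQL option a bit more concrete, here is a minimal sketch of the GROUP BY with MIN approach. It runs the query from Python against an in-memory SQLite database so it can be tried outside DSS; the table name, column names and data are assumptions for the example, and the exact SQL dialect in your recipe depends on your database.

```python
import sqlite3

# In-memory database with a hypothetical table containing duplicate keys
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE input_table (my_key TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO input_table VALUES (?, ?)",
    [("a", 10), ("a", 30), ("b", 20), ("b", 5)],
)

# One output row per key, keeping the smallest amount for each key
rows = conn.execute(
    "SELECT my_key, MIN(amount) AS amount "
    "FROM input_table GROUP BY my_key ORDER BY my_key"
).fetchall()
conn.close()

print(rows)  # [('a', 10), ('b', 5)]
```

With several non-key columns, taking MIN or MAX of each one can combine values that come from different rows. The window function variant avoids this because it keeps one actual row per key.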



 



 



* There is actually one way to do it in a visual preparation recipe, using a custom Python function, but it does not work reliably (it fails if the recipe runs multi-threaded), so I would not recommend this trick.





 



I hope that helps,

Jeremy

Jeremy, Product Manager at Dataiku
Mattsco
Dataiker

Hi,



In a Python recipe, I would do:




# -*- coding: utf-8 -*-
import dataiku
import pandas as pd

# Recipe inputs: load the input dataset as a pandas DataFrame
df = dataiku.Dataset("input_dataset").get_dataframe()

# Keep only the first row for each value of the key column
df = df.drop_duplicates(subset=["my_key_column"])
# or, to compare all columns when looking for duplicates:
# df = df.drop_duplicates()

# Recipe outputs: write the deduplicated DataFrame and its schema
out = dataiku.Dataset("output_dataset")
out.write_with_schema(df)
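
If it matters which of the duplicated rows survives, the keep parameter of drop_duplicates() controls that. Here is a small standalone sketch with made-up data (no Dataiku datasets involved), just to show the three options:

```python
import pandas as pd

# Hypothetical data: each key appears twice
df = pd.DataFrame({
    "my_key_column": ["a", "a", "b", "b"],
    "value": [1, 2, 3, 4],
})

# keep="first" (the default) keeps the first row seen for each key
first = df.drop_duplicates(subset=["my_key_column"], keep="first")

# keep="last" keeps the last row seen for each key
last = df.drop_duplicates(subset=["my_key_column"], keep="last")

# keep=False drops every row whose key appears more than once
unique_only = df.drop_duplicates(subset=["my_key_column"], keep=False)

print(first["value"].tolist())  # [1, 3]
print(last["value"].tolist())   # [2, 4]
print(len(unique_only))         # 0
```

Since "first" and "last" depend on row order, sorting the DataFrame before dropping duplicates is a simple way to decide which row is kept.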


Matt



 
