
Duplicate removal

Karthikeyanvenk
Level 1

How can I remove duplicate rows based on a single column, without Python code?

louisbarjon
Dataiker

Hello,

You can use a Group recipe with this column as the group key. The output will then have exactly one row per distinct value of this column. But then you need to decide what you want to do with the cells from the other columns.

DSS provides a lot of choices for this:

[Screenshot 2024-05-29 at 11.23.28.png]

To explain some of them:

  1. Concat: concatenates the values from all the grouped rows; you can specify the separator
  2. Avg: for numerical types such as integers, computes the average
  3. Distinct: computes the number of distinct values found in that column
  4. For the rest, have a look at the documentation
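
No code is needed to do this in DSS, but for readers who find it easier to reason from code, the logic of the aggregations listed above can be sketched in pandas. This is only an illustration of what such a grouping computes, not how DSS implements it; the column names (`customer_id`, `city`, `amount`) and the sample data are made up for the example.

```python
import pandas as pd

# Hypothetical sample data: "customer_id" is the column to deduplicate on.
df = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "city": ["Paris", "Lyon", "Rome", "Rome", "Milan"],
    "amount": [10, 20, 5, 15, 10],
})

# One output row per customer_id; every other column needs an aggregation.
grouped = df.groupby("customer_id", as_index=False).agg(
    # Concat: join all values of the group with a chosen separator
    city_concat=("city", lambda s: ", ".join(s)),
    # Avg: average of a numerical column
    amount_avg=("amount", "mean"),
    # Distinct: number of distinct values in the group
    city_distinct=("city", "nunique"),
)

print(grouped)
```

The key point is the same as in the Group recipe: the key column becomes unique, and each remaining column is reduced to a single value per group by whichever aggregation you pick.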

 

 

tgb417

@Karthikeyanvenk ,

Welcome to the Dataiku Community.  We are so glad to have you join us.

There are a number of ways to remove duplicates.

Some are described in this thread.

https://community.dataiku.com/t5/Using-Dataiku/How-to-identify-duplicates-in-a-data-set/m-p/25831

When I need to remove duplicates reliably, and I know how to order the duplicate records so that I keep the ones I want and remove the rest, I tend to use the Window recipe.

I tend to use the method described in this community post.  

https://community.dataiku.com/t5/Using-Dataiku/Is-there-a-way-to-conditionally-delete-duplicates-bas...
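
For readers who want to see the idea spelled out, here is a minimal pandas sketch of one common form of the window approach: partition by the key column, order the rows within each partition, number them, and keep only the first. I am assuming this is the general shape of the trick; the linked post may differ in its details. The column names (`customer_id`, `updated_at`, `city`) and the data are hypothetical.

```python
import pandas as pd

# Hypothetical sample data: keep the most recent row per customer_id.
df = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "updated_at": pd.to_datetime([
        "2024-01-01", "2024-03-01", "2024-02-01", "2024-02-15", "2024-01-10",
    ]),
    "city": ["Paris", "Lyon", "Rome", "Milan", "Rome"],
})

# Order rows so that the row to keep comes first within each key
# (here: newest "updated_at" first).
ranked = df.sort_values(
    ["customer_id", "updated_at"], ascending=[True, False]
).copy()

# Number the rows within each partition, starting at 1
# (the equivalent of a row-number window column).
ranked["row_number"] = ranked.groupby("customer_id").cumcount() + 1

# Keep only the first row of each partition and drop the helper column.
deduped = (
    ranked[ranked["row_number"] == 1]
    .drop(columns="row_number")
    .reset_index(drop=True)
)

print(deduped)
```

Unlike a grouping, this keeps one complete original row per key, so there is no need to choose an aggregation for the other columns; the ordering decides which duplicate survives.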

I also note that there is a Distinct visual recipe. I think it was added to Dataiku DSS after I learned the window trick.

https://knowledge.dataiku.com/latest/data-preparation/visual-recipes/tutorial-distinct-recipe.html

Hope one of these ways helps.  Let us know how you are getting along with the project you are working on.

--Tom