Delete a row if conditions are me

cwentz · October 2021

Hello Community Members!!

I want to be able to remove a row if the ID is duplicated.

If [col] = _____ and [col] = ____ then Delete

However, delete is not the term.

Thank you for your help. Christy

tgb417 · October 2021

@cwentz

I’m wondering about using the distinct or group by recipe?

If all of the columns that share the same ID will be the same values in all of the columns for that row, you could use a distinct recipe and all the remaining rows will have only one instance of each ID and all of the values.

If not all the rows that share the same ID will have the same values in all columns, how are you going to figure out which rows or more difficult values to keep?

A group by recipe on the ID column with an added count will allow you to know which IDs are repeated. All the rows that are duplicated will end up with a count > 1.

Now the question is which row to keep. If you want to take the row with the earliest or latest admissions date for example you could do a minimum or maximum on that date column in your group by recipe. This would give you a key to decide which rows to keep. You can then use these keys that group by recipe created to do an inner join with the join recipie to filter down to just the rows identified by the group by recipe. (You still have to be careful that this key is actually unique. To resolve this you could add more columns if available.)

Here is a post on a similar topic but not the same topic. https://community.dataiku.com/t5/Using-Dataiku/Create-a-dataset-which-has-only-duplicate-rows-in-it/td-p/19230

There are definitely some other more sophisticated ways to do record linkage or record clustering in order to create canonical records from a clusters of records like you are describing. This kind of work is often done in the area of epidemiology.

Just a few thought. Hope that some of this might helps.

cwentz · October 2021

Thank you @tgb417 !!

tgb417 · October 2021

My pleasure.

For others who find this post. What of the above did you end up using in your use case.

cwentz · October 2021

The distinct reciepe

Delete a row if conditions are me

Best Answer

Answers

Categories

Setup Info

Tags