Delete a row if conditions are me

Solved!
cwentz
Level 3
Delete a row if conditions are me

Hello Community Members!!

 

I want to be able to remove a row if the ID is duplicated.

If [col] = _____ and [col] =  ____ then Delete

However, delete is not the term. 

Thank you for your help. Christy 

 

0 Kudos
1 Solution
tgb417

@cwentz 

Iโ€™m wondering about using the distinct or group by recipe?

If all of the columns that share the same ID will be the same values in all of the columns for that row, you could use a distinct recipe and all the remaining rows will have only one instance of each ID and all of the values.

If not all the rows that share the same ID will have the same values in all columns, how are you going to figure out which rows or more difficult values to keep?

A group by recipe on the ID column with an added count will allow you to know which IDs are repeated.  All the rows that are duplicated will end up with a count > 1.

Now the question is which row to keep.  If you want to take the row with the earliest or latest admissions date for example you could do a minimum or maximum on that date column in your group by recipe.  This would give you a key to decide which rows to keep.  You can then use these keys that group by recipe created to do an inner join with the join recipie to filter down to just the rows identified by the group by recipe. (You still have to be careful that this key is actually unique.  To resolve this you could add more columns if available.)

Here is a post on a similar topic but not the same topic.  https://community.dataiku.com/t5/Using-Dataiku/Create-a-dataset-which-has-only-duplicate-rows-in-it/...

There are definitely some other more sophisticated ways to do record linkage or record clustering in order to create canonical records from a clusters of records like you are describing.  This kind of work is often done in the area of epidemiology.  

Just a few thought.  Hope that some of this might helps.

--Tom

View solution in original post

0 Kudos
4 Replies
tgb417

@cwentz 

Iโ€™m wondering about using the distinct or group by recipe?

If all of the columns that share the same ID will be the same values in all of the columns for that row, you could use a distinct recipe and all the remaining rows will have only one instance of each ID and all of the values.

If not all the rows that share the same ID will have the same values in all columns, how are you going to figure out which rows or more difficult values to keep?

A group by recipe on the ID column with an added count will allow you to know which IDs are repeated.  All the rows that are duplicated will end up with a count > 1.

Now the question is which row to keep.  If you want to take the row with the earliest or latest admissions date for example you could do a minimum or maximum on that date column in your group by recipe.  This would give you a key to decide which rows to keep.  You can then use these keys that group by recipe created to do an inner join with the join recipie to filter down to just the rows identified by the group by recipe. (You still have to be careful that this key is actually unique.  To resolve this you could add more columns if available.)

Here is a post on a similar topic but not the same topic.  https://community.dataiku.com/t5/Using-Dataiku/Create-a-dataset-which-has-only-duplicate-rows-in-it/...

There are definitely some other more sophisticated ways to do record linkage or record clustering in order to create canonical records from a clusters of records like you are describing.  This kind of work is often done in the area of epidemiology.  

Just a few thought.  Hope that some of this might helps.

--Tom
0 Kudos
cwentz
Level 3
Author

Thank you @tgb417 !!

0 Kudos
tgb417

My pleasure.

For others who find this post.  What of the above did you end up using in your use case.

--Tom
0 Kudos
cwentz
Level 3
Author

The distinct reciepe

0 Kudos