How to identify duplicates in a data set?

Solved!
Shannie
Level 1
How to identify duplicates in a data set?

Normally in Excel I would highlight duplicate values, and since they require deeper research I do not want to delete them. What would be the best approach in Dataiku to accomplish this? I see I can use the values clustering in the Analyze dropdown, but I want to see these as part of the larger set. Is there also a way to export the results of the clustering?

1 Solution
tgb417

@Shannie 

Welcome to the Dataiku community.

I use a variety of tools when doing duplicate matching or record clustering.

Here is a knowledge base article about duplicate management.

https://knowledge.dataiku.com/latest/courses/resources/excel-to-dss/duplicates.html

Here is a community posting on this subject.

https://community.dataiku.com/t5/Using-Dataiku/Remove-Duplicates-based-on-one-column/td-p/9554 

If I don't have a unique record key, I'll typically create one in a visual recipe, as described in this thread.

https://community.dataiku.com/t5/Using-Dataiku/Complex-filters-or-queries-on-a-dataset/m-p/25709#M98...
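
If you would rather create that key in code, here is a rough sketch of what a Python recipe could look like. The dataset names are placeholders, not real ones from your project:

```python
import dataiku

# Read the input dataset into a pandas dataframe
# ("my_input" and "my_input_with_id" are placeholder dataset names)
df = dataiku.Dataset("my_input").get_dataframe()

# Add a simple surrogate record id based on row position
df["record_id"] = range(1, len(df) + 1)

# Write the result back out with the new column
dataiku.Dataset("my_input_with_id").write_with_schema(df)
```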

I'll then continue with a visual Prepare recipe to create a synthetic join key. This might come from a single column or from multiple columns, and it might use whole values or just fragments of them. A simple example might be brown/t/19086. These keys are usually text strings, so I'll also clean them up by trimming whitespace, removing special characters, and using the other NLP processors provided.
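
Just to illustrate the idea (this is a sketch, not a definitive recipe), building that kind of key in pandas might look like this, with made-up column names:

```python
import pandas as pd

# Toy data -- the column names and values are only for illustration
df = pd.DataFrame({
    "last_name":  [" Brown", "Smith", "BROWN "],
    "first_name": ["Tom", "Ann", "tom"],
    "zip":        ["19086", "10001", "19086"],
})

def clean(col):
    # Trim, lower-case, and drop anything that is not a letter or digit
    return (col.astype(str)
               .str.strip()
               .str.lower()
               .str.replace(r"[^a-z0-9]", "", regex=True))

# Synthetic key: cleaned last name + first initial + zip, e.g. "brown/t/19086"
df["dedup_key"] = (clean(df["last_name"]) + "/"
                   + clean(df["first_name"]).str[:1] + "/"
                   + clean(df["zip"]))
```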

Then I'll either do a self-join on the table, or use the visual Window or Group recipe, depending on what I'm trying to do. The records that share the same synthetic key, or that match on a fuzzy join of the synthetic key, are my duplicates.
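
Since you said you want to flag the duplicates for deeper research rather than delete them, the Group-style version of that in pandas is roughly this (continuing from the dedup_key above):

```python
# Count how many rows share each synthetic key, keeping every row
df["dup_count"] = df.groupby("dedup_key")["dedup_key"].transform("size")

# Flag anything that occurs more than once; nothing is removed
df["is_duplicate"] = df["dup_count"] > 1
```

If you would rather stay visual, a Window recipe partitioned on the key with a count aggregation gives you the same kind of flag.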

There is another, more advanced technique that I have been using recently in Python recipes and Python Jupyter notebooks. It is based on the pandas-dedupe Python library.
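
A very rough sketch of how I use it, with placeholder column names; as far as I recall, dedupe_dataframe walks you through labelling a few candidate pairs and then adds cluster/confidence columns to the output:

```python
# pip install pandas-dedupe
import pandas_dedupe

# Active-learning deduplication: you are prompted in the console or notebook
# to label a handful of candidate pairs as duplicates / not duplicates.
# The column names below are placeholders for whatever your key is built from.
deduped = pandas_dedupe.dedupe_dataframe(df, ["last_name", "first_name", "zip"])

# Rows that end up in the same cluster are the candidate duplicates
print(deduped.head())
```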

There has also been a bit of a discussion about creating a community plugin to provide support in this area.  

https://community.dataiku.com/t5/Product-Ideas/Entity-Resolution-Record-Linkage-Plug-In/idi-p/19821 

I hope some of this is of help. Please share a bit more about what you are trying to do, and others may jump in and provide further insights.

--Tom

Shannie
Level 1
Author

Thank you @tgb417 !!! Extremely helpful and much appreciated.