
How to identify duplicates in a data set?

Solved!
Shannie
Level 1

In Excel I would normally highlight duplicate values; because they require deeper research, I do not want to delete them. What would be the best approach in Dataiku to accomplish this? I see I can use the values clustering in the Analyze dropdown, but I want to see the duplicates as part of the larger data set. Is there also a way to export the results of the clustering?

tgb417
Neuron

@Shannie 

Welcome to the Dataiku community.

I use a variety of tools when doing duplicate matching or record clustering.

Here is a knowledge base article about duplicate management.

https://knowledge.dataiku.com/latest/courses/resources/excel-to-dss/duplicates.html

Here is a community posting on this subject.

https://community.dataiku.com/t5/Using-Dataiku/Remove-Duplicates-based-on-one-column/td-p/9554 

If I don’t have a unique record key I’ll typically create one in a visual recipe as described in this thread.

https://community.dataiku.com/t5/Using-Dataiku/Complex-filters-or-queries-on-a-dataset/m-p/25709#M98...

I’ll then continue with a visual Prepare recipe to create a synthetic join key. This might be built from a single column or from multiple columns, and from a whole column or just a fragment of one. A simple example might be brown/t/19086. These keys are usually text strings, and I’ll also clean them up by trimming, removing special characters, or using others of the NLP tools provided.
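As a rough illustration of the synthetic key idea, here is a minimal Python sketch that could be adapted to a Python recipe or notebook. The field names (last_name, first_name, zip_code) and the helper functions are hypothetical, chosen only so the result looks like the brown/t/19086 example; the visual Prepare recipe would do the same work with its own processors.

```python
import re

def clean(text):
    """Trim, lowercase, and drop anything that is not a letter or digit."""
    return re.sub(r"[^a-z0-9]", "", str(text).strip().lower())

def synthetic_key(record):
    """Build a key such as 'brown/t/19086' from hypothetical fields."""
    last = clean(record["last_name"])
    first_initial = clean(record["first_name"])[:1]  # fragment of a column
    zip5 = clean(record["zip_code"])[:5]             # first five characters only
    return f"{last}/{first_initial}/{zip5}"

# Made-up sample records, for illustration only.
records = [
    {"last_name": "Brown", "first_name": "Tom", "zip_code": "19086"},
    {"last_name": " BROWN ", "first_name": "thomas", "zip_code": "19086-1234"},
    {"last_name": "O'Neil", "first_name": "Ann", "zip_code": "02139"},
]

keys = [synthetic_key(r) for r in records]
for key in keys:
    print(key)
```

The first two sample records collapse to the same key even though their raw values differ in case, spacing, and ZIP format, which is the point of cleaning before matching.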

Then I’ll either do a self join on the table or use the visual Window or Group recipe, depending on what I’m trying to do. My dups will be the records that share the same synthetic key, or that match when I do a fuzzy join on the synthetic key.
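To keep the duplicates visible inside the full data set rather than deleting them, one option is to count rows per key and attach that count to every row. The sketch below shows the idea in plain Python, assuming each row already carries a key column (hypothetically named key); the column names key_count and is_duplicate are also made up for the example. A Window or Group step computing a count per key would play the same role in the Flow.

```python
from collections import Counter

# Made-up rows that already carry a synthetic key.
rows = [
    {"id": 1, "key": "brown/t/19086"},
    {"id": 2, "key": "oneil/a/02139"},
    {"id": 3, "key": "brown/t/19086"},
]

# Count how many rows share each key.
counts = Counter(row["key"] for row in rows)

# Keep every row; add a count and a flag instead of removing anything.
flagged = [
    {**row, "key_count": counts[row["key"]], "is_duplicate": counts[row["key"]] > 1}
    for row in rows
]

for row in flagged:
    print(row)
```

Filtering on the flag then gives the subset that needs deeper research, while the unfiltered output still shows those rows in context.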

There is another, more advanced technique that I have been using recently in Python recipes and Python Jupyter notebooks. It is based on the Python library pandas-dedupe.

There has also been a bit of a discussion about creating a community plugin to provide support in this area.  

https://community.dataiku.com/t5/Product-Ideas/Entity-Resolution-Record-Linkage-Plug-In/idi-p/19821 

Hope that some of this is a bit of support. Please share a bit more about what you are trying to do; others may jump in and provide other insights.

--Tom

Shannie
Level 1
Author

Thank you @tgb417 !!! Extremely helpful and much appreciated.