How to identify duplicates in a data set?

Solved!
Shannie
Level 1
How to identify duplicates in a data set?

Normally in Excel I would highlight duplicate values, and since they require deeper research I do not want to delete them. What would be the best approach in Dataiku to accomplish this? I see I can use the values clustering in the Analyze dropdown, but I want to see these as part of the larger set. Is there also a way to export the results of the clustering?

1 Solution
tgb417

@Shannie 

Welcome to the Dataiku community.

I use a variety of tools when doing duplicate matching or record clustering.

Here is a knowledge base article about duplicate management.

https://knowledge.dataiku.com/latest/courses/resources/excel-to-dss/duplicates.html

Here is a community posting on this subject.

https://community.dataiku.com/t5/Using-Dataiku/Remove-Duplicates-based-on-one-column/td-p/9554 

If I don't have a unique record key, I'll typically create one in a visual recipe, as described in this thread.

https://community.dataiku.com/t5/Using-Dataiku/Complex-filters-or-queries-on-a-dataset/m-p/25709#M98...
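
If you would rather create that key in code, here is a rough sketch of what a Python recipe could look like. The dataset names are placeholders, not real ones from your project:

```python
import dataiku

# Read the input dataset into a pandas dataframe
# ("my_input" and "my_input_with_id" are placeholder dataset names)
df = dataiku.Dataset("my_input").get_dataframe()

# Add a simple surrogate record id based on row position
df["record_id"] = range(1, len(df) + 1)

# Write the result back out with the new column
dataiku.Dataset("my_input_with_id").write_with_schema(df)
```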

I'll then continue with a visual Prepare recipe to create a synthetic join key. This might come from a single column or from multiple columns, and it might use whole values or just fragments of them. A simple example might be brown/t/19086. These keys are usually text strings, so I'll also clean them up by trimming whitespace, removing special characters, and using the other NLP processors provided.
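
Just to illustrate the idea (this is a sketch, not a definitive recipe), building that kind of key in pandas might look like this, with made-up column names:

```python
import pandas as pd

# Toy data -- the column names and values are only for illustration
df = pd.DataFrame({
    "last_name":  [" Brown", "Smith", "BROWN "],
    "first_name": ["Tom", "Ann", "tom"],
    "zip":        ["19086", "10001", "19086"],
})

def clean(col):
    # Trim, lower-case, and drop anything that is not a letter or digit
    return (col.astype(str)
               .str.strip()
               .str.lower()
               .str.replace(r"[^a-z0-9]", "", regex=True))

# Synthetic key: cleaned last name + first initial + zip, e.g. "brown/t/19086"
df["dedup_key"] = (clean(df["last_name"]) + "/"
                   + clean(df["first_name"]).str[:1] + "/"
                   + clean(df["zip"]))
```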

Then I'll either do a self-join on the table, or use the visual Window or Group recipe, depending on what I'm trying to do. The records that share the same synthetic key, or that match on a fuzzy join of the synthetic key, are my duplicates.
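
Since you said you want to flag the duplicates for deeper research rather than delete them, the Group-style version of that in pandas is roughly this (continuing from the dedup_key above):

```python
# Count how many rows share each synthetic key, keeping every row
df["dup_count"] = df.groupby("dedup_key")["dedup_key"].transform("size")

# Flag anything that occurs more than once; nothing is removed
df["is_duplicate"] = df["dup_count"] > 1
```

If you would rather stay visual, a Window recipe partitioned on the key with a count aggregation gives you the same kind of flag.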

There is another, more advanced technique that I have been using recently in Python recipes and Python Jupyter notebooks. It is based on the pandas-dedupe Python library.
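
A very rough sketch of how I use it, with placeholder column names; as far as I recall, dedupe_dataframe walks you through labelling a few candidate pairs and then adds cluster/confidence columns to the output:

```python
# pip install pandas-dedupe
import pandas_dedupe

# Active-learning deduplication: you are prompted in the console or notebook
# to label a handful of candidate pairs as duplicates / not duplicates.
# The column names below are placeholders for whatever your key is built from.
deduped = pandas_dedupe.dedupe_dataframe(df, ["last_name", "first_name", "zip"])

# Rows that end up in the same cluster are the candidate duplicates
print(deduped.head())
```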

There has also been a bit of a discussion about creating a community plugin to provide support in this area.  

https://community.dataiku.com/t5/Product-Ideas/Entity-Resolution-Record-Linkage-Plug-In/idi-p/19821 

I hope some of this is of help. Please share a bit more about what you are trying to do, and others may jump in and provide further insights.

--Tom

Shannie
Level 1
Author

Thank you @tgb417 !!! Extremely helpful and much appreciated.