How to identify duplicates in a data set?
In Excel I would normally highlight duplicate values; because they require deeper research, I do not want to delete them. What would be the best approach in Dataiku to accomplish this? I see I can use the values cluster option in the Analyze dropdown, but I want to see the duplicates as part of the larger set. Is there also a way to export the results of the cluster?
Best Answer
-
tgb417
Welcome to the Dataiku community.
I use a variety of tools when doing duplicate matching or record clustering.
Here is a knowledge base article about duplicate management.
https://knowledge.dataiku.com/latest/courses/resources/excel-to-dss/duplicates.html
Here is a community posting on this subject.
https://community.dataiku.com/t5/Using-Dataiku/Remove-Duplicates-based-on-one-column/td-p/9554
If I don’t have a unique record key I’ll typically create one in a visual recipe as described in this thread.
I’ll then continue with a visual Prepare recipe to create a synthetic join key. This might come from a single column or from multiple columns, and it might use a whole column or only a fragment of it. A simple example might be brown/t/19086. These are usually text strings, and I’ll also clean them up by trimming, removing special characters, or using others of the NLP tools provided.
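For anyone who prefers to see the idea in code, here is a minimal sketch of building such a key in plain Python. The field names (`last_name`, `first_name`, `zip`) and the helper names are hypothetical and chosen only for illustration; the reading of brown/t/19086 as last name, first initial, and 5-digit zip is also an assumption. In Dataiku the same steps would normally be done with Prepare recipe processors rather than code.

```python
import re
import unicodedata


def clean(text):
    """Lowercase, strip accents, and drop everything except letters and digits."""
    text = unicodedata.normalize("NFKD", str(text))
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]", "", text.lower())


def synthetic_key(record):
    """Build a key such as 'brown/t/19086' from hypothetical name and zip fields."""
    last = clean(record["last_name"])
    first_initial = clean(record["first_name"])[:1]  # fragment of a column
    zip5 = clean(record["zip"])[:5]                  # first five digits only
    return f"{last}/{first_initial}/{zip5}"


# Made-up sample records, for illustration only.
records = [
    {"last_name": " Brown ", "first_name": "Tom", "zip": "19086"},
    {"last_name": "BROWN", "first_name": "T.", "zip": "19086-1234"},
    {"last_name": "O'Brien", "first_name": "Zoë", "zip": "02139"},
]
keys = [synthetic_key(r) for r in records]
```

The first two records differ in case, whitespace, punctuation, and zip format, yet they end up with the same key, which is what makes them candidates for a closer look.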
Then I’ll either do a self join on the table or use the visual Window or Group recipe, depending on what I’m trying to do. The records that share the same synthetic key, or that match in a fuzzy join on the synthetic key, will be my duplicates.
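Since the original question was about flagging duplicates rather than deleting them, here is a rough sketch of that step in plain Python. It counts how many rows share each key (roughly what a Group or Window recipe would compute) and lists near-identical keys using `difflib` from the standard library as a stand-in for a fuzzy join. The column names `record_id` and `dedup_key`, the function names, and the 0.85 threshold are all assumptions made for the example, not anything Dataiku prescribes.

```python
from collections import Counter
from difflib import SequenceMatcher


def flag_duplicates(rows, key_field="dedup_key"):
    """Return copies of all rows with a group size and a duplicate flag; nothing is dropped."""
    counts = Counter(row[key_field] for row in rows)
    return [
        {**row, "dup_count": counts[row[key_field]], "is_duplicate": counts[row[key_field]] > 1}
        for row in rows
    ]


def fuzzy_pairs(rows, key_field="dedup_key", id_field="record_id", threshold=0.85):
    """Return (id_a, id_b, similarity) for pairs whose keys are similar but not identical."""
    pairs = []
    for i, a in enumerate(rows):
        for b in rows[i + 1:]:
            if a[key_field] == b[key_field]:
                continue  # exact matches are already covered by flag_duplicates
            score = SequenceMatcher(None, a[key_field], b[key_field]).ratio()
            if score >= threshold:
                pairs.append((a[id_field], b[id_field], round(score, 2)))
    return pairs


# Made-up sample rows, for illustration only.
rows = [
    {"record_id": 1, "dedup_key": "brown/t/19086"},
    {"record_id": 2, "dedup_key": "brown/t/19086"},
    {"record_id": 3, "dedup_key": "browne/t/19086"},
    {"record_id": 4, "dedup_key": "smith/a/10001"},
]
flagged = flag_duplicates(rows)   # same number of rows, two extra fields
near_matches = fuzzy_pairs(rows)  # candidate pairs for manual review
```

Every input row is kept, so the flagged rows can be reviewed in the context of the full set, much like highlighted cells in Excel. The pairwise comparison grows quadratically with the number of rows, so on larger data it would need some form of blocking first, for example comparing only rows that share a zip code.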
There is another, more advanced technique that I have been using recently in Python recipes and Python Jupyter notebooks. It is based on the Python library pandas-dedupe.
There has also been a bit of a discussion about creating a community plugin to provide support in this area.
https://community.dataiku.com/t5/Product-Ideas/Entity-Resolution-Record-Linkage-Plug-In/idi-p/19821
Hope that some of this is of some help. Please share a bit more about what you are trying to do; others may jump in and provide other insights.
Answers
-
Thank you @tgb417!!! Extremely helpful and much appreciated.