Create a dataset which has only duplicate rows in it.
Hi Team,
I have a use case where I have to keep all the duplicated row of the dataset based on 5 columns. if any of the rows in these 5 columns are duplicated I should keep them in a single dataset. Can you please let me know on how we can achieve this using the visual recipes in the DataIKU.
Thanks in advance !
Best Answer
-
tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,598 Neuron
I hope you are doing well today.
I've recently been working on both simple and complex record linkage scenarios.
One of the quick ways to find rows that appear more than once in your database is to use the Distinct Visual recipe. This recipe allows you to pick the needed columns. As a result, this will give you a dataset of all of the unique combinations of the 5 columns, and importantly the number of times the combinations appear.
You can then use a visual Join recipe to:
1. filter to count > 1
2 you need to do an inner join on all of the columns you care about. This is done by clicking the "=" on the line between the two data sets.
2a. On the dialog box Choose your columns to match
2b. On the same dialog box choose that you want to do an "inner Join"
You should then be good to run the second visual Join Recipe. You just end up with the rows that are duplicated on both datasets.
Hope this helps. Let the community know how you get on with your project.
Ultimately, your flow might end up looking something like this.