Sign up to take part
Registered users can ask their own questions, contribute to discussions, and be part of the Community!
Registered users can ask their own questions, contribute to discussions, and be part of the Community!
Hi Team,
I have a use case where I have to keep all the duplicated row of the dataset based on 5 columns. if any of the rows in these 5 columns are duplicated I should keep them in a single dataset. Can you please let me know on how we can achieve this using the visual recipes in the DataIKU.
Thanks in advance !
I hope you are doing well today.
I've recently been working on both simple and complex record linkage scenarios.
One of the quick ways to find rows that appear more than once in your database is to use the Distinct Visual recipe. This recipe allows you to pick the needed columns. As a result, this will give you a dataset of all of the unique combinations of the 5 columns, and importantly the number of times the combinations appear.
You can then use a visual Join recipe to:
1. filter to count > 1
2 you need to do an inner join on all of the columns you care about. This is done by clicking the "=" on the line between the two data sets.
2a. On the dialog box Choose your columns to match
2b. On the same dialog box choose that you want to do an "inner Join"
You should then be good to run the second visual Join Recipe. You just end up with the rows that are duplicated on both datasets.
Hope this helps. Let the community know how you get on with your project.
Ultimately, your flow might end up looking something like this.
I hope you are doing well today.
I've recently been working on both simple and complex record linkage scenarios.
One of the quick ways to find rows that appear more than once in your database is to use the Distinct Visual recipe. This recipe allows you to pick the needed columns. As a result, this will give you a dataset of all of the unique combinations of the 5 columns, and importantly the number of times the combinations appear.
You can then use a visual Join recipe to:
1. filter to count > 1
2 you need to do an inner join on all of the columns you care about. This is done by clicking the "=" on the line between the two data sets.
2a. On the dialog box Choose your columns to match
2b. On the same dialog box choose that you want to do an "inner Join"
You should then be good to run the second visual Join Recipe. You just end up with the rows that are duplicated on both datasets.
Hope this helps. Let the community know how you get on with your project.
Ultimately, your flow might end up looking something like this.