Sign up to take part
Registered users can ask their own questions, contribute to discussions, and be part of the Community!
Registered users can ask their own questions, contribute to discussions, and be part of the Community!
Hi Team,
I have a use case where I have to keep all the duplicated row of the dataset based on 5 columns. if any of the rows in these 5 columns are duplicated I should keep them in a single dataset. Can you please let me know on how we can achieve this using the visual recipes in the DataIKU.
Thanks in advance !
I hope you are doing well today.
I've recently been working on both simple and complex record linkage scenarios.
One of the quick ways to find rows that appear more than once in your database is to use the Distinct Visual recipe. This recipe allows you to pick the needed columns. As a result, this will give you a dataset of all of the unique combinations of the 5 columns, and importantly the number of times the combinations appear.
Screen shot of the Distinct Visual recipe. Note that the check box "Compute count of original rows for each deduplicated output row:" has been checked.
You can then use a visual Join recipe to:
1. filter to count > 1
In the prefilter part of the Join recipy the distinct dataset is filtered to count > 1
2 you need to do an inner join on all of the columns you care about. This is done by clicking the "=" on the line between the two data sets.
Join recipie screen showing how to change the Join type. On the Join tab. To change the join type. Click on the โ=โ sign between the two connection fields for your join.
2a. On the dialog box Choose your columns to match
On the Join columns dialog box this shows how to use the + icon to setup any number of columns to match on.
2b. On the same dialog box choose that you want to do an "inner Join"
This show the join dialog box and the second tab that allows one to choose an inner Join not a left join .
You should then be good to run the second visual Join Recipe. You just end up with the rows that are duplicated on both datasets.
Hope this helps. Let the community know how you get on with your project.
Ultimately, your flow might end up looking something like this.
Dataiku flow connecting a Contact Data Source to a Distinct dataset, then joining the Distinct dataset to the original dataset to find only duplicated lines.
I hope you are doing well today.
I've recently been working on both simple and complex record linkage scenarios.
One of the quick ways to find rows that appear more than once in your database is to use the Distinct Visual recipe. This recipe allows you to pick the needed columns. As a result, this will give you a dataset of all of the unique combinations of the 5 columns, and importantly the number of times the combinations appear.
Screen shot of the Distinct Visual recipe. Note that the check box "Compute count of original rows for each deduplicated output row:" has been checked.
You can then use a visual Join recipe to:
1. filter to count > 1
In the prefilter part of the Join recipy the distinct dataset is filtered to count > 1
2 you need to do an inner join on all of the columns you care about. This is done by clicking the "=" on the line between the two data sets.
Join recipie screen showing how to change the Join type. On the Join tab. To change the join type. Click on the โ=โ sign between the two connection fields for your join.
2a. On the dialog box Choose your columns to match
On the Join columns dialog box this shows how to use the + icon to setup any number of columns to match on.
2b. On the same dialog box choose that you want to do an "inner Join"
This show the join dialog box and the second tab that allows one to choose an inner Join not a left join .
You should then be good to run the second visual Join Recipe. You just end up with the rows that are duplicated on both datasets.
Hope this helps. Let the community know how you get on with your project.
Ultimately, your flow might end up looking something like this.
Dataiku flow connecting a Contact Data Source to a Distinct dataset, then joining the Distinct dataset to the original dataset to find only duplicated lines.