Create a dataset which has only duplicate rows in it.

Options
Tsurapaneni
Tsurapaneni Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Registered Posts: 41 ✭✭✭✭

Hi Team,

I have a use case where I have to keep all the duplicated row of the dataset based on 5 columns. if any of the rows in these 5 columns are duplicated I should keep them in a single dataset. Can you please let me know on how we can achieve this using the visual recipes in the DataIKU.

Thanks in advance !

Best Answer

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,595 Neuron
    Answer ✓
    Options

    @Tsurapaneni
    ,

    I hope you are doing well today.

    I've recently been working on both simple and complex record linkage scenarios.

    One of the quick ways to find rows that appear more than once in your database is to use the Distinct Visual recipe. This recipe allows you to pick the needed columns. As a result, this will give you a dataset of all of the unique combinations of the 5 columns, and importantly the number of times the combinations appear.

    Visual Distinct.png

    You can then use a visual Join recipe to:

    1. filter to count > 1

    Join with distinct group filtered to Count greater than 1.png

    2 you need to do an inner join on all of the columns you care about. This is done by clicking the "=" on the line between the two data sets.

    To setup an inner join.png

    2a. On the dialog box Choose your columns to match

    Setting up columns to match on.png

    2b. On the same dialog box choose that you want to do an "inner Join"

    Choosing an Inner Join.png

    You should then be good to run the second visual Join Recipe. You just end up with the rows that are duplicated on both datasets.

    Hope this helps. Let the community know how you get on with your project.

    Ultimately, your flow might end up looking something like this.

    Find Duplicate Flow.png

Answers

Setup Info
    Tags
      Help me…