Remove duplicate rows with two reference columns

Amarilla Registered Posts: 11 ✭✭✭✭

Hey DSS Community!

I would like to know how to remove duplicate rows while taking two reference columns into account.

It seems to me that a Python formula is necessary, but I am not sufficiently trained in this language. Is this the only possibility?

Thanks in advance :D

Answers

  • Andrey
    Andrey Dataiker Alumni Posts: 119 ✭✭✭✭✭✭✭

    Hi,

    Depending on what kind of result you're trying to achieve you may try:

    - A "Distinct" recipe, which lets you select the columns that should be distinct. In this case the output schema will only contain the columns you chose.

    - A "Group" recipe, where you group on the selected keys and choose, for example, "First" for the rest of the columns. In this case the output schema will contain all of the columns.
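    If you do end up reaching for Python instead, both approaches can be mimicked in a few lines of pandas. This is a minimal sketch with hypothetical column names (`ref_a`, `ref_b`, `value`), not the recipes themselves:

    ```python
    import pandas as pd

    # Toy dataset with duplicates on the two reference columns (names are made up).
    df = pd.DataFrame({
        "ref_a": [1, 1, 2, 2],
        "ref_b": ["x", "x", "y", "z"],
        "value": [10, 20, 30, 40],
    })

    # "Distinct"-style: keep only unique combinations of the reference columns;
    # the output contains just those columns.
    distinct = df[["ref_a", "ref_b"]].drop_duplicates()

    # "Group"-style: group on the reference columns and take the first value of
    # every other column, so the output keeps the full schema.
    grouped = df.groupby(["ref_a", "ref_b"], as_index=False).first()
    ```

    Both outputs have one row per `(ref_a, ref_b)` pair; the difference is only in which columns survive.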

    Regards

  • Amarilla
    Amarilla Registered Posts: 11 ✭✭✭✭

    Thanks for your feedback!

    I think the "Distinct" recipe is what I want.

    I had already tried it, but when I ran it the recipe removed a lot of data from my dataset.

    I tried both without adding anything and with filters, but the results are the same.

  • Jurre
    Jurre Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS Core Concepts, Registered, Dataiku DSS Developer, Neuron 2022 Posts: 114 ✭✭✭✭✭✭✭

    Hi @Amarilla,

    I find @Andrey's "Group by" suggestion very helpful for removing and double-checking duplicates. Group on the column that needs to hold distinct values; the "concat" option gives you a column in which reference values are concatenated. Make sure to set a suitable separator, and possibly check "concat distinct". See the attached picture for a clue where to find those options.
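    For readers who prefer code, the same "group and concat distinct" idea can be sketched in pandas (column names `key` and `ref` are illustrative assumptions, and ", " stands in for the separator you would configure in the recipe):

    ```python
    import pandas as pd

    df = pd.DataFrame({
        "key": ["a", "a", "b"],
        "ref": ["r1", "r2", "r1"],
    })

    # Group on the key column; join the distinct reference values per group with a
    # separator, mirroring the recipe's "concat" aggregation with "concat distinct".
    out = (
        df.groupby("key", as_index=False)
          .agg(refs=("ref", lambda s: ", ".join(sorted(s.unique()))))
    )
    ```

    The `refs` column then lets you eyeball which reference values were merged into each deduplicated row.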

    Hope this helps!

  • Ignacio_Toledo
    Ignacio_Toledo Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 411 Neuron

    Hi @Amarilla,

    What do you mean by "overloaded a lot of data"? Are you getting any results, or is the recipe failing?

    Cheers!

  • Amarilla
    Amarilla Registered Posts: 11 ✭✭✭✭

    Dear @Ignacio_Toledo,

    I get many results, but they are incomplete because data is deleted when the recipe runs.

    Thank you!

  • Amarilla
    Amarilla Registered Posts: 11 ✭✭✭✭

    Hi @Jurre,

    Thanks for your reply! I have tried this solution, and it might actually work.

    How do I select all the output columns? I only get the one from my group key.

    Thank you in advance

  • Jurre
    Jurre Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS Core Concepts, Registered, Dataiku DSS Developer, Neuron 2022 Posts: 114 ✭✭✭✭✭✭✭

    Hi @Amarilla,

    The last screenshot was part of data I cannot share here, so a better one is attached now. It shows the screen you see when setting up the Group recipe. Under "Per field aggregations", the columns of the dataset show up. As you can see, I selected a number of columns by activating the "Avg" option; those will appear in the resulting dataset.

    Because grouping is a form of aggregation, you need to specify what to do with the values in your columns. @Andrey suggested "First", which simply returns the first value the processor encounters. Which aggregation to select depends on what you want from your data, so play around with it a little to get the result you want. After all, you can always go back and change what you specified earlier.
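    To make the difference between aggregations concrete, here is a small pandas sketch (hypothetical columns `ref_a`, `ref_b`, `amount`) comparing "First" with "Avg" per group:

    ```python
    import pandas as pd

    df = pd.DataFrame({
        "ref_a": [1, 1, 2],
        "ref_b": ["x", "x", "y"],
        "amount": [10.0, 20.0, 30.0],
    })

    # "First" keeps the first value seen per group; "Avg" averages the numeric
    # column instead. Both collapse the duplicates on (ref_a, ref_b).
    agg = df.groupby(["ref_a", "ref_b"], as_index=False).agg(
        first_amount=("amount", "first"),
        avg_amount=("amount", "mean"),
    )
    ```

    For the duplicated group `(1, "x")`, "First" keeps 10.0 while "Avg" yields 15.0, which is exactly the kind of choice the per-field aggregation panel asks you to make.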

    Hope this helps, have fun!

  • Amarilla
    Amarilla Registered Posts: 11 ✭✭✭✭

    Thank you very much! I will play with these different possibilities and see what I get out of them.

    Thank you again for this response!
