Remove duplicate rows with two reference columns

Amarilla
Level 2
Remove duplicate rows with two reference columns

Hey DSS Community!

I would like to know how to remove the duplicate number of lines and taking into account two reference columns?

It seems to me that it is necessary to use a python formula but I am not sufficiently trained in this language. Is this the only possibility?

Thanks in advance: D

0 Kudos
8 Replies
Andrey
Dataiker Alumni

Hi,

Depending on what kind of result you're trying to achieve you may try:

- "Distinct" recipe which allows to select the columns that you want to be distinct. In this case the output schema will only consist of the columns you chose

- "Group" recipe where you group on selected keys and chose for example "First" for the rest of the columns. In this case the output schema will contain all of the columns.

Regards

Andrey Avtomonov
R&D Engineer @ Dataiku
0 Kudos
Amarilla
Level 2
Author

Thanks for your feedback!

I think the "Dinstict" recipe is what I want to do.

I had already tried it and when I left it, the recipe overloaded a lot of data from my dataset.

I tried without adding anything and adding filters but the results are the same 😞

0 Kudos

Hi @Amarilla,

What do you mean by "overloaded a lot of data". Are you getting any results or the recipe is failing?

Cheers!

0 Kudos
Amarilla
Level 2
Author

Dear @Ignacio_Toledo ,

I get many results however they are not complete because data is deleted during the realization of the recipe

Thank you!

Jurre
Level 5

Hi @Amarilla , 

I find  @Andrey 's "Group by" suggestion very helpful for removing and doublechecking duplicates. Group on the column that needs to hold distinct values, the "concat" option gives you a column in which reference-values can be concatenated.. Make sure to set a suitable separator and possibly check "concat distinct". See the attached picture for a clue where to find those options.

Hope this helps! 

0 Kudos
Amarilla
Level 2
Author

Hi @Jurre 

Thanks for your return, i have tried this solution and actually maybe it can work.

How to select all the output columns? Because I only get the one from my group key

Thank you in advance 🙂

0 Kudos
Jurre
Level 5

Hi @Amarilla ,

Last screenshot was part of data i cannot share here, so a better one is attached now. It shows the screen you see when setting up the GroupBy-processor. Under "Per Field aggregations" the columns of the dataset show up. As you can see i selected a number of columns by activating the Avg-option, those will show up in the resulting dataset.

Because grouping is a form of aggregation, you should specify what to do with the values in your columns. @Andrey suggested "First" , which just returns the first value the processor encounters. It depends on your specific data-wishes what to select here, so play around with it a little to get the result you want. After all : you can always go back and change what you specified earlier. 

Hope this helps, have fun!

 

 

Amarilla
Level 2
Author

Thank you very much, so I will play with these different possibilities and see what I get out of them.

Thank you again for this response!