Remove Duplicates based on one column

Renga3037
Renga3037 Registered Posts: 9 ✭✭✭✭

Hi All

Here is my requirement.

I want to remove duplicate rows based on one column. Is there a way to do this with DSS recipes?

Please advise

Thanks in advance

Answers

  • GCase
    GCase Dataiker, PartnerAdmin, Registered Posts: 27 Dataiker

    @Renga3037
    In the "Distinct" visual recipe you can remove duplicates based either on all columns or on a subset of columns, including a single column. If you choose one column, the recipe returns only that column with its distinct values. If you need some logic (like keeping the first value according to a sort order), then you should look at the Window recipe, which lets you choose First, Last, Lag, etc.

  • Renga3037
    Renga3037 Registered Posts: 9 ✭✭✭✭

    @GCase
    Agreed, but I want all the columns in the output, not only the distinct column!

  • GCase
    GCase Dataiker, PartnerAdmin, Registered Posts: 27 Dataiker

    Can you be more specific?

    For example, assume ID is the column that should contain only unique values:

    ID  First_Name  Last_Name  Year_Entered
    1   Lebron      James      2004
    2   Michael     Jordan     1985
    2   Larry       Bird       1980

    Based on that list, what would the dataset you want returned look like?

    @Renga3037

  • Renga3037
    Renga3037 Registered Posts: 9 ✭✭✭✭

    @GCase

    Correct, I want unique IDs (removing duplicates).

    The result would look like this:

    1   Lebron      James      2004
    2   Michael     Jordan     1985

    Hope you get that

  • GCase
    GCase Dataiker, PartnerAdmin, Registered Posts: 27 Dataiker

    Why Michael Jordan? Was it because that was the first row or some other reason?

    Grant

    @Renga3037

  • Mehdi
    Mehdi Registered Posts: 2 ✭✭✭✭

    I think it's because he wants to keep the first row encountered for each distinct ID. @GCase

    Unfortunately, the distinct recipe as it is in DSS won't allow this.

    Two solutions come to mind:

    Convert this request to SQL, something like the following (DISTINCT ON is PostgreSQL syntax):

    SELECT DISTINCT ON (your_column) your_table.*
    FROM your_table
    ORDER BY your_column;  -- add a second sort key to control which row is kept

    OR

    Add a Window recipe beforehand to compute a rank column, then use that rank as a pre-filter in the Distinct recipe.

    @Renga3037
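For databases without DISTINCT ON, the Window-plus-filter idea above can be sketched with ROW_NUMBER(). This is only an illustration, not something posted in the thread: it uses Python's built-in sqlite3 module (window functions need SQLite 3.25 or newer), and the `players` table, the `seq` insertion-order column, and the function name are all hypothetical.

```python
import sqlite3

# Hypothetical sample data mirroring the example earlier in the thread.
ROWS = [
    (1, "Lebron", "James", 2004),
    (2, "Michael", "Jordan", 1985),
    (2, "Larry", "Bird", 1980),
]


def first_row_per_id(rows):
    """Keep one full row per id, choosing the earliest inserted row."""
    conn = sqlite3.connect(":memory:")
    # seq is an assumed helper column: it records insertion order so that
    # "first" has a well-defined meaning.
    conn.execute(
        "CREATE TABLE players (seq INTEGER PRIMARY KEY, id INTEGER, "
        "first_name TEXT, last_name TEXT, year_entered INTEGER)"
    )
    conn.executemany(
        "INSERT INTO players (id, first_name, last_name, year_entered) "
        "VALUES (?, ?, ?, ?)",
        rows,
    )
    query = """
        SELECT id, first_name, last_name, year_entered
        FROM (
            SELECT p.*,
                   ROW_NUMBER() OVER (PARTITION BY id ORDER BY seq) AS rn
            FROM players AS p
        )
        WHERE rn = 1   -- rank 1 = the row to keep for each id
        ORDER BY id
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in first_row_per_id(ROWS):
        print(row)
```

Changing the ORDER BY inside the OVER clause (for example to `year_entered`) changes which row counts as "first" for each ID.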

  • Tut
    Tut Registered Posts: 1 ✭✭✭

    I had the same problem... Too bad it was not answered. I don't understand why DSS keeps only the distinct columns...

  • Jurre
    Jurre Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS Core Concepts, Registered, Dataiku DSS Developer, Neuron 2022 Posts: 115 ✭✭✭✭✭✭✭

    Maybe a Group recipe can help? Group on the ID column and set the aggregation for the other columns to "first". That leaves you with unique values in the ID column, with everything but the first value of each set of duplicates filtered out.

  • Muhanned
    Muhanned Partner, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 2 Partner

    Yes, I agree with @Jurre. You can remove duplicates based on one column while keeping the data from all other columns in the output dataset by using the trick she explained with the "Group" recipe.

    Alternatively, you can use a code recipe; see the thread "Remove duplicate rows in one column" on Dataiku Community.

    For Dataiku developers: yes, I think it would be useful to update the "Distinct" recipe so that you can choose the recipe's output columns and achieve this task easily (the "Distinct" recipe is the first thing that comes to mind for such tasks).

    Just for clarification, the task is:

    - find unique combinations (based on one or more columns)

    - keep one row of each combination WITH ASSOCIATED data from other columns (i.e., keep the whole original row)
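The two steps above can be sketched in plain Python, for instance as the core of a code recipe. This is a minimal illustration under assumptions, not the code from the linked thread: rows are modeled as dictionaries, the function name `keep_first_per_key` is hypothetical, and "first" means first in input order.

```python
def keep_first_per_key(rows, key_columns):
    """Return the first full row seen for each combination of key_columns."""
    seen = set()
    kept = []
    for row in rows:
        # The key is the tuple of values in the chosen columns.
        key = tuple(row[col] for col in key_columns)
        if key not in seen:
            seen.add(key)
            kept.append(row)  # keep the whole original row
    return kept


# Sample data taken from the example earlier in the thread.
players = [
    {"ID": 1, "First_Name": "Lebron", "Last_Name": "James", "Year_Entered": 2004},
    {"ID": 2, "First_Name": "Michael", "Last_Name": "Jordan", "Year_Entered": 1985},
    {"ID": 2, "First_Name": "Larry", "Last_Name": "Bird", "Year_Entered": 1980},
]

if __name__ == "__main__":
    for row in keep_first_per_key(players, ["ID"]):
        print(row)
```

If the data is already in a pandas DataFrame, `df.drop_duplicates(subset=["ID"], keep="first")` gives the same result.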

  • Jurre
    Jurre Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS Core Concepts, Registered, Dataiku DSS Developer, Neuron 2022 Posts: 115 ✭✭✭✭✭✭✭

    Welcome @Muhanned!

    If you have a clear idea for improving the Distinct recipe, please share it on the product ideas pages. We can't expect our dear Dataikers to scan every conversation for possible improvement proposals.

    Personally, I'm a big fan of grouping together with options like concat and its suboption "concat distinct". With a subsequent Prepare recipe, the values in such a concatenated column can be split out again. But there are multiple ways to get solid results.

  • Muhanned
    Muhanned Partner, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 2 Partner

    Great! Thanks for sharing the link @Jurre

  • Ioannis
    Ioannis Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 28 ✭✭✭✭✭

    Same issue here. Is there any update on this?

  • mahesh059
    mahesh059 Registered Posts: 1

    Hi All, I need to find duplicate projects in the UI along with their owner information, so that I can ask the owners to delete their duplicate projects and save disk space. Is there a way to do this?

    I can see the two options below. My Dataiku version is 11.0.3.

    Administration > Monitoring > Summary > scroll down to check all the project names. Here we can see neither the project owner information nor the duplicates.

    Administration > Projects > click on each project and check whether it is duplicated, along with its owner or who created the project.

    Is there a simpler way to filter these down? Please suggest.

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,032 Neuron

    Please start a new thread, as your question is unrelated to the topic of this one.
