Is there a way to conditionally delete duplicates based on some key?

UserBird
UserBird Dataiker, Alpha Tester Posts: 535 Dataiker
Hello,

I have some duplicated rows on some key (in this case a phone number). At the very least, I wanted to flag all but the most recent. In an ideal situation, I wanted to flag the rest based on several conditions.

Is this possible in DSS out of the box? If not, what would be the appropriate steps to take?

Thank you for your support.

PS: I'm fairly new to DSS hence the question.
Tagged:

Best Answer

  • AdrienL
    AdrienL Dataiker, Alpha Tester Posts: 196 Dataiker
    Answer ✓

    Hi,

    One way is the window recipe:

    • In the window definition, you partition by phone number and order by descending date
    • In the aggregation, you select Row number in the top "compute rank" part
    • In the post-filter, you add a filter to Keep only the rows that satisfies the condition rownumber == 1

    In the post filter, you can add other conditions at will. In the output tab, you can remove the rownumber column if yo don't need it afterwards.

Answers

  • jrouquie
    jrouquie Dataiker Alumni Posts: 87 ✭✭✭✭✭✭✭

    Depending on your use case, there are several possibilities. For a start, I would use a visual “group recipe”. It allows to keep only the line with the most recent timestamp.

    - group by phone number

    - in the recipe settings, on the line “phone number”, click “last, ordered by: <column containing time>”:

    If you need to keep all lines, once you have selected the most recent phone number into a new dataset, you could also join this new with the original one: any line of the original dataset with a match in the new dataset is a correct phone number, any other line is to be flagged.

    If you need to keep all lines, a window recipe should also do the trick, but this is more advanced.

  • UserBird
    UserBird Dataiker, Alpha Tester Posts: 535 Dataiker
    Thanks. I'll give this a try later today
  • UserBird
    UserBird Dataiker, Alpha Tester Posts: 535 Dataiker
    Thanks. I'll give this a try later today
  • UserBird
    UserBird Dataiker, Alpha Tester Posts: 535 Dataiker
    Hi Adrian,

    How do I get rid of the rank column in the output section? I'm currently using DSS 2.2.

    Thanks
  • AdrienL
    AdrienL Dataiker, Alpha Tester Posts: 196 Dataiker
    Err, my bad, I thought you could unselect some columns. This is not currently possible directly in the recipe.

    You can either convert it to an SQL recipe and alter it directly if you're the coding type, else you'll have to append a preparation recipe where you delete that column.

    Or, if you work directly on the resulting dataset with an Analysis (see DSS concepts: http://learn.dataiku.com/howto/getting-started/dss-concepts/universes-and-concepts.html ) (e.g. to build a model, make charts, etc.), dropping the column could be the first step in you analysis' preparation script.
  • UserBird
    UserBird Dataiker, Alpha Tester Posts: 535 Dataiker
    I had a feeling :). Thanks again.
Setup Info
    Tags
      Help me…