Duplicate rows: need to remove or replace value

ESoto
ESoto Registered Posts: 15

My rows are repeating information over and over again because I now have two columns that contain a computer name. One of the columns has different computer names (information from another database), and because of this the results are duplicated: a row is added for each computer name that differs.

I need to get rid of the duplicates (for example, the rows repeat every 3 lines, so if a user has only 3 applications, those applications are repeated over and over again). I cannot paste any screenshots for security reasons, but here is an example for one user (the other database shows all the devices a user has but does not know about the applications used):

computer name   device name   application      usage session
sseisssss       sseisssss     photoshop 2023   3
                xdeddede      photoshop 2023   3

Basically, because there is a different computer name, the output repeats that the user has photoshop 2023, which I do not want, and since the row repeats, it also duplicates the number of sessions the user has for the application. I tried the Distinct recipe, but I need to keep all the rows and columns, and it only outputs the distinct column. I am also not sure how to accomplish this in the Group recipe.

I do see that there is a Python Function option to remove the duplicates of one column. How would I do it for all columns? Any help would be appreciated, as I cannot move on to complete my analysis without solving this. Thank you.
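
For the "all columns" part of the question, here is a minimal pandas sketch, assuming the dataset has been loaded into a DataFrame (for instance inside a Python recipe). The column names and values come from the example above; treating `xdeddede` as a device name and repeating the computer name on the second row are assumptions. Called with no arguments, `drop_duplicates` compares every column, while `subset` restricts the comparison to the listed columns.

```python
import pandas as pd

# Sample data modelled on the example in the post; the layout of row 2 is an assumption.
df = pd.DataFrame({
    "computer name": ["sseisssss", "sseisssss"],
    "device name": ["sseisssss", "xdeddede"],
    "application": ["photoshop 2023", "photoshop 2023"],
    "usage session": [3, 3],
})

# Compare every column: both rows survive because "device name" differs.
all_columns = df.drop_duplicates()

# Compare only the columns that repeat: the second row is dropped entirely.
repeat_cols = ["computer name", "application", "usage session"]
subset_only = df.drop_duplicates(subset=repeat_cols, keep="first")

print(len(all_columns), len(subset_only))
```

Neither variant keeps the extra device while also removing the repeated session count, so dropping rows alone may not fit this case.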


Operating system used: Windows 11 Enterprise

Answers

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,708 Neuron

    I don't really know how you want to deduplicate this data. But assuming you want to keep the first row, you can do a max(computer name), remove device name as a column, and group by all the other columns to get a distinct value. A Group recipe should do this.
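
The grouping suggested in this reply can be sketched in pandas as a rough equivalent. This is not the Group recipe itself; the sample values come from the original post, and a filled-in computer name on every row is an assumption.

```python
import pandas as pd

# Sample data modelled on the thread's example; a filled-in
# "computer name" on every row is an assumption.
df = pd.DataFrame({
    "computer name": ["sseisssss", "sseisssss"],
    "device name": ["sseisssss", "xdeddede"],
    "application": ["photoshop 2023", "photoshop 2023"],
    "usage session": [3, 3],
})

# Drop "device name", group by the remaining columns,
# and keep max("computer name") for each group.
grouped = (
    df.drop(columns=["device name"])
      .groupby(["application", "usage session"], as_index=False)
      .agg({"computer name": "max"})
)

print(grouped)
```

With real data covering many users, the group keys would presumably also need whatever column identifies the user.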

  • ESoto
    ESoto Registered Posts: 15

    I would rather not remove the device name column, because I need it to show that there are possibly other devices that are not being accounted for but should be in the future. So I really just want to remove the duplicates in the application and usage session columns, while the device name column stays put.

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,708 Neuron

    Show us how you expect the deduplicated row to look. What exact values and columns do you expect to see? How would you have a row with two device names?

  • ESoto
    ESoto Registered Posts: 15

    I would basically need this, so that it accurately reflects the information:

    computer name   device name   application      usage session
    sseisssss       sseisssss     photoshop 2023   3
                    xdeddede

    Or, if the duplicate cannot be removed, then I need the 3 replaced with a different value:

    computer name   device name   application      usage session
    sseisssss       sseisssss     photoshop 2023   3
                    xdeddede      photoshop 2023   N/A

    That way it keeps the information that the user has multiple devices without compromising the integrity of the data. I do not mind manually changing the 3 to N/A wherever I see it repeating; I would just need a way to do this other than exporting the data and editing it in Excel.
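
Both layouts described in this post can be approximated in pandas by keeping every row and overwriting only the repeated values. The following is a minimal sketch, not a confirmed Dataiku solution: `mask_repeats` is a hypothetical helper, the `user` column is a hypothetical key standing in for whatever identifies a user in the real dataset, and a filled-in computer name on every row is an assumption.

```python
import pandas as pd

def mask_repeats(df, key_cols, value_cols, fill=""):
    """Keep every row, but overwrite value_cols on rows that repeat key_cols."""
    out = df.copy()
    # True for the second and later rows that share the same key values.
    is_repeat = out.duplicated(subset=key_cols, keep="first")
    for col in value_cols:
        # Cast to object so a text placeholder such as "N/A" can sit next to numbers.
        out[col] = out[col].astype(object)
        out.loc[is_repeat, col] = fill
    return out

df = pd.DataFrame({
    "user": ["u1", "u1"],  # hypothetical key column, not in the original example
    "computer name": ["sseisssss", "sseisssss"],
    "device name": ["sseisssss", "xdeddede"],
    "application": ["photoshop 2023", "photoshop 2023"],
    "usage session": [3, 3],
})

# Second layout from the post: keep the application, replace the repeated count.
with_na = mask_repeats(df, ["user", "application"], ["usage session"], fill="N/A")

# First layout from the post: blank everything except the device name.
blanked = mask_repeats(
    df, ["user", "application"],
    ["computer name", "application", "usage session"],
)

print(with_na)
print(blanked)
```

A text placeholder such as N/A means the usage session column is no longer purely numeric, so a missing value may be easier to work with if the sessions are summed later.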
