Counting Rows

davidkubica1
davidkubica1 Registered Posts: 2 ✭✭✭

I have a dataset. Four of the columns are Location Number, Employee ID, Date, and Symbol. I would like to create a 5th column that contains counts of how many times a row with matching data in all four columns appears. For example, there are 5 rows with Location Number=308, Employee ID=37, Date=03/23/2007 and Symbol=TP. The 5th column therefore is "5".

Any suggestions on how I could go about doing this?

Thanks in advance!

Answers

  • tim-wright
    tim-wright Partner, L2 Designer, Snowflake Advanced, Neuron 2020, Registered, Neuron 2021, Neuron 2022 Posts: 77 Partner

    @davidkubica1
    I think you should be able to accomplish this using a GroupBy Recipe followed by a join Recipe.

    1. Group your dataset with the visual group by recipe on columns: Location Number, Employee ID, Date, and Symbol. Use "count" as the aggregation metric. This should result in a dataset that has the number of times each combination of those values occurs (with one row per unique combination of your group by keys in the output).

    2. Left join the output of the GroupBy with your original dataset.

    I think that should give you what you're after.

    I'm not 100% knowledgeable about how the group by/joins work when using dates. Depending on if your original date column is an actual date (or a date represented as a string) you may have to convert those to string before step 1 and then can convert back to Date after step 2.

    -Tim

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,598 Neuron

    @davidkubica1

    Welcome to the Dataiku community.

    I'm not at my desktop computer at the moment, so I can’t send along any pictures or anything.

    However, you might check out the group by visual recipe.

    There is an option where you can select the 4 columns you want to “group by”.

    Then select the check box that will tell you the number of time the set of 4 columns appears.

    Run the recipe. This will give you fewer rows of data than was in your original dataset. One for each unique set of 4 columns and the count of those occurrences.

    If you need to have all of the original rows with this added as an extra column. You should be able to use a join recipient to connect this summary count back to each of the original records.

    There may be other ways to get this done more efficiently. Others please feel free to jump in with other ideas.

    Please let us know how you get on with this and welcome to the community.

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,598 Neuron

    @tim-wright

    great point about dates. Before doing the group by you may need to trim any time component from any date time stamps you are using in the group by.

Setup Info
    Tags
      Help me…