We're excited to announce that we're launching the second installment of Dataiku Product Days Register Now

Counting Rows

davidkubica1
Level 1
Counting Rows

I have a dataset. Four of the columns are Location Number, Employee ID, Date, and Symbol. I would like to create a 5th column that contains counts of how many times a row with matching data in all four columns appears. For example, there are 5 rows with Location Number=308, Employee ID=37, Date=03/23/2007 and Symbol=TP. The 5th column therefore is "5".

Any suggestions on how I could go about doing this?

Thanks in advance!

0 Kudos
3 Replies
tim-wright
Neuron
Neuron

@davidkubica1 I think you should be able to accomplish this using a GroupBy Recipe followed by a join Recipe.

1. Group your dataset with the visual group by recipe on columns:  Location Number, Employee ID, Date, and Symbol. Use "count" as the aggregation metric. This should result in a dataset that has the number of times each combination of those values occurs (with one row per unique combination of your group by keys in the output).

2.  Left join the output of the GroupBy with your original dataset.

I think that should give you what you're after.

I'm not 100% knowledgeable about how the group by/joins work when using dates. Depending on if your original date column is an actual date (or a date represented as a string) you may have to convert those to string before step 1 and then can convert back to Date after step 2. 

-Tim

tgb417
Neuron
Neuron

@tim-wright 

great point about dates.  Before doing the group by you may need to trim any time component from any date time stamps you are using in the group by.

--Tom
0 Kudos
tgb417
Neuron
Neuron

@davidkubica1 

Welcome to the Dataiku community.

I'm not at my desktop computer at the moment, so I can’t send along any pictures or anything.  

However, you might check out the group by visual recipe.

There is an option where you can select the 4 columns you want to “group by”.

Then select the check box that will tell you the number of time the set of 4 columns appears.  

Run the recipe.  This will give you fewer rows of data than was in your original dataset. One for each unique set of 4 columns and the count of those occurrences.  

If you need to have all of the original rows with this added as an extra column.  You should be able to use a join recipient to connect this summary count back to each of the original records.

There may be other ways to get this done more efficiently. Others please feel free to jump in with other ideas.  

Please let us know how you get on with this and welcome to the community.  

--Tom
0 Kudos
A banner prompting to get Dataiku DSS