Custom Grouping or Binning

Solved!
cdutoit
Level 3
Custom Grouping or Binning

Hi - I'm trying to get advice on creating custom grouping 'bins'.

My raw data file is essentially a list of companies that include employee count and corresponding revenue. I would like to see a breakdown of revenue by employee count, in defined groups/bins/brackets.

Clearly I can easily run a Group By recipe to get the total revenue for each group of employees:

Screen Shot 2021-10-16 at 10.12.38 AM.png

That's all good but I want to group it by a custom set of brackets or bins.

For example, in groups of Employee count of 1-10, 10-50, 50-200 (different ranges):

Screen Shot 2021-10-16 at 10.13.53 AM.png

Ideally at the "data" level instead of using the Table Chart, because the table Chart does not allow custom binning ranges, and also does not allow customizing the sort order.

So ideally trying to figure out the recipe needed to create a new dataset that is custom grouped by employee "ranges" instead of at each distinct employee count.

Any pointers greatly appreciated.

 

0 Kudos
1 Solution
tgb417

@cdutoit ,

In thinking about your challenge I was thinking about using a join recipe.

First you might create a 3 column table, something like this

min_valmax_valdollar_range_name
01000000โ€œ$0 - $1Mโ€
1000000.0110000000โ€œ$1M - $10Mโ€
10000000.0150000000โ€œ$10M - $50Mโ€

 

Note you need to be careful not to have overlapping range values in this table.

Then you could use a join recipe to join the table above to the data you are showing in the first table.

There would be two conditions of your join conditions

revenue_aggregate >= min_val

revenue_aggregate <= max_val

If done carefully this can allow you to add a new categorical column dollar_range_name to each row of your data. Giving you the custom dollar values ranges shown at the top of the final table you are showing.  

The same approach with a different table would allow you to add an employee count custom range as well.  

What is not clear to me is the values in the cells of your final results.  Is it something like an average across all of the organizations with similar employee count range and revenue range?  If so this could be done by another group by recipe.  

Just some thoughts that I hope are useful to you.  Please let us know how you are getting on with this.  Others please jump in with other ideas. There are of course many many ways to do this kind of thing.  

P.S. You could also use a visual recipe and a formula step that uses some nested if statements to add a column with the custom range for revenue.  And the same approach for an employee count custom range.  Depending on the number of records you are dealing with and how often these ranges might change, the formula steps might be easier.  Whatever you do the goal is to add two columns with these custom ranges.  The next challenge will be sorting the ranges for presentation. 

--Tom

View solution in original post

6 Replies
tgb417

@cdutoit ,

In thinking about your challenge I was thinking about using a join recipe.

First you might create a 3 column table, something like this

min_valmax_valdollar_range_name
01000000โ€œ$0 - $1Mโ€
1000000.0110000000โ€œ$1M - $10Mโ€
10000000.0150000000โ€œ$10M - $50Mโ€

 

Note you need to be careful not to have overlapping range values in this table.

Then you could use a join recipe to join the table above to the data you are showing in the first table.

There would be two conditions of your join conditions

revenue_aggregate >= min_val

revenue_aggregate <= max_val

If done carefully this can allow you to add a new categorical column dollar_range_name to each row of your data. Giving you the custom dollar values ranges shown at the top of the final table you are showing.  

The same approach with a different table would allow you to add an employee count custom range as well.  

What is not clear to me is the values in the cells of your final results.  Is it something like an average across all of the organizations with similar employee count range and revenue range?  If so this could be done by another group by recipe.  

Just some thoughts that I hope are useful to you.  Please let us know how you are getting on with this.  Others please jump in with other ideas. There are of course many many ways to do this kind of thing.  

P.S. You could also use a visual recipe and a formula step that uses some nested if statements to add a column with the custom range for revenue.  And the same approach for an employee count custom range.  Depending on the number of records you are dealing with and how often these ranges might change, the formula steps might be easier.  Whatever you do the goal is to add two columns with these custom ranges.  The next challenge will be sorting the ranges for presentation. 

--Tom
cdutoit
Level 3
Author

@tgb417 Ahhh...thank you so much for pointing me in the right direction. Not sure why that escaped me. I went with your Formula approach and that works:

 

Screen Shot 2021-10-16 at 2.24.50 PM.png

And yes, you're right the sorting is messed up for presentation purposes. I suppose one could prefix each with a "A", "B" etc to force a sort order or something.

Thanks so much for your reply and getting me steered in the right direction.

cdutoit
Level 3
Author

@tgb417 In the "more than 1 way to skin a cat" bucket, as I was working through your suggestion, it occurred to me one could also use the Bin step in a recipe since it, and I didn't know, actually supports custom binning:

 

Screen Shot 2021-10-16 at 2.33.53 PM.png

Thanks again for breaking through my mental block!

tgb417

Take a quick look at the bining step above.  You appear to have an overlapping ranges and bin name mismatch on the second row of inputs.

--Tom
0 Kudos
tgb417

One can sometimes add a rank or sort order column for labels as a separate column.  Making a header like below.  You would sort on the Rank Value.

  1234
  $0-$1M$1M-$10M$10M - $50M$50M - $100M
11-10    
210-25    
325-50    

 

Finally there are some systems that have ordered categorical values. The R language calls these factors.  One can install R into Dataiku DSS.  (In my experience it has been a bit tricky.)  Iโ€™m not clear if generally DSS currently supports ordered categorical variables.  

--Tom
0 Kudos
tgb417

Since the original writing of my post here Dataiku has added the visual recipe step โ€œcreate if, then, elseโ€ that may also be helpful for this type of use case.

https://doc.dataiku.com/dss/latest/preparation/processors/create-if-then-else.html 

--Tom
0 Kudos