Custom Grouping or Binning

cdutoit · ‎10-16-2021

Hi - I'm trying to get advice on creating custom grouping 'bins'.

My raw data file is essentially a list of companies that include employee count and corresponding revenue. I would like to see a breakdown of revenue by employee count, in defined groups/bins/brackets.

Clearly I can easily run a Group By recipe to get the total revenue for each group of employees:

That's all good but I want to group it by a custom set of brackets or bins.

For example, in groups of Employee count of 1-10, 10-50, 50-200 (different ranges):

Ideally at the "data" level instead of using the Table Chart, because the table Chart does not allow custom binning ranges, and also does not allow customizing the sort order.

So ideally trying to figure out the recipe needed to create a new dataset that is custom grouped by employee "ranges" instead of at each distinct employee count.

Any pointers greatly appreciated.

tgb417 · ‎10-16-2021

@cdutoit ,

In thinking about your challenge I was thinking about using a join recipe.

First you might create a 3 column table, something like this

min_val	max_val	dollar_range_name
0	1000000	“$0 - $1M”
1000000.01	10000000	“$1M - $10M”
10000000.01	50000000	“$10M - $50M”

Note you need to be careful not to have overlapping range values in this table.

Then you could use a join recipe to join the table above to the data you are showing in the first table.

There would be two conditions of your join conditions

revenue_aggregate >= min_val

revenue_aggregate <= max_val

If done carefully this can allow you to add a new categorical column dollar_range_name to each row of your data. Giving you the custom dollar values ranges shown at the top of the final table you are showing.

The same approach with a different table would allow you to add an employee count custom range as well.

What is not clear to me is the values in the cells of your final results. Is it something like an average across all of the organizations with similar employee count range and revenue range? If so this could be done by another group by recipe.

Just some thoughts that I hope are useful to you. Please let us know how you are getting on with this. Others please jump in with other ideas. There are of course many many ways to do this kind of thing.

P.S. You could also use a visual recipe and a formula step that uses some nested if statements to add a column with the custom range for revenue. And the same approach for an employee count custom range. Depending on the number of records you are dealing with and how often these ranges might change, the formula steps might be easier. Whatever you do the goal is to add two columns with these custom ranges. The next challenge will be sorting the ranges for presentation.

--Tom

View solution in original post

tgb417 · ‎10-16-2021

@cdutoit ,

In thinking about your challenge I was thinking about using a join recipe.

First you might create a 3 column table, something like this

min_val	max_val	dollar_range_name
0	1000000	“$0 - $1M”
1000000.01	10000000	“$1M - $10M”
10000000.01	50000000	“$10M - $50M”

Note you need to be careful not to have overlapping range values in this table.

Then you could use a join recipe to join the table above to the data you are showing in the first table.

There would be two conditions of your join conditions

revenue_aggregate >= min_val

revenue_aggregate <= max_val

If done carefully this can allow you to add a new categorical column dollar_range_name to each row of your data. Giving you the custom dollar values ranges shown at the top of the final table you are showing.

The same approach with a different table would allow you to add an employee count custom range as well.

What is not clear to me is the values in the cells of your final results. Is it something like an average across all of the organizations with similar employee count range and revenue range? If so this could be done by another group by recipe.

Just some thoughts that I hope are useful to you. Please let us know how you are getting on with this. Others please jump in with other ideas. There are of course many many ways to do this kind of thing.

P.S. You could also use a visual recipe and a formula step that uses some nested if statements to add a column with the custom range for revenue. And the same approach for an employee count custom range. Depending on the number of records you are dealing with and how often these ranges might change, the formula steps might be easier. Whatever you do the goal is to add two columns with these custom ranges. The next challenge will be sorting the ranges for presentation.

--Tom

cdutoit · ‎10-16-2021

@tgb417 Ahhh...thank you so much for pointing me in the right direction. Not sure why that escaped me. I went with your Formula approach and that works:

And yes, you're right the sorting is messed up for presentation purposes. I suppose one could prefix each with a "A", "B" etc to force a sort order or something.

Thanks so much for your reply and getting me steered in the right direction.

cdutoit · ‎10-16-2021

@tgb417 In the "more than 1 way to skin a cat" bucket, as I was working through your suggestion, it occurred to me one could also use the Bin step in a recipe since it, and I didn't know, actually supports custom binning:

Thanks again for breaking through my mental block!

tgb417 · ‎10-16-2021

Take a quick look at the bining step above. You appear to have an overlapping ranges and bin name mismatch on the second row of inputs.

--Tom

tgb417 · ‎10-16-2021

One can sometimes add a rank or sort order column for labels as a separate column. Making a header like below. You would sort on the Rank Value.

		1	2	3	4
		$0-$1M	$1M-$10M	$10M - $50M	$50M - $100M
1	1-10
2	10-25
3	25-50

Finally there are some systems that have ordered categorical values. The R language calls these factors. One can install R into Dataiku DSS. (In my experience it has been a bit tricky.) I’m not clear if generally DSS currently supports ordered categorical variables.

--Tom

tgb417 · ‎04-05-2023

Since the original writing of my post here Dataiku has added the visual recipe step “create if, then, else” that may also be helpful for this type of use case.

https://doc.dataiku.com/dss/latest/preparation/processors/create-if-then-else.html

--Tom

Sign up to take part

Custom Grouping or Binning

Custom Grouping or Binning