Custom Grouping or Binning

cdutoit
cdutoit Registered Posts: 12 ✭✭✭✭

Hi - I'm trying to get advice on creating custom grouping 'bins'.

My raw data file is essentially a list of companies that include employee count and corresponding revenue. I would like to see a breakdown of revenue by employee count, in defined groups/bins/brackets.

Clearly I can easily run a Group By recipe to get the total revenue for each group of employees:

Screen Shot 2021-10-16 at 10.12.38 AM.png

That's all good but I want to group it by a custom set of brackets or bins.

For example, in groups of Employee count of 1-10, 10-50, 50-200 (different ranges):

Screen Shot 2021-10-16 at 10.13.53 AM.png

Ideally at the "data" level instead of using the Table Chart, because the table Chart does not allow custom binning ranges, and also does not allow customizing the sort order.

So ideally trying to figure out the recipe needed to create a new dataset that is custom grouped by employee "ranges" instead of at each distinct employee count.

Any pointers greatly appreciated.

Best Answer

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,598 Neuron
    Answer ✓

    @cdutoit
    ,

    In thinking about your challenge I was thinking about using a join recipe.

    First you might create a 3 column table, something like this

    min_valmax_valdollar_range_name
    01000000“$0 - $1M”
    1000000.0110000000“$1M - $10M”
    10000000.0150000000“$10M - $50M”

    Note you need to be careful not to have overlapping range values in this table.

    Then you could use a join recipe to join the table above to the data you are showing in the first table.

    There would be two conditions of your join conditions

    revenue_aggregate >= min_val

    revenue_aggregate <= max_val

    If done carefully this can allow you to add a new categorical column dollar_range_name to each row of your data. Giving you the custom dollar values ranges shown at the top of the final table you are showing.

    The same approach with a different table would allow you to add an employee count custom range as well.

    What is not clear to me is the values in the cells of your final results. Is it something like an average across all of the organizations with similar employee count range and revenue range? If so this could be done by another group by recipe.

    Just some thoughts that I hope are useful to you. Please let us know how you are getting on with this. Others please jump in with other ideas. There are of course many many ways to do this kind of thing.

    P.S. You could also use a visual recipe and a formula step that uses some nested if statements to add a column with the custom range for revenue. And the same approach for an employee count custom range. Depending on the number of records you are dealing with and how often these ranges might change, the formula steps might be easier. Whatever you do the goal is to add two columns with these custom ranges. The next challenge will be sorting the ranges for presentation.

Answers

  • cdutoit
    cdutoit Registered Posts: 12 ✭✭✭✭

    @tgb417
    Ahhh...thank you so much for pointing me in the right direction. Not sure why that escaped me. I went with your Formula approach and that works:

    Screen Shot 2021-10-16 at 2.24.50 PM.png

    And yes, you're right the sorting is messed up for presentation purposes. I suppose one could prefix each with a "A", "B" etc to force a sort order or something.

    Thanks so much for your reply and getting me steered in the right direction.

  • cdutoit
    cdutoit Registered Posts: 12 ✭✭✭✭

    @tgb417
    In the "more than 1 way to skin a cat" bucket, as I was working through your suggestion, it occurred to me one could also use the Bin step in a recipe since it, and I didn't know, actually supports custom binning:

    Screen Shot 2021-10-16 at 2.33.53 PM.png

    Thanks again for breaking through my mental block!

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,598 Neuron

    One can sometimes add a rank or sort order column for labels as a separate column. Making a header like below. You would sort on the Rank Value.

    1234
    $0-$1M$1M-$10M$10M - $50M$50M - $100M
    11-10
    210-25
    325-50

    Finally there are some systems that have ordered categorical values. The R language calls these factors. One can install R into Dataiku DSS. (In my experience it has been a bit tricky.) I’m not clear if generally DSS currently supports ordered categorical variables.

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,598 Neuron

    Take a quick look at the bining step above. You appear to have an overlapping ranges and bin name mismatch on the second row of inputs.

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,598 Neuron

    Since the original writing of my post here Dataiku has added the visual recipe step “create if, then, else” that may also be helpful for this type of use case.

    https://doc.dataiku.com/dss/latest/preparation/processors/create-if-then-else.html

Setup Info
    Tags
      Help me…