Binning Strategy for Subpopulation Analysis?

Solved!
Jason

I am doing an analysis on the US Census ML dataset, and I've built a model. When I evaluate that model in the Subpopulation Analysis tab, I choose "age" as my variable. At the top, I can see it has created 10 bins for the data, and each bin holds roughly the same share of rows (between 8 and 12%). However, the spread (min to max) of each bin seems to have no rhyme or reason: some bins cover 4 years, while one bin covers 32 years. (See screenshot.)

[Screenshot: Binning Strategy.png]

Can someone explain what the binning strategy is here? It's not a terribly useful breakout. For age, there is an obvious fix, which is to create a new column that contains labels aligned with whatever breakpoints I like. But if my variable were not age but some other continuous variable, I'd like to know what strategy is used to bin it. Is this strategy selectable somewhere in the interface?
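For reference, the "new column with my own breakpoints" workaround mentioned above could look roughly like the sketch below. It is plain Python rather than anything DSS-specific, and the breakpoints, labels, and the `age_group` helper are all hypothetical choices made for illustration.

```python
from bisect import bisect_right

# Hypothetical breakpoints and labels; choose whatever suits the analysis.
# Each breakpoint is the lower edge of the next bin.
AGE_BREAKS = [25, 35, 45, 55, 65]
AGE_LABELS = ["<25", "25-34", "35-44", "45-54", "55-64", "65+"]


def age_group(age):
    """Return the label of the custom bin that contains `age`."""
    return AGE_LABELS[bisect_right(AGE_BREAKS, age)]


ages = [17, 24, 25, 39, 58, 64, 65, 90]
labels = [age_group(a) for a in ages]
print(list(zip(ages, labels)))
```

The resulting label column is categorical, so it can be chosen as the subpopulation variable in place of the raw numeric column.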

I don't have an example immediately available, but from memory, I seem to recall doing this on a continuous variable in a different dataset/model and the modalities were not so well balanced, which makes me suspect either something has changed (we recently upgraded to 11) or there is a decision behind the scenes about how it will bin.

Thanks,

-Jason

AlexT
Dataiker

Hi @Jason ,
I believe this query has already been answered on a different channel; I'm just posting here for visibility.

Both categorical and numeric variables can be used to define subpopulations. When a numeric variable is chosen instead of a categorical one, the distribution is divided into bins. The blue bars represent the percentage of values belonging to each bin (so, based on the customer's dataset, ages 22-26 make up 10%, ages 58-90 make up 10%, etc.). Selecting a specific subpopulation reveals that group’s density chart and, for a classification task, its confusion matrix. More information about this feature can be found here: https://knowledge.dataiku.com/latest/ml-analytics/model-results/concept-subpopulation-analysis.html#...
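Bins that each hold about the same share of rows but span very different ranges are what equal-frequency (quantile) binning produces on skewed data. Whether DSS uses exactly this method is not stated in this thread, so the sketch below is only an illustration of the idea, using made-up ages and four bins, compared against equal-width binning.

```python
from bisect import bisect_right
from collections import Counter
from statistics import quantiles

# Made-up, right-skewed ages (not taken from the Census dataset).
ages = [18, 19, 20, 21, 22, 23, 24, 25, 26, 27,
        28, 30, 32, 35, 38, 42, 47, 55, 68, 90]
N_BINS = 4  # illustrative; the thread reports 10 bins in DSS


def bin_counts(values, edges):
    """Count how many values fall into each bin delimited by `edges`."""
    counts = Counter(bisect_right(edges, v) for v in values)
    return [counts[i] for i in range(len(edges) + 1)]


# Equal-frequency: edges are quantiles, so every bin gets a similar row count.
quantile_edges = quantiles(ages, n=N_BINS)

# Equal-width: edges are evenly spaced between min and max.
lo, hi = min(ages), max(ages)
step = (hi - lo) / N_BINS
width_edges = [lo + step * i for i in range(1, N_BINS)]

quantile_counts = bin_counts(ages, quantile_edges)
width_counts = bin_counts(ages, width_edges)

print("quantile edges:", quantile_edges, "counts:", quantile_counts)
print("equal-width edges:", width_edges, "counts:", width_counts)
```

With the quantile edges, every bin holds the same number of rows, but the last bin has to stretch from 42 to 90 to do so. With equal-width edges, the ranges are uniform, but most rows land in the first bin.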

Currently, it is not possible to customize the number of bins in the subpopulation analysis. This feature request has been logged in our backlog.

