Binning Strategy for Subpopulation Analysis?

Jason · April 2023

I am doing an analysis on the US Census ML dataset, and I've built a model. When I evaluate that model in the Subpopulation Analysis tab, I am choosing "age" as my variable. At the top, I can see it has created 10 bins to put the data in, and they all have roughly the same modality (between 8 and 12%). However, the spread (min to max) of each bin seems to have no rhyme or reason. Some bins cover 4 years, while one bin covers 32 years. (See screenshot.)

Binning Strategy.png

Can someone explain what the binning strategy is here? It's not a terribly useful breakout. For age, there is an obvious fix which is to create a new column that contains labels aligned with whatever breakpoints I like, but if my variable wasn't age, but some other continuous variable, I'd like to know what the strategy is for binning the variable. Is this strategy selectable somewhere in the interface?
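For reference, the workaround mentioned above (a new column of labels aligned with hand-picked breakpoints) could be sketched roughly as follows. This is a minimal sketch, not anything Dataiku does internally: the breakpoints, the labels, and the `age_group` helper are all illustrative assumptions, and the sample ages are made up.

```python
from bisect import bisect_left

# Illustrative inclusive upper edges and labels, chosen by the analyst (assumptions).
UPPER_EDGES = [24, 34, 44, 54, 64]
LABELS = ["<25", "25-34", "35-44", "45-54", "55-64", "65+"]


def age_group(age):
    """Return the label of the first bin whose inclusive upper edge is >= age."""
    # bisect_left returns len(UPPER_EDGES) for ages above the last edge,
    # which maps to the open-ended final label.
    return LABELS[bisect_left(UPPER_EDGES, age)]


# Made-up sample values standing in for the dataset's "age" column.
ages = [17, 22, 26, 34, 45, 58, 64, 90]
groups = [age_group(a) for a in ages]
print(list(zip(ages, groups)))
```

The resulting label column is categorical, so selecting it in Subpopulation Analysis would give one subpopulation per label instead of automatically generated bins.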

I don't have an example immediately available, but from memory, I seem to recall doing this on a continuous variable in a different dataset/model and the modalities were not so well balanced, which makes me suspect either something has changed (we recently upgraded to 11) or there is a decision behind the scenes about how it will bin.

Thanks,

-Jason

Alexandru · April 2023

Hi @Jason,
I believe this query has been answered on a different channel; I'm just posting here for visibility.

Both categorical and numeric variables can be used to define subpopulations. When a numeric variable is chosen instead of a categorical one, the distribution is divided into bins. The blue bars represent the percentage of values belonging to each bin (so, based on your dataset, ages 22-26 make up 10%, ages 58-90 make up 10%, etc.). Selecting a specific subpopulation reveals that group's density chart and, for a classification task, its confusion matrix. More information about this feature can be found here: https://knowledge.dataiku.com/latest/ml-analytics/model-results/concept-subpopulation-analysis.html#concept-subpopulation-analysis
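Bins that each hold roughly 10% of the rows but span very different ranges are what equal-frequency (quantile) binning would produce on a skewed variable. Whether the Subpopulation Analysis tab uses exactly this strategy is an assumption, not something confirmed in this thread. The sketch below contrasts equal-width and equal-frequency binning on synthetic, made-up data; all names in it are illustrative.

```python
from bisect import bisect_right
from statistics import quantiles

# Synthetic, right-skewed values between 17 and 90 (made up for illustration).
values = [17 + 73 * (i / 999) ** 3 for i in range(1000)]
lo, hi = min(values), max(values)
N_BINS = 10

# Equal-width: every bin spans the same range, so the counts follow the skew.
width = (hi - lo) / N_BINS
width_counts = [0] * N_BINS
for v in values:
    # Clamp so the maximum value falls in the last bin rather than past it.
    width_counts[min(int((v - lo) / width), N_BINS - 1)] += 1

# Equal-frequency: cut at the deciles, so counts are balanced and ranges vary.
cuts = quantiles(values, n=N_BINS)  # 9 interior cut points
freq_counts = [0] * N_BINS
for v in values:
    freq_counts[bisect_right(cuts, v)] += 1

edges = [lo, *cuts, hi]
freq_ranges = [edges[i + 1] - edges[i] for i in range(N_BINS)]

print("equal-width counts:    ", width_counts)
print("equal-frequency counts:", freq_counts)
print("equal-frequency ranges:", [round(r, 1) for r in freq_ranges])
```

On data like this, the equal-frequency bins stay balanced while their ranges differ widely, which matches the pattern described in the question (some bins covering 4 years, one covering 32).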

Currently, it is not possible to customize the number of bins in the subpopulation analysis. This feature request has been logged in our backlog.
