Binning Strategy for Subpopulation Analysis?
I am doing an analysis on the US Census ML dataset, and I've built a model. When I evaluate that model in the Subpopulation Analysis tab, I am choosing "age" as my variable. At the top, I can see it has created 10 bins to put the data in, and they all hold roughly the same share of the data (between 8 and 12% each). However, the spread (min to max) of each bin seems to have no rhyme or reason. Some bins cover 4 years, one bin covers 32 years. (See screenshot.)
Can someone explain what the binning strategy is here? It's not a terribly useful breakout. For age, there is an obvious fix which is to create a new column that contains labels aligned with whatever breakpoints I like, but if my variable wasn't age, but some other continuous variable, I'd like to know what the strategy is for binning the variable. Is this strategy selectable somewhere in the interface?
I don't have an example immediately available, but from memory, I seem to recall doing this on a continuous variable in a different dataset/model and the modalities were not so well balanced, which makes me suspect either something has changed (we recently upgraded to 11) or there is a decision behind the scenes about how it will bin.
Thanks,
-Jason
Best Answer
Alexandru (Dataiker)
Hi @Jason,
I believe this query has already been answered on a different channel; I'm just posting here for visibility.

Both categorical and numeric variables can be used to define subpopulations. When a numeric variable is chosen instead of a categorical one, its distribution is divided into bins. The blue bars represent the percentage of values belonging to each bin (so, based on your dataset, ages 22-26 make up 10%, ages 58-90 make up 10%, etc.). Selecting a specific subpopulation reveals that group's density chart and, for a classification task, its confusion matrix. More information about this feature can be found here: https://knowledge.dataiku.com/latest/ml-analytics/model-results/concept-subpopulation-analysis.html#concept-subpopulation-analysis
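Bins that each hold about 10% of the rows but span very different ranges are what equal-frequency (quantile) binning produces on a skewed variable. Whether DSS computes its bins exactly this way is an assumption, not something confirmed in this thread. A minimal sketch on made-up ages (the sample, seed, and variable names are all hypothetical) shows the effect:

```python
import bisect
import random
import statistics

random.seed(0)
# Hypothetical skewed "age" sample: many young adults, long tail, capped at 90.
ages = [min(90, 17 + int(random.expovariate(1 / 20))) for _ in range(10_000)]

# Nine decile cut points split the data into ten roughly equal-frequency bins.
cuts = statistics.quantiles(ages, n=10)

# Bin index 0..9 for every row; values equal to a cut point go to the upper bin.
bin_index = [bisect.bisect_right(cuts, a) for a in ages]

# One (min age, max age, share of rows) tuple per bin.
summary = []
for b in range(10):
    members = [a for a, i in zip(ages, bin_index) if i == b]
    summary.append((min(members), max(members), len(members) / len(ages)))

for b, (lo, hi, share) in enumerate(summary):
    print(f"bin {b}: ages {lo}-{hi}, {share:.0%} of rows")
```

Because many rows share the same integer age, the shares land near 10% rather than exactly on it, and the bins in the sparse tail have to stretch over many more years to collect their share of rows.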
Currently, it is not possible to customize the number of bins in the subpopulation analysis. A feature request for this has been logged in our backlog.
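In the meantime, the workaround mentioned in the question (a new column of labels aligned with hand-picked breakpoints, then used as a categorical subpopulation variable) could look roughly like this. The breakpoints, labels, and the `age_group` function name are hypothetical, and how the column gets added to the dataset in DSS is not shown:

```python
import bisect

# Hypothetical breakpoints and labels; pick whatever suits the analysis.
edges = [25, 35, 45, 55, 65]
labels = ["<25", "25-34", "35-44", "45-54", "55-64", "65+"]

def age_group(age):
    """Return the label of the hand-picked bin that contains `age`."""
    # bisect_right counts the edges <= age, which is the index of the label.
    return labels[bisect.bisect_right(edges, age)]

# Example: label a few ages.
groups = [age_group(a) for a in (19, 25, 64, 90)]
```

The same pattern works for any continuous variable: only `edges` and `labels` change, and `labels` always needs one more entry than `edges`.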