Easy "stratify" option at train/test split step of ML model training

Marlan · ‎04-19-2022

In projects involving classification tasks, the vast majority of the datasets I work with are unbalanced. I don’t think my experience is unusual. It is common practice to use a stratified split for such datasets so that the train and test datasets have the same proportion of target classes as the original dataset. For example, if 10% of the cases in the original dataset are positive, then 10% of the cases in both the train set and the test set will also be positive. For smaller datasets that are quite unbalanced, using a stratified split is essential.
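To make the 10% example concrete, here is a minimal sketch in plain Python of what a stratified split does. It is only an illustration, not how DSS implements anything; the function name `stratified_split` and the toy labels are made up for this example. In practice, scikit-learn's `train_test_split` accepts a `stratify` argument that serves the same purpose.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), sampling each class separately.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)

    # Group row indices by target class.
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set.
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy unbalanced target: 100 positives out of 1000 rows (10%).
labels = [1] * 100 + [0] * 900
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)

train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(train_rate, test_rate)
```

With this toy data, both the train and the test set end up with the same 10% positive rate as the full dataset, which a purely random split does not guarantee.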

DSS currently provides an option to use stratified splitting when doing a hyperparameter search. However, no such easy “check a box” option is available for the initial train / test split. Stratification can be achieved today by doing the split as a step in the flow and then selecting the option to use separate datasets for train and test, but this is extra work and definitely inconvenient. Moreover, by doing this, one loses the option to use a k-fold splitting approach to calculate margins on the accuracy metric. Finally, requiring this extra step means that stratification may not be used when it should be (for a variety of reasons, including lack of knowledge), resulting in a poorer model and/or an inaccurate estimate of accuracy.

The request, then, is to add a check box on the train / test split screen to enable a stratified split. A nice-to-have additional feature would be to preset this box to checked when an unbalanced dataset is detected.

Thanks,

Marlan

Krishna · ‎04-19-2022

Thanks @Marlan, totally agree! It's a great idea, and it's in our backlog.

Marlan · ‎04-19-2022

Good to hear, @Krishna - thanks!

RohitRanga · ‎05-11-2022

In the case of multi-label classification, we usually use iterative_train_test_split from the skmultilearn library. It would be good to have this too.

Stef · ‎10-31-2022

You asked for it... we did it!

In classification modelling tasks, you can now replace the default random sampling with stratified sampling to preserve the target variable distribution within every split.

This sampling option is currently only available if K-fold cross-test is enabled; we will explore support for other splitting strategies in the future.
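For readers who want to see what stratified sampling across K folds means, here is a minimal sketch in plain Python. It is an assumption-laden illustration, not the DSS implementation: the function name `stratified_kfold` and the toy labels are hypothetical. scikit-learn's `StratifiedKFold` offers the same idea as a library class.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, n_splits=5, seed=0):
    """Yield (train_idx, test_idx) pairs whose test folds keep the class mix.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)

    # Group row indices by target class.
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's rows round-robin so every fold gets its share.
        for position, i in enumerate(indices):
            folds[position % n_splits].append(i)

    for k in range(n_splits):
        test_idx = sorted(folds[k])
        train_idx = sorted(
            i for j, fold in enumerate(folds) if j != k for i in fold
        )
        yield train_idx, test_idx


# Toy unbalanced target: 50 positives out of 500 rows (10%).
labels = [1] * 50 + [0] * 450
fold_rates = []
for train_idx, test_idx in stratified_kfold(labels, n_splits=5):
    fold_rates.append(sum(labels[i] for i in test_idx) / len(test_idx))
print(fold_rates)
```

On this toy data every test fold contains 10% positives, so the per-fold metrics are computed on comparable class mixes.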

Stay tuned for more updates.

Marlan · ‎10-31-2022

This is a very welcome improvement - thank you!

Marlan

Labels

Machine Learning