Easy "stratify" option at train/test split step of ML model training

In projects involving classification tasks, the vast majority of the datasets I work with are unbalanced. I don’t think my experience is unusual. It is common practice to use a stratified split for such datasets so that the train and test sets have the same proportion of target classes as the original dataset. For example, if 10% of the cases in the original dataset are positive, then 10% of the cases in both the train set and the test set will also be positive. For smaller datasets that are quite unbalanced, using a stratified split is essential.
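To make the 10% example concrete, here is a minimal sketch of a stratified split using only the Python standard library. It is an illustration, not how DSS does it: the function name `stratified_split`, the 1,000-row toy dataset, and the 20% test fraction are all assumptions made up for this example. In practice, scikit-learn's `train_test_split` with `stratify=y` does the same job.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with class proportions preserved.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)

    # Group row indices by target class.
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for idx in by_class.values():
        # Shuffle within each class, then carve off the same fraction
        # of every class for the test set.
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


def positive_rate(labels, idx):
    """Share of rows in idx whose label is 1."""
    return sum(labels[i] for i in idx) / len(idx)


# Assumed toy data: 1,000 rows, 10% positive.
labels = [1] * 100 + [0] * 900
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)

print(positive_rate(labels, train_idx), positive_rate(labels, test_idx))
```

With a plain random split, the positive rate in a small test set can drift well away from 10%; splitting within each class removes that source of variation.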

Currently DSS provides an option to use stratified splitting when doing a hyperparameter search. However, no such easy “check a box” option is available for the initial train / test split. It can be achieved currently by doing the split as a step in the flow and then selecting the option to use separate datasets for train and test. However, this is extra work and definitely inconvenient. Moreover, by doing this, one loses the option to use a k-fold splitting approach to calculate margins on the accuracy metric. Finally, requiring this extra step means that stratification may not be used when it should be (for a variety of reasons, including lack of knowledge), resulting in a poorer model and/or an inaccurate estimate of accuracy.

The request, then, is to add a check box on the train / test split screen to enable a stratified split. A nice-to-have additional feature would be to preset this box to checked when an unbalanced dataset is detected.

Thanks,

Marlan

6 Comments
Krishna
Dataiker

Thanks @Marlan , totally agree! It's a great idea in our backlog.

Status changed to: In Backlog

Good to hear, @Krishna - thanks!

RohitRanga
Level 3

In the case of multi-label classification, we usually use iterative_train_test_split from the skmultilearn library. This would be good to have too.

CoreyS
Dataiker Alumni
 
Status changed to: Released
 
Stef
Dataiker

You asked for it... we did it!

In classification modelling tasks, you can now replace the default random sampling with stratified sampling to preserve the target variable distribution within every split.

Stef_0-1667224160709.png

This sampling option is currently only available if K-fold cross-test is enabled; we will explore support for other splitting strategies in the future.
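For readers who want to see what "preserve target variable distribution within every split" means in a K-fold setting, here is a rough standard-library sketch. It is not the DSS implementation; the function name `stratified_k_folds`, the 500-row toy dataset, and `k=5` are assumptions for illustration. scikit-learn's `StratifiedKFold` is the usual library equivalent.

```python
import random
from collections import defaultdict


def stratified_k_folds(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs, one per fold.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)

    # Group row indices by target class.
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    folds = [[] for _ in range(k)]
    for idx in by_class.values():
        rng.shuffle(idx)
        # Deal each class's rows round-robin so every fold gets its share.
        for position, row in enumerate(idx):
            folds[position % k].append(row)

    for f in range(k):
        test_idx = sorted(folds[f])
        train_idx = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train_idx, test_idx


# Assumed toy data: 500 rows, 10% positive.
labels = [1] * 50 + [0] * 450
splits = list(stratified_k_folds(labels, k=5))
fold_rates = [
    sum(labels[i] for i in test_idx) / len(test_idx)
    for _, test_idx in splits
]
print(fold_rates)
```

Because every fold carries the same class mix, differences in the metric from fold to fold reflect the model rather than an unlucky draw of positives.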

Stay tuned for more updates.

This is a very welcome improvement - thank you!

Marlan
