
Easy "stratify" option at train/test split step of ML model training

In projects involving classification tasks, the vast majority of the datasets I work with are unbalanced. I don’t think my experience is unusual. It is common practice to use a stratified split for such datasets so that the train and test datasets have the same proportion of target classes as the original dataset. For example, if 10% of the cases in the original dataset are positive, then 10% of the cases in both the train set and the test set are positive as well. For smaller datasets that are quite unbalanced, using a stratified split is essential.
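To make the 10% example concrete, here is a minimal sketch of a stratified split outside DSS, using scikit-learn's `train_test_split` with its `stratify` argument. The data is synthetic and the sizes (1000 rows, 20% test) are assumptions chosen only for illustration; this is not how DSS implements the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, illustrative data: 1000 rows, 10% positive class.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.array([1] * 100 + [0] * 900)

# stratify=y asks scikit-learn to keep the class proportions of y
# (here 10% positives) in both the train and the test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

train_rate = y_train.mean()
test_rate = y_test.mean()
print(f"train positives: {train_rate:.2%}, test positives: {test_rate:.2%}")
```

Without `stratify=y`, the share of positives in a small test set can drift noticeably from 10% purely by chance.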

Currently DSS provides an option to use stratified splitting when doing a hyperparameter search. However, no such easy “check a box” option is available for the initial train / test split. The same result can be achieved today by doing the split as a step in the flow and then selecting the option to use separate datasets for train and test. However, this is extra work and definitely inconvenient. Moreover, by doing this, one loses the option to use a k-fold splitting approach to calculate margins on the accuracy metric. Finally, requiring this extra step means that stratification may not be used when it should be (for a variety of reasons, including lack of knowledge), resulting in a poorer model and/or an inaccurate estimate of accuracy.
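As an illustration of what combining stratification with a k-fold approach gives you, the sketch below uses scikit-learn's `StratifiedKFold` and `cross_val_score` to get a mean accuracy and a spread across folds. The synthetic data, the choice of `LogisticRegression`, and the 5 folds are assumptions for the example only, and it does not reflect DSS internals.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, illustrative data: 10% positives, shifted so they are learnable.
rng = np.random.default_rng(0)
y = np.array([1] * 100 + [0] * 900)
X = rng.normal(size=(1000, 3)) + y[:, None]

# Each of the 5 test folds keeps roughly the original 10% positive rate.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_positive_rates = [y[test_idx].mean() for _, test_idx in cv.split(X, y)]

# One accuracy score per fold; the spread gives a rough margin on the metric.
scores = cross_val_score(LogisticRegression(), X, y, cv=cv, scoring="accuracy")
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Splitting manually in the flow yields a single train/test pair, so this per-fold spread is exactly what gets lost.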

The request, then, is to add a check box on the train / test split screen to enable a stratified split. A nice-to-have addition would be to preset this box to checked when an unbalanced dataset is detected.
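One possible way to detect an unbalanced dataset for the preset is sketched below. The helper name `should_stratify` and the `min_ratio` threshold of 0.5 are hypothetical, picked only for illustration; a real implementation would need its own rule.

```python
from collections import Counter


def should_stratify(labels, min_ratio=0.5):
    """Hypothetical heuristic: suggest stratifying when the rarest class
    has fewer than min_ratio times as many rows as the most common class."""
    counts = Counter(labels)
    if len(counts) < 2:
        # A single class (or no data) leaves nothing to stratify on.
        return False
    rarest = min(counts.values())
    most_common = max(counts.values())
    return rarest / most_common < min_ratio


unbalanced = [1] * 10 + [0] * 90   # 10% positives
balanced = [0, 1] * 50             # 50% positives
print(should_stratify(unbalanced), should_stratify(balanced))
```

Comparing the rarest class to the most common one, rather than to the total, keeps the rule usable for multi-class targets as well.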



Status changed to: In Backlog

Thanks @Marlan , totally agree! It's a great idea in our backlog.

Good to hear, @Krishna - thanks!

For multi-label classification, we usually use iterative_train_test_split from the skmultilearn library. It would be good to have this too.

Status changed to: Delivered

You asked for it... we did it!

In classification modelling tasks, you can now replace the default random sampling with stratified sampling to preserve the target variable distribution within every split.


This sampling option is currently only available if K-fold cross-test is enabled. We will explore support for other splitting strategies in the future.

Stay tuned for more updates.

This is a very welcome improvement - thank you!