Easy "stratify" option at train/test split step of ML model training

In projects involving classification tasks, the vast majority of the datasets I work with are unbalanced. I don't think my experience is unusual. It is common practice to use a stratified split for such datasets so that the train and test datasets have the same proportion of target classes as the original dataset. For example, if 10% of the cases in the original dataset are positive, then 10% of the cases in both the train set and the test set would be positive as well. For smaller datasets that are quite unbalanced, using a stratified split is essential, because a purely random split can leave the test set with very few positive cases, or none at all.
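To make the 10% example concrete, here is a minimal sketch of what a stratified split does. It is not DSS code: the function name stratified_split and its parameters are hypothetical, chosen only for illustration. In practice, scikit-learn's train_test_split offers the same behavior through its stratify argument.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), splitting each class separately.

    Hypothetical helper for illustration; not part of DSS or any library.
    """
    rng = random.Random(seed)

    # Group row indices by their class label.
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        # Shuffle within the class, then send the same fraction of
        # every class to the test set.
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy unbalanced dataset: 100 rows, 10% positive.
labels = [1] * 10 + [0] * 90
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)

train_pos = sum(labels[i] for i in train_idx) / len(train_idx)
test_pos = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), train_pos, test_pos)
```

Because each class is split on its own, both subsets keep the original 10% positive rate regardless of the random seed. When a class count times the test fraction is not a whole number, the proportions are only approximately preserved.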

Currently, DSS provides an option to use stratified splitting when doing a hyperparameter search. However, no such easy "check a box" option is available for the initial train / test split. Stratification can be achieved today by doing the split as a step in the flow and then selecting the option to use separate datasets for train and test, but this is extra work and definitely inconvenient. Moreover, by doing this, one loses the option to use a k-fold splitting approach to calculate margins on the accuracy metric. Finally, requiring this extra step means that stratification may not be used when it should be (for a variety of reasons, including lack of knowledge), resulting in a poorer model and/or an inaccurate estimate of accuracy.

The request, then, is to add a check box on the train / test split screen to enable a stratified split. A nice-to-have additional feature would be to preset this box to checked when an unbalanced dataset is detected.

Thanks,

Marlan

3 Comments
Krishna
Dataiker
Status changed to: In Backlog

Thanks @Marlan, totally agree! It's a great idea, and it's in our backlog.

Marlan
Neuron

Good to hear, @Krishna - thanks!

RohitRanga
Level 3

In the case of multi-label classification, we usually use the iterative_train_test_split function from the skmultilearn library. This would be good to have too.