Easy "stratify" option at train/test split step of ML model training

Marlan
Marlan Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Dataiku Frontrunner Awards 2021 Participant, Neuron 2023 Posts: 319 Neuron

In projects involving classification tasks, the vast majority of the datasets I work with are unbalanced, and I don't think my experience is unusual. It is common practice to use a stratified split for such datasets so that the train and test datasets have the same proportion of target classes as the original dataset. For example, if 10% of cases in the original dataset are positive, then 10% of cases in the train and test sets would also be positive. For smaller datasets that are quite unbalanced, a stratified split is essential.
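To make the idea concrete, here is a minimal standard-library sketch of a stratified split (purely illustrative; not Dataiku's implementation). It groups row indices by class and splits each class separately, so both sets keep the original class proportions:

```python
# Illustrative sketch of a stratified train/test split using only the
# standard library; not Dataiku's implementation.
import random
from collections import defaultdict

def stratified_split(labels, test_ratio=0.2, seed=42):
    """Split indices so each class keeps the same proportion in both sets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_ratio)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return train, test

# A 10%-positive dataset, as in the example above:
labels = [1] * 100 + [0] * 900
train_idx, test_idx = stratified_split(labels, test_ratio=0.2)
train_pos = sum(labels[i] for i in train_idx) / len(train_idx)
test_pos = sum(labels[i] for i in test_idx) / len(test_idx)
print(f"train positives: {train_pos:.1%}, test positives: {test_pos:.1%}")
# -> train positives: 10.0%, test positives: 10.0%
```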

Currently, DSS provides an option to use stratified splitting when doing a hyperparameter search, but such an easy "check a box" option is not available for the initial train/test split. One can achieve this today by doing the split as a step in the flow and then selecting the option to use separate datasets for train and test, but that is extra work and decidedly inconvenient. Moreover, doing so forfeits the option to use a k-fold splitting approach to calculate error margins on the accuracy metric. Finally, requiring this extra step means that stratification may not be used when it should be (for a variety of reasons, including lack of awareness), resulting in a poorer model and/or an inaccurate estimate of accuracy.

The request, then, is to add a check box on the train/test split screen to enable a stratified split. A nice-to-have additional feature would be presetting this to checked when an unbalanced dataset is detected.

Thanks,

Marlan

4 votes

Released · Last Updated

Comments

  • Krishna
    Krishna Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Product Ideas Manager Posts: 18 Dataiker

    Thanks @Marlan, totally agree! It's a great idea and it's in our backlog.

  • Marlan
    Marlan Posts: 319 Neuron

    Good to hear, @Krishna - thanks!

  • RohitRanga
    RohitRanga Registered Posts: 41 ✭✭✭✭

    In the case of multi-label classification, we usually use iterative_train_test_split from the skmultilearn library. It would be good to have this as well.
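    For reference, a short sketch of that skmultilearn call (assumes scikit-multilearn and numpy are installed; the toy data here is made up for illustration):

    ```python
    # Hedged sketch: multi-label stratified split with skmultilearn's
    # iterative_train_test_split. Toy data, purely illustrative.
    import numpy as np
    from skmultilearn.model_selection import iterative_train_test_split

    # 8 samples, 2 features each; 3 possible labels per sample (multi-label).
    X = np.arange(16).reshape(8, 2)
    y = np.array([
        [1, 0, 0],
        [1, 0, 0],
        [0, 1, 0],
        [0, 1, 0],
        [0, 0, 1],
        [0, 0, 1],
        [1, 1, 0],
        [1, 0, 1],
    ])

    # Iteratively balances each label's frequency across train and test.
    X_train, y_train, X_test, y_test = iterative_train_test_split(
        X, y, test_size=0.25
    )
    print(X_train.shape, X_test.shape)
    ```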

  • Stef
    Stef Dataiker, Registered, Product Ideas Manager Posts: 5 Dataiker

    You asked for it... we did it!

    In classification modelling tasks, you can now replace the default random sampling with stratified sampling to preserve the target variable distribution within every split.

    [Screenshot: Stef_0-1667224160709.png]

    This sampling option is currently only available when K-fold cross-test is enabled; we will explore support for other splitting strategies in the future.

    Stay tuned for more updates.
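    Conceptually, the K-fold cross-test case corresponds to stratified K-fold splitting. A sketch using scikit-learn's StratifiedKFold (illustrative of the concept only, not Dataiku's internal implementation):

    ```python
    # Sketch: stratified K-fold keeps each fold's class mix equal to the
    # full dataset's. Uses scikit-learn's StratifiedKFold for illustration.
    import numpy as np
    from sklearn.model_selection import StratifiedKFold

    X = np.zeros((100, 1))             # features are irrelevant here
    y = np.array([1] * 10 + [0] * 90)  # 10% positive class

    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
        # Each fold's test set keeps ~10% positives, mirroring the dataset.
        print(fold, y[test_idx].mean())
    ```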

  • Marlan
    Marlan Posts: 319 Neuron

    This is a very welcome improvement - thank you!

    Marlan
