
Split dataset by stratified sampling.


When I try to "split" a dataset randomly, I currently get the following options:

- Full random

- Random subset

Neither of those is what I usually use to split data into training and test sets: stratified sampling, which ensures that classes with very low presence (e.g. only a few dozen rows out of 10,000) appear in both sets. Did I overlook something, or is this not currently implemented?


2 Replies

This feature does not exist yet. It is in our backlog, but we don't have a target date for it.

This can be done in a Python recipe with a bit of help from pandas and scikit-learn.
Thanks, I implemented it as you said with a Python recipe. For others reading this later, it's a one-liner (after importing train_test_split from sklearn.model_selection):

df_train, df_test = train_test_split(df, test_size=0.2, stratify=df['label'])

But I'd love to see this added as a feature!
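
For anyone who wants to try the one-liner above outside a recipe first, here is a minimal, self-contained sketch. The DataFrame, its column names ("feature", "label") and the class sizes are made up for illustration; in a real Python recipe you would load your own dataset into df instead. The random_state argument is an optional addition that makes the split reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced dataset: 10,000 rows, only 30 in the rare class.
df = pd.DataFrame({
    "feature": range(10_000),
    "label": ["rare"] * 30 + ["common"] * 9_970,
})

# stratify=df["label"] asks scikit-learn to preserve the class proportions
# of "label" in both splits; random_state fixes the shuffle for repeatability.
df_train, df_test = train_test_split(
    df, test_size=0.2, stratify=df["label"], random_state=42
)

# Inspect how many rows of each class ended up in each split.
print(df_train["label"].value_counts())
print(df_test["label"].value_counts())
```

Note that train_test_split raises a ValueError when stratifying if any class has fewer than two rows, since such a class cannot be represented in both sets.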