When I try to "split" a dataset randomly, I currently get the following options:
- Full random
- Random subset
Neither of those is what I typically use to split into training/test data: stratified sampling, which ensures that classes with very low presence (e.g. only a few dozen out of 10,000 records) are present in both sets. Is there something I have overlooked, or is this not currently implemented?
Well, it's a 2-liner, the other line being
from sklearn.model_selection import train_test_split
😉
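
Spelled out, the two lines would look something like this (a minimal sketch, assuming a DataFrame df with the class column named 'label', as in the snippet further down, and an 80/20 split):

from sklearn.model_selection import train_test_split

# stratify= keeps the class proportions of 'label' (roughly) the same in both splits
df_train, df_test = train_test_split(df, test_size=0.2, stratify=df['label'])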
And if you have values of 'label' that appear in only a single record (train_test_split refuses to stratify on a class with only one member) and you want to make sure those records go to the training set, you need a few more lines:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Separate the labels that occur exactly once from those that occur more than once
values, counts = np.unique(df['label'], return_counts=True)
valseq1 = values[counts == 1]
valsgt1 = values[counts > 1]
counteq1_df = df[df['label'].isin(valseq1)]
countgt1_df = df[df['label'].isin(valsgt1)]
# Stratify only on the labels that have at least two records...
df_train, df_test = train_test_split(countgt1_df, test_size=0.2, stratify=countgt1_df['label'])
# ...and then force the single-record labels into the training set
df_train = pd.concat([df_train, counteq1_df], axis=0)
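
If you want to double-check the result, a quick sanity check (assuming the same df as above) could be:

# every distinct label should now appear in the training set
assert set(df['label'].unique()) <= set(df_train['label'].unique())
print(len(df_train), len(df_test))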