When I try to "split" a dataset randomly, I currently get the following options:
- Full random
- Random subset
Neither of those is what I usually use to split data into training/test sets: stratified sampling, which ensures that classes with very low presence (e.g. only a few dozen records out of 10,000) are represented in both sets. Is there something I overlooked, or is this not currently implemented?
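For reference, a stratified train/test split can be sketched in a few lines of Python, assuming pandas and scikit-learn are available. The toy DataFrame and the column names 'feature' and 'label' are made up for illustration; this is a minimal sketch, not a built-in split option.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 95 rows of class "common", 5 rows of class "rare".
df = pd.DataFrame({
    "feature": range(100),
    "label": ["common"] * 95 + ["rare"] * 5,
})

# stratify=df["label"] keeps the class proportions roughly equal in both splits,
# so the rare class shows up in training and in test.
df_train, df_test = train_test_split(
    df, test_size=0.2, stratify=df["label"], random_state=42
)

print(df_train["label"].value_counts())
print(df_test["label"].value_counts())
```

Note that train_test_split raises a ValueError when stratifying on a class that has only one record, since a single record cannot be placed in both sets.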
And if you have values of 'label' that appear in only a single record, and you want to make sure those records go to the training set, you need a few more lines:
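import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# df is assumed to be an existing DataFrame with a 'label' column.
# Count how often each label occurs.
values, counts = np.unique(df['label'], return_counts=True)
valseq1 = values[counts == 1]  # labels that appear exactly once
valsgt1 = values[counts > 1]   # labels that appear more than once
counteq1_df = df[df['label'].isin(valseq1)]
countgt1_df = df[df['label'].isin(valsgt1)]
# Stratify only the records whose label occurs more than once.
df_train, df_test = train_test_split(countgt1_df, test_size=0.2, stratify=countgt1_df['label'])
# Send all single-occurrence records to the training set.
df_train = pd.concat([df_train, counteq1_df], axis=0)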
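To check that the split behaves as intended, the same idea can be wrapped in a small helper and run on toy data. The function name split_keep_singletons, its parameters, and the toy labels are hypothetical and chosen only for this sketch; it assumes pandas and scikit-learn.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_keep_singletons(df, label_col="label", test_size=0.2, seed=0):
    """Stratified split; labels seen only once are all sent to training."""
    # Per-row frequency of that row's label.
    counts = df[label_col].map(df[label_col].value_counts())
    singles, rest = df[counts == 1], df[counts > 1]
    train, test = train_test_split(
        rest, test_size=test_size, stratify=rest[label_col], random_state=seed
    )
    return pd.concat([train, singles], axis=0), test

# Toy data (hypothetical): two frequent labels and two singleton labels.
df = pd.DataFrame({"label": ["a"] * 50 + ["b"] * 10 + ["x", "y"]})
df_train, df_test = split_keep_singletons(df)

# Every singleton label should be in training and absent from test.
assert {"x", "y"} <= set(df_train["label"])
assert not {"x", "y"} & set(df_test["label"])
```

Fixing the random seed makes the split reproducible, which helps when comparing models trained on the same partition.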