Splitting dataset
miguel
Hi, I have a dataset with too few churners (90% of the rows are non-churners), so I want to split it into a train set and a test set, but I would like a higher percentage of churners in the train set.
I tried to use the split recipe but I can't manage to get what I want (either I get the same proportion of churners in the train set, or I get only churners).
Answers
Hi,
One way to achieve this is to use the split recipe in "filters" mode and define the filter with a formula.
For example, to split into "train_set_with_more_churners" and "test_set_with_fewer_churners", use:
* A filter that sends rows into "train_set_with_more_churners", with a formula like:
if (churner == 1, rand() < 0.8, rand() < 0.5)
* Send all other values into "test_set_with_fewer_churners"
This way:
* 80% of churners will be sent to train set, 20% of churners to test set
* 50% of non-churners will be sent to train set, 50% to test set
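To see what this does to the class proportions, here is a minimal sketch of the same logic outside the recipe, in Python with pandas and NumPy. The function name `split_by_class`, the synthetic data, and the seeds are illustrative assumptions, not part of the split recipe; only the column name `churner` and the 0.8 / 0.5 probabilities come from the formula above. With about 10% churners in the input, the train set should end up with roughly 15% churners and the test set with roughly 4%.

```python
import numpy as np
import pandas as pd

def split_by_class(df, label_col="churner", p_pos=0.8, p_neg=0.5, seed=0):
    """Send each row to train with a probability that depends on its class.

    Mirrors the formula if(churner == 1, rand() < 0.8, rand() < 0.5).
    """
    rng = np.random.default_rng(seed)
    draws = rng.random(len(df))  # one uniform draw in [0, 1) per row
    threshold = np.where(df[label_col] == 1, p_pos, p_neg)
    to_train = draws < threshold
    return df[to_train], df[~to_train]

# Synthetic data (assumption): about 10% churners, as in the question.
data_rng = np.random.default_rng(42)
data = pd.DataFrame({"churner": (data_rng.random(100_000) < 0.1).astype(int)})

train, test = split_by_class(data)
print("churn rate overall:", data["churner"].mean())
print("churn rate in train:", train["churner"].mean())
print("churn rate in test:", test["churner"].mean())
```

Note that the test set ends up with a lower churn rate than the original data, so metrics measured on it will not reflect the original class balance.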
If you have enough data and can afford to discard some of it, you can also use a sampling recipe in "class rebalancing" mode. Be aware that this subsamples the data, so some non-churners will be removed.
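The idea behind rebalancing by subsampling can be sketched in pandas as follows. This is only an illustration of undersampling the majority class, assuming churners are the minority; it is not how the sampling recipe is implemented, and the helper name `undersample_majority` is hypothetical.

```python
import pandas as pd

def undersample_majority(df, label_col="churner", seed=0):
    """Keep all churners and an equally sized random subset of non-churners."""
    pos = df[df[label_col] == 1]
    neg = df[df[label_col] == 0]
    n_keep = min(len(pos), len(neg))  # assumes churners are the minority class
    neg_kept = neg.sample(n=n_keep, random_state=seed)
    # Concatenate, then shuffle so the classes are not grouped together.
    return pd.concat([pos, neg_kept]).sample(frac=1, random_state=seed)

# Toy data (assumption): 100 churners and 900 non-churners.
toy = pd.DataFrame({"churner": [1] * 100 + [0] * 900})
balanced = undersample_majority(toy)
print(len(balanced), balanced["churner"].mean())
```

Here 800 of the 900 non-churners are thrown away, which is why this only makes sense when there is data to spare.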
Thanks for the advice!
I finally found an option to rebalance the sample before training the model. However, I don't know what percentage of churners I have in my data sample. Is there a way to find out, so that I can tell whether I need to retrain my model?
Hi Miguel, you can go to the dataset view, click on the header of the column that holds this information, and select 'Analyze' > categorical.
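If you prefer to check this in code (for example in a Python notebook), the same information can be computed with pandas. This is a small sketch that assumes the data is already loaded into a DataFrame with a 0/1 column named `churner`; the helper name `churn_share` and the sample data are made up for illustration.

```python
import pandas as pd

def churn_share(df, label_col="churner"):
    """Return the fraction of rows for each value of the label column."""
    return df[label_col].value_counts(normalize=True).sort_index()

# Toy data (assumption): 9 non-churners and 1 churner.
sample = pd.DataFrame({"churner": [0] * 9 + [1]})
shares = churn_share(sample)
print(shares)
```

Running the same check on each new data sample gives a number you can compare against the churn rate the model was trained on.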