Splitting dataset

miguel
Level 1
Hi, I have a dataset with very few churners (90% of the rows are non-churners). I want to split it into a train set and a test set, but I would like a higher percentage of churners in the train set.
I tried the split recipe but I can't get what I want: either I get the same proportion of churners in the train set, or I get only churners.
Clément_Stenac
Hi,

One way to achieve this is to use the split recipe in "filters" mode and define the filter with a formula.

For example, to split into "train_set_with_more_churners" and "test_set_with_fewer_churners", use:

* A filter that sends rows to "train_set_with_more_churners", with a formula like:
if (churner == 1, rand() < 0.8, rand() < 0.5)

* Send all remaining rows to "test_set_with_fewer_churners"

This way:
* About 80% of churners will be sent to the train set and 20% to the test set
* About 50% of non-churners will be sent to the train set and 50% to the test set
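
To see what these probabilities do to the class proportions, here is a minimal Python sketch of the same per-class random split. It is not the split recipe itself: the function name `split_rows`, the list-of-dicts row format and the synthetic data with 10% churners are assumptions made for illustration.

```python
import random

def split_rows(rows, label_key="churner", p_train_pos=0.8, p_train_neg=0.5, seed=0):
    """Send each row to train with a probability that depends on its class.

    Mirrors the logic of: if (churner == 1, rand() < 0.8, rand() < 0.5)
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    train, test = [], []
    for row in rows:
        p = p_train_pos if row[label_key] == 1 else p_train_neg
        (train if rng.random() < p else test).append(row)
    return train, test

def churn_rate(subset):
    """Fraction of rows in the subset that are churners."""
    return sum(r["churner"] for r in subset) / len(subset)

# Hypothetical data: 10,000 rows, 10% churners.
rows = [{"id": i, "churner": 1 if i % 10 == 0 else 0} for i in range(10_000)]
train, test = split_rows(rows)

print(f"train churn rate: {churn_rate(train):.3f}")
print(f"test churn rate:  {churn_rate(test):.3f}")
```

With 10% churners in the input, the expected share of churners is 0.8 × 0.1 / (0.8 × 0.1 + 0.5 × 0.9) ≈ 15% in the train set and 0.2 × 0.1 / (0.2 × 0.1 + 0.5 × 0.9) ≈ 4% in the test set. Because the split is random, the actual figures vary slightly from run to run.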



If you have enough data and can afford to discard some, you can also use a sampling recipe in "class rebalancing" mode. Note that this subsamples the data, so some non-churners will be removed.
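
As a rough illustration of what rebalancing by subsampling means, here is a small Python sketch that keeps every row of the minority class and draws an equally sized random sample from the majority class. This is only an assumption about the general idea, not a description of how the sampling recipe works internally; `undersample_majority` and the synthetic rows are hypothetical.

```python
import random

def undersample_majority(rows, label_key="churner", seed=0):
    """Keep all minority-class rows and an equal number of majority-class rows."""
    rng = random.Random(seed)
    pos = [r for r in rows if r[label_key] == 1]
    neg = [r for r in rows if r[label_key] != 1]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    kept = rng.sample(majority, len(minority))  # sample without replacement
    balanced = minority + kept
    rng.shuffle(balanced)  # avoid having all minority rows first
    return balanced

# Hypothetical data: 1,000 rows, 10% churners.
rows = [{"id": i, "churner": 1 if i % 10 == 0 else 0} for i in range(1_000)]
balanced = undersample_majority(rows)

n_churners = sum(r["churner"] for r in balanced)
print(len(balanced), n_churners)
```

The cost is visible in the row count: the balanced set keeps only as many non-churners as there are churners, so most of the original non-churner rows are dropped.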
miguel
Level 1
Author
Thanks for the advice!
I finally found an option to rebalance the sample before training the model. However, I don't know what percentage of churners I have in my data sample. Is there a way to find out, so that I can tell whether I need to retrain my model?
Alex_Combessie
Dataiker Alumni
Hi Miguel, you can go to the dataset view, click on the column header where you have this information, and select 'Analyze' > categorical.
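
If you prefer to check this in code, the same figure can be computed by counting values in the label column. This is a minimal Python sketch; `class_percentages` and the sample `labels` list are hypothetical, and loading the column from your dataset is left out.

```python
from collections import Counter

def class_percentages(labels):
    """Return the percentage of rows for each distinct label value."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {value: 100.0 * n / total for value, n in counts.items()}

# Hypothetical label column: 10 churners (1) and 90 non-churners (0).
labels = [1] * 10 + [0] * 90
percentages = class_percentages(labels)
print(percentages)
```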