Splitting dataset

miguel
miguel Registered Posts: 7 ✭✭✭✭
Hi, I have a dataset with too few churn instances (90% are non-churners), so I want to split the dataset into a train set and a test set, but I would like to have a higher percentage of churners in the train set.
I tried to use the split recipe but I can't manage to get what I want (either I get the same proportion of churners in the train set, or I get only churners).

Answers

  • Clément_Stenac
    Clément_Stenac Dataiker, Dataiku DSS Core Designer, Registered Posts: 753 Dataiker
    Hi,

    One way to achieve this is to use the split recipe in "filters" mode and define a filter with a formula.

    For example, to split into "train_set_with_more_churners" and "test_set_with_fewer_churners", use:

    * A filter that sends rows into "train_set_with_more_churners" with a formula like:
    if (churner == 1, rand() < 0.8, rand() < 0.5)

    * Send all other values into "test_set_with_fewer_churners"

    This way:
    * 80% of churners will be sent to train set, 20% of churners to test set
    * 50% of non-churners will be sent to train set, 50% to test set
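
To see what these probabilities do to the class balance, here is a minimal Python sketch that mimics the formula outside of DSS. It is only an illustration, not how the split recipe is implemented: the `split_rows` and `churn_rate` helpers, the `churner` key and the synthetic 10% churn data are all assumptions made for the example. With 10% churners overall, the expected churner share is roughly 0.08 / 0.53 ≈ 15% in the train set and 0.02 / 0.47 ≈ 4% in the test set.

```python
import random


def split_rows(rows, label_key="churner", p_churner=0.8, p_other=0.5, seed=0):
    """Send each row to train with a probability that depends on its class.

    Mirrors: if (churner == 1, rand() < 0.8, rand() < 0.5)
    """
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        p_train = p_churner if row[label_key] == 1 else p_other
        if rng.random() < p_train:
            train.append(row)
        else:
            test.append(row)
    return train, test


def churn_rate(rows, label_key="churner"):
    """Fraction of rows whose label equals 1 (0.0 for an empty list)."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r[label_key] == 1) / len(rows)


# Synthetic data (assumption): 10,000 rows, exactly 10% churners.
rows = [{"churner": 1 if i % 10 == 0 else 0} for i in range(10000)]
train, test = split_rows(rows)

print(f"overall: {churn_rate(rows):.3f}")
print(f"train:   {churn_rate(train):.3f}")
print(f"test:    {churn_rate(test):.3f}")
```

Note that the test set ends up with a lower churner share than the original data, so metrics computed on it will not reflect the original class balance.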



    If you have enough data and can afford to discard some, you can also use a sampling recipe in "class rebalancing" mode. Note that this subsamples the data, so some non-churners will be removed.
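
The general idea of rebalancing by subsampling can be sketched in plain Python as below. This is an assumption-laden illustration of random undersampling, not the algorithm the DSS sampling recipe actually uses: the `undersample_majority` helper and the `churner` key are hypothetical, and here every class is simply cut down to the size of the smallest one.

```python
import random


def undersample_majority(rows, label_key="churner", seed=0):
    """Randomly drop rows so that every class keeps as many rows as the smallest class."""
    rng = random.Random(seed)

    # Group rows by class label.
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)

    # Keep n rows per class, where n is the size of the rarest class.
    n = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, n))

    rng.shuffle(balanced)
    return balanced


# Synthetic data (assumption): 100 churners and 900 non-churners.
data = [{"churner": 1}] * 100 + [{"churner": 0}] * 900
balanced = undersample_majority(data)
print(len(balanced))  # 800 non-churners were discarded
```

The cost is visible in the example: most of the non-churner rows are thrown away, which is why this only makes sense when data is plentiful.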
  • miguel
    miguel Registered Posts: 7 ✭✭✭✭
    Thanks for the advice!
    I finally found an option to rebalance the sample before training the model. However, I don't know what percentage of churners I have in my data sample. Is there a way to find out, so I can tell whether I need to retrain my model or not?
  • Alex_Combessie
    Alex_Combessie Alpha Tester, Dataiker Alumni Posts: 539 ✭✭✭✭✭✭✭✭✭
    Hi Miguel, you can go to the dataset view, click on the column header where you have this information, and select 'Analyze' > categorical.
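
If the column values are available outside the dataset view (for example in a notebook), the same percentage can be computed by hand. A minimal sketch, assuming the labels have already been loaded into a plain Python list; the `class_shares` helper is hypothetical and not part of any DSS API:

```python
from collections import Counter


def class_shares(labels):
    """Return the fraction of rows for each distinct label value."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Example labels (assumption): 1 = churner, 0 = non-churner.
labels = [1, 0, 0, 0]
shares = class_shares(labels)
print(f"churners: {shares.get(1, 0.0):.1%}")
```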