Nested Cross Validation, Group Fold, and support for column with fold id assignment?

UserBird
UserBird Dataiker, Alpha Tester Posts: 535 Dataiker
Hi, I'm trying to understand how to implement proper nested cross validation with group k-fold (the data is non-i.i.d., so all rows for a subject must be in the same fold), if possible using a precalculated fold id column in the dataset.

questions:

1- Although we have a custom kfold option on the grid search (inner), we don't have the custom option for kfold in train/test (outer) performance eval?

2-How to change the custom code sample on the grid search custom k fold option to allow for GroupKFold, ex: sklearn.model_selection.GroupKFold ?

3-for the outer train/test kfold, how to use a custom column with fold id assignment?



thanks!

Rui

Best Answer

  • Alex_Combessie
    Alex_Combessie Alpha Tester, Dataiker Alumni Posts: 539 ✭✭✭✭✭✭✭✭✭
    Answer ✓

    Hello,

    Thanks for your input. Please find my answers below each question:

    1- Although we have a custom kfold option on the grid search (inner), we don't have the custom option for kfold in train/test (outer) performance eval?

    At the moment, the custom kfold option is only available in the inner grid search for finding hyperparameters. Thanks for the suggestion, we will see if we can add this feature to the outer train/test kfold in the future.



    2-How to change the custom code sample on the grid search custom k fold option to allow for GroupKFold, ex: sklearn.model_selection.GroupKFold ?

    You can find a GroupKFold example among the code samples of the custom CV code screen.

    Note that at the moment it only works with integer columns which are passed as input to the model. We are looking to improve that in the future.
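For reference, here is a minimal sketch of how plain scikit-learn's `GroupKFold` behaves outside of DSS. The toy arrays and the `subject_id` name are assumptions made for illustration; this is not the DSS code sample itself.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical toy data: 12 rows belonging to 4 subjects (3 rows each).
X = np.arange(24, dtype=float).reshape(12, 2)
y = np.array([0, 1] * 6)
subject_id = np.repeat([10, 11, 12, 13], 3)

cv = GroupKFold(n_splits=4)
folds = list(cv.split(X, y, groups=subject_id))

for train_idx, test_idx in folds:
    train_subjects = set(subject_id[train_idx])
    test_subjects = set(subject_id[test_idx])
    # All rows of a subject land on the same side of the split.
    assert train_subjects.isdisjoint(test_subjects)
```

The groups are passed to `split` separately from `X`, so in your own code the subject column does not have to be one of the model's features.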



    3-for the outer train/test kfold, how to use a custom column with fold id assignment?

    See question 1: custom kfold on the train/test is not supported at the moment. Thanks for the input, it would be an interesting feature indeed.

    In general, if you want to configure your cross-validation strategy in a custom way that is not available in the visual interface, I suggest exporting one of the visual Machine Learning models as a Jupyter notebook and using it as a starting point to develop your own code.
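If you go the notebook route, a grouped nested cross-validation could be sketched roughly as follows with plain scikit-learn. This is only an illustration under assumptions: the synthetic data, the `fold_id` and `subject_id` arrays, the logistic regression model and the `C` grid are all hypothetical, and the code is not what the DSS notebook export generates.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, GroupKFold, LeaveOneGroupOut

# Hypothetical synthetic data: 12 subjects with 5 rows each.
rng = np.random.RandomState(0)
n_subjects, rows_per_subject = 12, 5
subject_id = np.repeat(np.arange(n_subjects), rows_per_subject)
X = rng.normal(size=(n_subjects * rows_per_subject, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=len(X)) > 0).astype(int)

# Hypothetical precomputed fold id column: all rows of a subject share one fold.
fold_id = subject_id % 3

outer_cv = LeaveOneGroupOut()      # outer loop: hold out one precomputed fold at a time
inner_cv = GroupKFold(n_splits=3)  # inner loop: grouped k-fold on subjects

outer_scores = []
for train_idx, test_idx in outer_cv.split(X, y, groups=fold_id):
    search = GridSearchCV(
        LogisticRegression(max_iter=1000),
        param_grid={"C": [0.1, 1.0, 10.0]},
        cv=inner_cv,
    )
    # The groups are only used to build the inner splits, not as a feature.
    search.fit(X[train_idx], y[train_idx], groups=subject_id[train_idx])
    # Score the refit best model on the held-out outer fold.
    outer_scores.append(search.score(X[test_idx], y[test_idx]))
```

Using `LeaveOneGroupOut` with the fold id as the group is one way to honor a precomputed fold assignment: each distinct fold id becomes the test set exactly once.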

    Cheers,

    Alexandre

Answers

  • UserBird
    UserBird Dataiker, Alpha Tester Posts: 535 Dataiker
    Hi Alexandre, thanks for replying, looking into it

    One more question, sorry: how do I use the leave-one-out sample to do this by ClientId, for example? I want to cross-validate so that all records belonging to a client are in either the train fold or the test fold, but never both. Should the integer column hold fold ids or user ids?

    still unclear
    thx!
  • Alex_Combessie
    Alex_Combessie Alpha Tester, Dataiker Alumni Posts: 539 ✭✭✭✭✭✭✭✭✭
    Two cases:
    1. Assuming you have several rows per ClientId:
    In order to apply a leave-one-out strategy, you would use our code sample for DKULeaveOneGroupOut:
    from dataiku.doctor.utils import crossval

    # You need to select the column (of the design matrix) that is used to split the dataset
    # This column is *after preprocessing* - so for example, categorical columns are not available
    # anymore.

    # To know the names of the columns after preprocessing, train a first model with regular crossval
    # and find the names in the "Features" section of the model results.

    # Note that the column will always be used for training

    # Put the name of the preprocessed column to split on between the quotes
    cv = crossval.DKULeaveOneGroupOut("")
    # The column (here, the client id) needs to be an integer

    Note that this is not ideal: the client_id column is also fed to the model as a feature, so there is a slight risk of overfitting. We are looking to improve that in the future.
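For comparison, in your own code (outside the DSS custom CV screen) scikit-learn's `LeaveOneGroupOut` takes the groups separately from the features, so the client id can stay out of the design matrix. A minimal sketch with hypothetical toy arrays:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Hypothetical toy data: client 1 has 2 rows, client 2 has 3 rows, client 3 has 1 row.
client_id = np.array([1, 1, 2, 2, 2, 3])
X = np.arange(12, dtype=float).reshape(6, 2)  # features only, no client_id column
y = np.array([0, 1, 0, 1, 0, 1])

logo = LeaveOneGroupOut()
held_out = []
for train_idx, test_idx in logo.split(X, y, groups=client_id):
    # Each split holds out every row of exactly one client.
    held_out.append(sorted(set(client_id[test_idx].tolist())))
```

In scikit-learn the group values can be client ids or precomputed fold ids: there is one split per distinct value, and all rows sharing a value are held out together.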

    2. Assuming there is only one row per client id:
    Directly use sklearn.model_selection.LeaveOneOut in the custom CV code screen ;)
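As a quick illustration of the second case, `LeaveOneOut` produces one split per row, each with a single test row. The toy arrays below are hypothetical.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

# Hypothetical toy data: 5 rows, one per client.
X = np.arange(10, dtype=float).reshape(5, 2)
y = np.array([0, 1, 0, 1, 0])

loo = LeaveOneOut()
n_splits = loo.get_n_splits(X)
test_sizes = [len(test_idx) for _, test_idx in loo.split(X)]
```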
  • UserBird
    UserBird Dataiker, Alpha Tester Posts: 535 Dataiker
    ok checking it, thx Alexandre!
  • omallet
    omallet Registered Posts: 3 ✭✭✭✭
    Hi Alexandre,
    Do you have any idea of when this custom Cross Validation option will be available ?
    Thank you,
    Oscar
  • Alex_Combessie
    Alex_Combessie Alpha Tester, Dataiker Alumni Posts: 539 ✭✭✭✭✭✭✭✭✭
    Do you mean the ability to code your own CV object for the "test" phase, as is already possible for the hyperparameter search phase?
  • omallet
    omallet Registered Posts: 3 ✭✭✭✭
    Yes, absolutely, it would be very convenient. I personally would love to be able to use GroupKFold cross validation in the test phase.
  • Alex_Combessie
    Alex_Combessie Alpha Tester, Dataiker Alumni Posts: 539 ✭✭✭✭✭✭✭✭✭
    Thanks for the feedback, I have logged this to our product team.