Nested Cross Validation, Group Fold, and support for column with fold id assignment?
questions:
1 - Although we have a custom k-fold option in the grid search (inner), we don't have a custom k-fold option for the train/test (outer) performance evaluation?
2 - How do we change the custom code sample in the grid search custom k-fold option to allow for GroupKFold, e.g. sklearn.model_selection.GroupKFold?
3 - For the outer train/test k-fold, how do we use a custom column with fold id assignments?
thanks!
Rui
Best Answer
-
Hello,
Thanks for your input. Please find the answers below, after each quoted question:
1 - Although we have a custom k-fold option in the grid search (inner), we don't have a custom k-fold option for the train/test (outer) performance evaluation?
At the moment, the custom k-fold option is only available in the inner grid search used for finding hyperparameters. Thanks for the suggestion, we will see if we can add this feature to the outer train/test k-fold in the future.
2 - How do we change the custom code sample in the grid search custom k-fold option to allow for GroupKFold, e.g. sklearn.model_selection.GroupKFold?
You can find code samples for GroupKFold among the code samples of the custom CV code screen; a rough sketch of what such a snippet can look like is shown below. Note that at the moment this only works with integer columns that are passed as input to the model. We are looking to improve that in the future.
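For illustration only, here is a minimal sketch (not the exact built-in sample) of what such a snippet could look like. It assumes the custom CV code screen expects a scikit-learn compatible splitter assigned to a variable named cv (as in the DKULeaveOneGroupOut sample further down), and that the preprocessed design matrix is passed as a pandas DataFrame that still contains the integer grouping column. GroupKFoldOnColumn and "client_id" are made-up names:

from sklearn.model_selection import GroupKFold


class GroupKFoldOnColumn(object):
    """GroupKFold driven by one column of the (preprocessed) design matrix."""

    def __init__(self, column_name, n_splits=5):
        self.column_name = column_name
        self.n_splits = n_splits

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits

    def split(self, X, y=None, groups=None):
        # Assumes X is a pandas DataFrame still containing the integer grouping column
        group_values = X[self.column_name]
        return GroupKFold(n_splits=self.n_splits).split(X, y, groups=group_values)


# "client_id" is only a placeholder: use the name of your grouping column after preprocessing
cv = GroupKFoldOnColumn("client_id", n_splits=5)

With such a splitter, all rows sharing the same value in the grouping column always land in the same fold, which is exactly the guarantee GroupKFold provides.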
3 - For the outer train/test k-fold, how do we use a custom column with fold id assignments?
See question 1: a custom k-fold on the train/test split is not supported at the moment. Thanks for the input, it would indeed be an interesting feature.
In general, if you want to configure your cross-validation strategy in a custom way that is not available in the visual interface, I suggest exporting one of the visual Machine Learning models as a Jupyter notebook and using it as a starting point to develop your own code.
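For example, once the model code is in a notebook, a grouped outer evaluation could look like the rough sketch below. This is not code generated by the export; the file, column, and estimator names (training_data.csv, target, client_id, RandomForestClassifier) are placeholders:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

# Placeholder names: replace with your own dataset, target, and grouping column
df = pd.read_csv("training_data.csv")
y = df["target"]
groups = df["client_id"]                       # all rows of a client end up in the same fold
X = df.drop(columns=["target", "client_id"])   # keep the grouping column out of the features

outer_cv = GroupKFold(n_splits=5)
scores = cross_val_score(RandomForestClassifier(), X, y, groups=groups, cv=outer_cv)
print(scores.mean(), scores.std())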
Cheers,
Alexandre
Answers
-
Hi Alexandre, thanks for replying, looking into it
One question still, sorry: how do I use the leave-one-out sample to do this by ClientId, for example? I want to cross-validate in such a way that all records belonging to a client are in the train fold or the test fold, but never both. Should the integer columns be fold ids, or client ids?
still unclear
thx! -
Two cases:
1. Assuming you have several rows per ClientId:
In order to apply a leave-one-out strategy, you would use our code sample for DKULeaveOneGroupOut:
from dataiku.doctor.utils import crossval

# You need to select the column (of the design matrix) that is used to split the dataset.
# This column is taken *after preprocessing*, so for example categorical columns are not
# available anymore.
# To know the names of the columns after preprocessing, train a first model with a regular
# cross-validation and find the names in the "Features" section of the model results.
# Note that the column will always be used for training, and that it needs to be an integer
# column (here, an integer client_id).
cv = crossval.DKULeaveOneGroupOut("client_id")  # replace "client_id" with your own column name
Note that this is not ideal, since it means the client_id will be fed to the model, so there is a slight risk of overfitting. We are looking to improve that in the future.
2. Assuming there is only one row per client id:
Directly use sklearn.model_selection.LeaveOneOut in the custom CV code screen.
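In that screen, it reduces to something like this minimal sketch (again assuming the screen expects a splitter assigned to a variable named cv):

from sklearn.model_selection import LeaveOneOut

# Each row becomes its own test fold; only sensible when every client has exactly one row
cv = LeaveOneOut()
-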
ok checking it, thx Alexandre!
-
Hi Alexandre,
Do you have any idea when this custom Cross Validation option will be available?
Thank you,
Oscar -
Do you mean the ability to code your own CV object for the "test" phase, as is already possible for the hyperparameter search phase?
-
Yes, absolutely, it would be very convenient. I personally would love to be able to use GroupKFold cross-validation in the test phase.
-
Thanks for the feedback, I have logged this with our product team.