Nested cross validation and chosen best parameter from grid search by algorithm?

Solved!
UserBird
Dataiker

Hi, when using nested cross-validation for hyperparameter search/performance estimation, it's not clear how the best hyperparameters are chosen.



Since in nested CV we can actually get different "best parameters" per outer fold (i.e., unstable hyperparameters), how does Dataiku select and report the best one?



thanks!



(from https://sebastianraschka.com/faq/docs/evaluate-a-model.html)



Using nested cross-validation you will train m different logistic regression models, 1 for each of the m outer folds, and the inner folds are used to optimize the hyperparameters of each model (e.g., using grid search in combination with k-fold cross-validation).



If your model is stable, these m models should all have the same hyperparameter values, and you report the average performance of this model based on the outer test folds. Then, you proceed with the next algorithm, e.g., an SVM etc.
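
For reference, here is a minimal nested-CV sketch in scikit-learn of the procedure described above (illustrative only; the estimator, grid and fold counts are made up), showing how each outer fold can end up with its own "best" hyperparameters:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = make_classification(n_samples=500, random_state=0)

param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}  # made-up grid
outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)

# One grid search per outer fold: the inner folds choose the hyperparameters,
# the held-out outer fold estimates the performance of that choice.
for i, (train_idx, test_idx) in enumerate(outer_cv.split(X, y)):
    search = GridSearchCV(LogisticRegression(max_iter=1000),
                          param_grid, cv=inner_cv, scoring="roc_auc")
    search.fit(X[train_idx], y[train_idx])
    test_auc = search.score(X[test_idx], y[test_idx])
    # best_params_ can differ from one outer fold to the next
    print(f"outer fold {i}: best={search.best_params_}, test AUC={test_auc:.3f}")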

4 Replies
Alex_Combessie
Dataiker Alumni

Hello Rui,



This is a good question and an important topic indeed.



When using K-fold both for hyperparameter search and for testing in the DSS visual ML interface, here is what happens for a given model type (a rough scikit-learn sketch follows these steps):





  1. Hyperparameter search: The dataset is split into K_hyperparam random parts, with stratification with respect to the target. For each combination of hyperparameters in the grid, a model is trained K_hyperparam times to find the best combination. Finally, the model with the best combination is retrained on the entire dataset. This is the model used for deployment purposes.





    1. Example for 3 folds:





      • for each combination of hyperparameters:





        • train the model on folds 1+2 then evaluate on fold 3,




        • train the model on folds 1+3 then evaluate on fold 2,




        • train the model on folds 2+3 then evaluate on fold 1






      • Choose the combination of hyperparameters that maximizes the average of the chosen performance metric on all 3 folds








  2. Test: The dataset is split again into K_test random parts, independently from the previous randomization and with no stratification with respect to the target. The model with the best hyperparameter combination from step 1 is trained and evaluated on the new test folds in the same way as before. The reported performance metrics are averaged across the folds.
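
For intuition only, here is a rough scikit-learn approximation of these two steps (the estimator, grid, metric and fold counts below are made up, and this is a sketch rather than the actual DSS implementation):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold, StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, random_state=0)
grid = {"max_depth": [3, 5, None]}  # made-up grid

# Step 1 - hyperparameter search: stratified K_hyperparam folds, then the best
# combination is refit on the entire dataset (refit=True); that refit model is
# the one kept for deployment.
k_hyperparam = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      grid, cv=k_hyperparam, scoring="roc_auc", refit=True)
search.fit(X, y)
deployed_model = search.best_estimator_

# Step 2 - test: a new, independent (non-stratified) K_test split scoring the
# fixed best hyperparameters; the reported performance is the average.
k_test = KFold(n_splits=5, shuffle=True, random_state=1)
test_scores = cross_val_score(
    RandomForestClassifier(random_state=0, **search.best_params_),
    X, y, cv=k_test, scoring="roc_auc")
print("best params:", search.best_params_)
print("reported performance:", test_scores.mean())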





Hence the number of model trainings is roughly the following:


  • For a given model type: (number of hyperparameter combinations in the grid) x K_hyperparam trainings for the search, plus 1 final retraining on the entire dataset, plus K_test trainings for the test.


  • For all model types selected: the sum of the above over each selected model type.
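
As a back-of-the-envelope check, here is a small hypothetical helper implementing the count above:

def trainings_per_model_type(n_combinations, k_hyperparam, k_test):
    # grid search + final retrain on the full dataset + test folds
    return n_combinations * k_hyperparam + 1 + k_test

# e.g. a 12-combination grid with 3 hyperparameter folds and 5 test folds
print(trainings_per_model_type(12, 3, 5))  # 12*3 + 1 + 5 = 42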





A few important clarifications:




  • The hyperparameter search and test are done independently, but rely on the same random folds across model types.

  • By design, this process is different from a "nested" strategy combining hyperparameter search and test sequentially (see https://sebastianraschka.com/faq/docs/evaluate-a-model.html - Scenario 3). We made that choice to avoid the following drawbacks of the nested strategy:


    • high computational cost: the number of models to train scales with K_hyperparam x K_test instead of K_hyperparam + K_test


    • smaller folds, which require a lot of labels and can fail for rare classes







Cheers,



Alex



 



 

UserBird
Dataiker
Author
Hi Alex, really interesting feedback. If I could rate it, 5 stars! 🙂 Now it's much clearer.
I don't remember seeing this nested-CV alternative described before. Is there published research on how it compares to 1) the biased, simple non-nested CV and 2) the unbiased nested CV? Somewhere in the middle? I understand the computational benefits, but surely there is a catch, right? 🙂
Thanks!
Alex_Combessie
Dataiker Alumni
Hi Rui,
Cross-validation is a big topic. There are many papers about it; one I personally found interesting is about the reusable holdout: http://www.cis.upenn.edu/~aaroth/reusable.html. It is a balance between statistical robustness, the amount of training data required, and computing time/resources. While Scenario 3 described by Sebastian Raschka is more statistically robust on paper, it requires a lot of labels, so it would fail with "small" training sets. In practice, it depends on the subject you are applying ML to, the amount of data and time you have, etc.
If you feel the need to deploy a custom cross-validation strategy, you can switch to Python in a Jupyter notebook to find the best hyperparameters, and then use our brand new ML API (https://doc.dataiku.com/dss/4.2//publicapi/client-python/ml.html) to deploy it as a DSS model.
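For illustration, here is a minimal sketch of a custom cross-validation strategy driving a grid search in scikit-learn (the data, splitter and grid are made up; the deployment step then goes through the ML API linked above):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# made-up time-ordered data
rng = np.random.RandomState(0)
X = rng.randn(300, 5)
y = (X[:, 0] + rng.randn(300) > 0).astype(int)

# any CV splitter (or iterable of (train, test) index pairs) can drive the search
custom_cv = TimeSeriesSplit(n_splits=4)
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      {"C": [0.1, 1.0, 10.0]}, cv=custom_cv, scoring="roc_auc")
search.fit(X, y)
print(search.best_params_)  # hyperparameters to carry over into the DSS model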
UserBird
Dataiker
Author
Thanks again Alex for the info. I know the reusable holdout, very interesting as well indeed (I think H2O Driverless AI uses it, or at least I remember someone mentioning it in a webinar).
Huge thanks again for the precious info. I will do some testing on the approach above; I had never heard of it before. A few initial tests seem to show the test score estimate is worse than the biased best-params score, as wanted, but not much different from a simple grid-search mean score (small churn dataset). I have to dig deeper.
Kind regards,
Rui
PS: the new ML API seems great! Reading it now!