Hi, when using nested cross-validation for hyperparameter search and performance estimation, it's not clear how the best hyperparameters are chosen.
With nested CV, we can get different "best parameters" for each outer fold (i.e., the hyperparameters are not stable), so how does Dataiku select and report the best one?
Using nested cross-validation you will train m different logistic regression models, 1 for each of the m outer folds, and the inner folds are used to optimize the hyperparameters of each model (e.g., using grid search in combination with k-fold cross-validation).
If your model is stable, these m models should all have the same hyperparameter values, and you report the average performance of this model based on the outer test folds. Then, you proceed with the next algorithm, e.g., an SVM etc.
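As a rough illustration of the nested scheme quoted above, here is a minimal sketch using scikit-learn rather than DSS. The synthetic dataset, the grid of C values, and the fold counts are all assumptions chosen for the example. It collects the winning hyperparameters from each outer fold so their stability can be checked.

```python
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_validate

# Synthetic data, used only to make the sketch runnable.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Inner loop: grid search with k-fold CV picks the hyperparameters.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # illustrative grid
    cv=inner_cv,
    scoring="roc_auc",
)

# Outer loop: each of the m = 5 outer folds runs its own inner search.
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
results = cross_validate(
    search, X, y, cv=outer_cv, scoring="roc_auc", return_estimator=True
)

# One winning value of C per outer fold; a stable model picks the same one each time.
best_per_fold = [est.best_params_["C"] for est in results["estimator"]]
mean_outer_score = results["test_score"].mean()
print(Counter(best_per_fold), round(mean_outer_score, 3))
```

If the counter shows several different values, the outer score still estimates the performance of the whole "search then fit" procedure, but there is no single set of hyperparameters to report.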
This is a good question and important topic indeed.
When using K-fold both for hyperparameter search and for testing in the DSS visual ML interface, what happens for one given model type follows these steps:
Step 1 (hyperparameter search): The dataset is split into K_hyperparam random parts, stratified with respect to the target. For each combination of hyperparameters in the grid, a model is trained K_hyperparam times, each time holding out a different fold for evaluation, to find the best combination. Finally, the model with the best combination is retrained on the entire dataset. This will be the model used for deployment purposes.
Example for 3 folds:
For each combination of hyperparameters:
  train the model on folds 1+2 then evaluate on fold 3,
  train the model on folds 1+3 then evaluate on fold 2,
  train the model on folds 2+3 then evaluate on fold 1.
Choose the combination of hyperparameters that maximizes the average of the chosen performance metric over all 3 folds.
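The selection loop in this example can be sketched in plain Python. This is only an illustration of the logic, not DSS code: score_fn is a hypothetical callback standing in for "train on these folds, evaluate on that one", and the toy scores are made up.

```python
from statistics import mean


def kfold_splits(k):
    """Yield (train_folds, eval_fold) pairs, holding out each fold once."""
    folds = list(range(1, k + 1))
    for held_out in reversed(folds):
        yield [f for f in folds if f != held_out], held_out


def select_best(grid, k, score_fn):
    """Return the combination with the highest mean score across the k folds.

    score_fn(params, train_folds, eval_fold) is a hypothetical callback that
    trains on train_folds and returns the metric measured on eval_fold.
    """
    averages = {}
    for params in grid:
        scores = [score_fn(params, train, held_out)
                  for train, held_out in kfold_splits(k)]
        averages[params] = mean(scores)
    return max(averages, key=averages.get), averages


# The 3-fold schedule from the example: ([1, 2], 3), ([1, 3], 2), ([2, 3], 1)
schedule = list(kfold_splits(3))


def toy_score(params, train_folds, eval_fold):
    # Made-up scores for illustration only; a real score_fn would train a model.
    return 1.0 - abs(params - 0.1) - 0.01 * eval_fold


best, averages = select_best([0.01, 0.1, 1.0], k=3, score_fn=toy_score)
print(best)
```

Each entry of the grid must be hashable here (a number or a tuple of values), since it is used as a dictionary key.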
Step 2 (test): The dataset is split again into K_test random parts, independently of the previous randomization, and with no stratification with respect to the target. The model with the best hyperparameter combination from Step 1 is trained and evaluated on the new test folds in the same way as in the example above. The reported performance metrics are averaged across folds.
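For readers who want to reproduce the two steps outside DSS, here is a minimal sketch with scikit-learn. It mirrors the description above but is not the DSS implementation; the synthetic dataset, the grid, the metric, and the seeds are assumptions made for the example.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, StratifiedKFold, cross_val_score

# Synthetic data, used only to make the sketch runnable.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
k_hyperparam, k_test = 3, 5

# Hyperparameter search: stratified folds, then refit on the entire dataset.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # illustrative grid
    cv=StratifiedKFold(n_splits=k_hyperparam, shuffle=True, random_state=0),
    scoring="roc_auc",
    refit=True,  # retrain the winning combination on all rows
)
search.fit(X, y)
deployed_model = search.best_estimator_

# Test: a fresh, unstratified split with an independent seed. The
# hyperparameters are now fixed, so no search happens inside these folds.
test_cv = KFold(n_splits=k_test, shuffle=True, random_state=42)
fixed_model = clone(deployed_model)  # same hyperparameters, unfitted
fold_scores = cross_val_score(fixed_model, X, y, cv=test_cv, scoring="roc_auc")
reported_score = fold_scores.mean()
print(search.best_params_, round(reported_score, 3))
```

Because the search runs once on the full dataset, there is exactly one set of best hyperparameters to report, which is the difference from a fully nested scheme.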
Hence, the formulas for the number of model trainings are the following:
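Counting the trainings implied by the two steps described above gives the sketch below. The name n_combos (the number of hyperparameter combinations in the grid) is introduced here for illustration and is not a DSS setting; the breakdown is derived only from the step descriptions in this answer.

```python
def training_counts(n_combos, k_hyperparam, k_test):
    """Count model trainings implied by the two steps described above."""
    search = n_combos * k_hyperparam  # every combination on every search fold
    final_refit = 1                   # best combination on the entire dataset
    test = k_test                     # best combination on every test fold
    return {
        "search": search,
        "final_refit": final_refit,
        "test": test,
        "total": search + final_refit + test,
    }


counts = training_counts(n_combos=10, k_hyperparam=3, k_test=5)
print(counts)  # total is 10 * 3 + 1 + 5 = 36
```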
A few important clarifications:
high computational cost: the number of models to train scales as K_hyperparam x K_test instead of K_hyperparam + K_test
smaller folds, which require a lot of labeled data and may fail for rare classes
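To put rough numbers on these two points, here is a small sketch. It assumes they describe a fully nested scheme, in which the whole hyperparameter search is repeated inside each of the K_test outer folds. The function names and n_combos are hypothetical, introduced only for this comparison.

```python
def sequential_trainings(n_combos, k_hyperparam, k_test):
    # Search once, refit once, then test the fixed hyperparameters.
    return n_combos * k_hyperparam + 1 + k_test


def nested_trainings(n_combos, k_hyperparam, k_test):
    # The whole search (plus its refit) is repeated inside every outer fold.
    return k_test * (n_combos * k_hyperparam + 1)


def inner_train_rows(n_rows, k_hyperparam, k_test):
    # Rows left for training once both an outer and an inner fold are held out.
    outer_train = n_rows * (k_test - 1) // k_test
    return outer_train * (k_hyperparam - 1) // k_hyperparam


seq = sequential_trainings(n_combos=10, k_hyperparam=3, k_test=5)
nest = nested_trainings(n_combos=10, k_hyperparam=3, k_test=5)
rows = inner_train_rows(n_rows=1000, k_hyperparam=3, k_test=5)
print(seq, nest, rows)
```

With 1000 rows, each inner model in the nested case sees roughly half of the data, which is where a rare class can end up with too few examples in a fold.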