Nested cross validation and chosen best parameter from grid search by algorithm?

Hi, when using nested cross-validation for hyperparameter search and performance estimation, it's not clear how the best hyperparameters are chosen.

In nested CV, we can actually get different "best parameters" for each outer fold (i.e., unstable hyperparameters), so how does Dataiku select and report the best one?

Thanks!

(from https://sebastianraschka.com/faq/docs/evaluate-a-model.html)

Using nested cross-validation you will train *m* different logistic regression models, 1 for each of the *m* outer folds, and the inner folds are used to optimize the hyperparameters of each model (e.g., using gridsearch in combination with k-fold cross-validation).

*If your model is stable, these m models should all have the same hyperparameter values, and you report the average performance of this model based on the outer test folds. Then, you proceed with the next algorithm, e.g., an SVM etc.*

1 Solution

Hello Rui,

This is a good question and important topic indeed.

When K-fold cross-validation is used both for hyperparameter search and for testing in the DSS visual ML interface, the following happens for a given model type:

Hyperparameter search: The dataset is split into K_hyperparam random folds, stratified on the target. For each combination of hyperparameters in the grid, a model is trained K_hyperparam times (once per held-out fold), and the combination with the best average score is selected. Finally, a model with that combination is retrained on the entire dataset.

**This will be the model used for deployment purposes.**

Example for 3 folds:

- For each combination of hyperparameters:
  - train the model on folds 1+2, then evaluate on fold 3;
  - train the model on folds 1+3, then evaluate on fold 2;
  - train the model on folds 2+3, then evaluate on fold 1.
- Choose the combination of hyperparameters that maximizes the average of the chosen performance metric over the 3 folds.
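The 3-fold example above can be sketched in plain Python. This is only an illustration using scikit-learn, not the DSS implementation: the toy dataset, the logistic regression model, the grid of `C` values, and the ROC AUC metric are all assumptions chosen for the example.

```python
from itertools import product

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Toy data and a hypothetical grid, for illustration only.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
grid = {"C": [0.01, 0.1, 1.0, 10.0]}

# Stratified folds; the fixed seed gives the same split for every combination.
folds = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

mean_scores = {}
for values in product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    fold_scores = []
    for train_idx, valid_idx in folds.split(X, y):
        # Train on two folds, evaluate on the held-out fold.
        model = LogisticRegression(max_iter=1000, **params)
        model.fit(X[train_idx], y[train_idx])
        proba = model.predict_proba(X[valid_idx])[:, 1]
        fold_scores.append(roc_auc_score(y[valid_idx], proba))
    mean_scores[values] = np.mean(fold_scores)

# Keep the combination with the best average score, then refit on all data.
best_values = max(mean_scores, key=mean_scores.get)
best_params = dict(zip(grid.keys(), best_values))
final_model = LogisticRegression(max_iter=1000, **best_params).fit(X, y)
```

Here `final_model` plays the role of the model retrained on the entire dataset.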

Test: The dataset is split again into K_test random folds, independently of the previous randomization and without stratification on the target. A model with the best hyperparameter combination from the hyperparameter search step is trained and evaluated on these new test folds in the same way as above. The reported performance metrics are averaged across folds.
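A rough sketch of this test step, again with scikit-learn rather than DSS itself. The value of `best_params` is assumed to come from the hyperparameter search; the dataset, the model, K_test = 5 and the ROC AUC metric are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
best_params = {"C": 1.0}  # assumed output of the hyperparameter search

# New random folds, independent of the search folds and not stratified.
test_folds = KFold(n_splits=5, shuffle=True, random_state=42)

# The hyperparameters are fixed; only the model fit is repeated per fold.
model = LogisticRegression(max_iter=1000, **best_params)
fold_scores = cross_val_score(model, X, y, cv=test_folds, scoring="roc_auc")

# The reported metric is the average over the test folds.
reported_score = fold_scores.mean()
```

Note that the hyperparameters are not re-tuned inside each test fold, which is exactly what distinguishes this from a nested strategy.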

Hence, the formulas for the number of model trainings are as follows:

- For a given model type:

- For all model types selected:
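Counting trainings directly from the two steps described above gives the tally below. This is a reading of the description, not an official DSS formula: the function names are hypothetical, and treating the final retraining on the entire dataset as one extra training is an assumption.

```python
def trainings_per_model_type(n_combinations, k_hyperparam, k_test):
    """Trainings for one model type, as read from the two steps above."""
    search = n_combinations * k_hyperparam  # one fit per combination and fold
    refit = 1                               # assumed: final fit on all data
    test = k_test                           # one fit per test fold
    return search + refit + test


def trainings_all_model_types(grid_sizes, k_hyperparam, k_test):
    """Total over model types, given the grid size of each model type."""
    return sum(
        trainings_per_model_type(n, k_hyperparam, k_test) for n in grid_sizes
    )
```

For example, a grid of 12 combinations with K_hyperparam = 3 and K_test = 5 gives 12 x 3 + 1 + 5 = 42 trainings under these assumptions.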

A few important clarifications:

- The hyperparameter search and the test are done independently, but rely on the same random folds across model types.
- By choice, this process is **different** from a "nested" strategy combining hyperparameter search and test sequentially (see https://sebastianraschka.com/faq/docs/evaluate-a-model.html - Scenario 3). We chose this by design to avoid the following drawbacks of the nested strategy:
  - high computational cost: number of models to train ~ K_hyperparam x K_test instead of K_hyperparam + K_test;
  - smaller folds, which require having a lot of labels and can fail for rare classes.

Cheers,

Alex

I don't remember seeing this alternative to nested CV described anywhere. Is there published research on how it compares to 1) biased, simple non-nested CV and 2) unbiased nested CV? Is it somewhere in the middle? I understand the computational benefits, but surely there is a catch, right? 🙂

Thanks!

Cross-validation is a big topic. There are many papers about it; one I personally found interesting is about the reusable holdout: http://www.cis.upenn.edu/~aaroth/reusable.html. Choosing a strategy is a balance between statistical robustness, the amount of training data required, and computing time/resources. While Scenario 3 described by Sebastian Raschka is more statistically robust on paper, it requires a lot of labels, so it would fail with "small" training sets. In practice, it depends on the subject you are applying ML to, the amount of data and time you have, etc.

If you feel the need to deploy a custom cross-validation strategy, you can switch to Python in a Jupyter notebook to find the best hyperparameters, and then use our brand new ML API (https://doc.dataiku.com/dss/4.2//publicapi/client-python/ml.html) to deploy it as a DSS model.
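For the notebook route, a nested cross-validation run could look roughly like the sketch below. It uses scikit-learn only and does not touch the DSS ML API; the dataset, model, grid and fold counts are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_validate

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)  # tuning
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)  # testing

# The inner grid search is re-run from scratch on each outer training set.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=inner,
    scoring="roc_auc",
)
result = cross_validate(
    search, X, y, cv=outer, scoring="roc_auc", return_estimator=True
)

outer_scores = result["test_score"]
# Each outer fold can select a different "best" combination.
best_per_fold = [est.best_params_ for est in result["estimator"]]
print(outer_scores.mean(), best_per_fold)
```

Inspecting `best_per_fold` shows directly whether the selected hyperparameters are stable across outer folds, which is the concern raised in the original question.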

Huge thanks again for the precious info. I will do some testing on the approach above; I had never heard of it before. A few initial tests (on a small churn dataset) seem to show a test score estimate that is worse than the biased best-params score, as wanted, but not much different from the mean score of a simple grid search. I have to dig deeper.

Kind regards,

Rui

PS: the new ML API seems great! Reading!