ML Grid Search behavior with specific Train / Test supplied as separate datasets

Solved!
Taylor
Level 3

Hey all!

I like to supply specific Train / Test datasets that are split by time to account for any data drift while training ML models. The specific data set I have in mind for this ML project spans 2013-2020, so the risk of data drift is significant.

The results that display in the chart during the grid search appear to be the metric computed on the data the model was trained on. I don't know this for certain, but I'm assuming it because those numbers are generally very high and close to 1.0 (I'm using AUC). When grid parameter searching is complete for a given model, the model is given a "top score" (with the trophy next to it) that is significantly lower, presumably because it is computed on the testing data.

This is very much expected, but I'm wondering how exactly this "top score" next to the trophy for that model type is selected. My guess is it's one of two possibilities:

  1. Whatever set of parameters had the best Training-data score is selected and then simply applied to the Test data set; the resulting score becomes the "top" trophy score for that model type
  2. Each set of parameters is scored against the Test data, and the top score/model is selected that way

I think option #1 has its merits, specifically for anyone who isn't setting aside an additional hold-out data set to use as the "official" score for the model. But I'm very much hoping that the way DSS behaves is option #2. If not, is there a way to make option #2 happen? Otherwise, I suspect I will need to run each model parameter set in its own session so I can retrieve the Test score.

I think the potential downside if DSS behaves like option #1 is that the most-overfit model will be selected every time. If I purposefully try to overfit a model with a crazy high "depth" parameter in RandomForest or XGBoost, that set of parameters seems to get picked almost 100% of the time, which looks suspicious to me.

Essentially, I want an out-of-sample (by "sample" I mean the Training data) score for every set of parameters, so that a balanced, not-overfit model is selected, roughly like the sketch below.
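
To make that concrete, here is a rough scikit-learn sketch of the selection behavior I'm hoping for. The data, parameter grid, and split here are made up just so the snippet runs end to end; in my real project the train/holdout split is by time (older years for training, newer years held out).

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; shuffle=False mimics a time-ordered split.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_holdout, y_train, y_holdout = train_test_split(X, y, test_size=0.3, shuffle=False)

param_grid = {"max_depth": [3, 6, 12, None], "n_estimators": [100, 300]}

results = []
for max_depth, n_estimators in product(param_grid["max_depth"], param_grid["n_estimators"]):
    model = RandomForestClassifier(max_depth=max_depth, n_estimators=n_estimators, random_state=0)
    model.fit(X_train, y_train)  # fit on the "older" slice only
    # Score on data the model never saw; a deliberately overfit deep tree should NOT win here.
    auc = roc_auc_score(y_holdout, model.predict_proba(X_holdout)[:, 1])
    results.append(((max_depth, n_estimators), auc))

# Pick the parameter set with the best *out-of-sample* AUC, not the training AUC.
best_params, best_auc = max(results, key=lambda r: r[1])
print(best_params, round(best_auc, 3))
```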

Thanks!

-Taylor

1 Solution
Clément_Stenac
Dataiker

Hi,

It's a combination of 1 and 2: the test data is held out until the end and is not used for optimizing hyperparameters, but DSS also never uses the train score for optimizing hyperparameters; it uses a cross-validation strategy instead.

What happens is that:

  • First, DSS sets aside the test set. If you have provided the test set as a separate dataset, DSS will not touch it at all during the hyperparameter optimization phase. Otherwise, DSS splits the test set off from the data and holds it out
  • Then on the train set, DSS performs hyperparameter optimization and final model training

For hyperparameter optimization, DSS sets up a cross-validation strategy (by default, K-fold with 3 folds). The important thing is that DSS never uses the train error to select the best hyperparameters; it always uses each fold's held-out validation set.

Once optimal hyperparameters have been found through cross-validation, the final model is trained on the entire train set.

Then the metrics are produced on the test set, which has been completely left out until now. By design, no information about performance on the test set ever feeds back into the optimization procedure, since that would bias the reported test-set performance.

So if you have separate train dataset TR and test dataset TS:

  • For each set of hyperparameters, DSS trains on 2/3 of TR and computes the score on the remaining 1/3 of TR, and does this 3 times with different folds. The average over the folds gives the score for this set of hyperparameters
  • DSS takes the best hyperparameters and uses them to train the final model on 100% of TR
  • Then, DSS scores TS and outputs the final performance (see the sketch after this list)
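
If it helps, here is a rough scikit-learn analogue of that flow. This is not DSS code, just an illustration of the same idea with synthetic data: hyperparameters are chosen by 3-fold cross-validation on TR only, the winner is refit on all of TR, and TS is scored exactly once at the end.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-ins for the separate train (TR) and test (TS) datasets.
X, y = make_classification(n_samples=3000, n_features=20, random_state=0)
X_tr, X_ts, y_tr, y_ts = train_test_split(X, y, test_size=0.3, shuffle=False)

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid={"max_depth": [3, 6, 12], "n_estimators": [100, 300]},
    scoring="roc_auc",
    cv=3,        # each candidate is scored on its held-out fold, never on its own training folds
    refit=True,  # the winning candidate is retrained on 100% of TR
)
search.fit(X_tr, y_tr)

# TS is only touched now, so it cannot influence which hyperparameters were chosen.
test_auc = roc_auc_score(y_ts, search.predict_proba(X_ts)[:, 1])
print(search.best_params_, round(test_auc, 3))
```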

We have more details on cross-validation strategies here: https://doc.dataiku.com/dss/latest/machine-learning/advanced-optimization.html#tuning-search-paramet...

An important note is that you can select time-ordering so that DSS always evaluates on "more recent" data than the data used for training.
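
DSS's time-ordering option is not literally scikit-learn's TimeSeriesSplit, but the idea is similar: every candidate is validated on data that comes after the data it was trained on. A small illustrative sketch:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 10 time-ordered rows, purely to show the fold shapes.
X = np.arange(10).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
for fold, (train_idx, val_idx) in enumerate(tscv.split(X)):
    # Training indices always come before (i.e. are older than) the validation indices.
    print(f"fold {fold}: train={train_idx.tolist()} validate={val_idx.tolist()}")
```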

