I like to supply specific Train / Test datasets that are split by time to account for any data drift while training ML models. The specific dataset I have in mind for this ML project spans 2013-2020, so the risk of data drift is significant.
The results displayed in the chart during grid search appear to be the metric computed on the data the model was trained on. I don't actually know this, but I'm assuming it because those numbers are generally very high, close to 1.0 (I'm using AUC). When the grid parameter search is complete for a given model, the model is given a "top score" (with the trophy next to it) that is significantly lower, presumably because it is associated with the testing data.
This is very much expected, but I'm wondering how exactly this "top score" next to the trophy for that model type is selected. My guess is it's one of two possibilities:

1. The top score is the best score found during the parameter search, i.e. still based on the data the model was trained on.
2. The top score is computed on the held-out Test data.
I think option #1 has its merits, specifically for anyone who isn't setting aside an additional hold-out dataset to use as the "official" score for the model. But I'm very much hoping that the way DSS behaves is option #2. If not, is there a way to make option #2 a thing? Otherwise, I suspect I will need to run each model parameter set in its own session so I can retrieve the Test score?
I think the potential downside if DSS behaves like option #1 is that the most-overfit model will be selected every time. If I purposely try to overfit a model with a crazy high "depth" parameter in RandomForest or XGBoost, that set of parameters seems to be picked almost 100% of the time, which seems suspicious to me.
Essentially, I want an out-of-sample (by "sample" I mean Training data) score for every set of parameters so that a balanced, not overfit model is selected.
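To make the concern concrete, here is a quick scikit-learn sketch (an illustration of the effect, not of DSS internals): a deliberately unconstrained random forest scores near-perfect AUC on its own training data, while its score on held-out data stays noticeably lower.

```python
# Illustration (not DSS internals): an unconstrained random forest
# nearly memorizes its training data, so train AUC approaches 1.0
# while the held-out AUC stays lower -- the gap described above.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data with deliberate label noise (flip_y) so overfitting hurts
X, y = make_classification(n_samples=2000, n_features=20, flip_y=0.2, random_state=0)
X_tr, X_ts, y_tr, y_ts = train_test_split(X, y, test_size=0.3, random_state=0)

# max_depth=None lets every tree grow until leaves are pure
model = RandomForestClassifier(max_depth=None, n_estimators=100, random_state=0)
model.fit(X_tr, y_tr)

train_auc = roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1])
test_auc = roc_auc_score(y_ts, model.predict_proba(X_ts)[:, 1])
print(f"train AUC {train_auc:.3f}  vs  test AUC {test_auc:.3f}")
```

If the trophy score were the train-side number, this memorizing model would win the comparison every time.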
It's a combination of 1 and 2: the test data is held out until the end and is not used for optimizing hyperparameters, but DSS also never uses the train score to optimize hyperparameters; it uses a cross-validation strategy instead.
What happens is that:

1. For hyperparameter optimization, DSS sets up a cross-validation strategy (by default, K-fold with K = 3). The important point is that DSS never uses the train error to select the best hyperparameters; it always uses the score on the validation fold.
2. Once optimal hyperparameters have been found through cross-validation, the final model is trained on the entire train set.
3. Then the metrics are produced on the test set, which has been completely left out until now. By design, no information about performance on the test set ever feeds back into the optimization procedure, since that would bias the reported test-set performance.
So if you have a separate train dataset TR and test dataset TS: hyperparameters are optimized by cross-validating within TR only, the final model is then retrained on all of TR with the winning hyperparameters, and the displayed metrics are computed on TS, which was never seen during optimization.
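The procedure above can be sketched with scikit-learn's `GridSearchCV` (an analogue for illustration, not DSS's actual implementation): hyperparameters are chosen by cross-validation inside TR, the winner is refit on all of TR, and TS is scored exactly once at the end.

```python
# scikit-learn analogue of the procedure: cross-validation inside the
# train set picks hyperparameters, the best model is refit on the whole
# train set, and the test set is only touched for the final metric.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=2000, flip_y=0.2, random_state=0)
X_tr, X_ts, y_tr, y_ts = train_test_split(X, y, test_size=0.3, random_state=0)  # TR / TS

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"max_depth": [2, 5, None]},  # includes a "crazy deep" option
    scoring="roc_auc",
    cv=3,        # 3-fold cross-validation within TR
    refit=True,  # retrain on all of TR with the winning hyperparameters
)
search.fit(X_tr, y_tr)  # TS is never seen here

# The only use of TS: one final, unbiased metric
test_auc = roc_auc_score(y_ts, search.predict_proba(X_ts)[:, 1])
print(search.best_params_, f"test AUC {test_auc:.3f}")
```

Because selection is driven by the cross-validation score rather than the train score, the unbounded-depth setting does not automatically win.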
We have more details on cross-validation strategies here: https://doc.dataiku.com/dss/latest/machine-learning/advanced-optimization.html#tuning-search-paramet...
An important note is that you can select time-ordering so that DSS always evaluates on "more recent" data than the data used for training.
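For intuition on what a time-ordered evaluation looks like, scikit-learn's `TimeSeriesSplit` is a reasonable analogue (an illustration, not the DSS mechanism itself): every validation fold is strictly more recent than the data it is validated against.

```python
# Sketch of time-ordered cross-validation: with rows sorted by time,
# each validation fold lies entirely after its training fold, so the
# model is always evaluated on "more recent" data than it was fit on.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

timestamps = np.arange(100)  # stand-in for rows already sorted by time
splits = list(TimeSeriesSplit(n_splits=3).split(timestamps))

for train_idx, valid_idx in splits:
    # every validation row comes after every training row
    assert train_idx.max() < valid_idx.min()
    print(f"train up to t={train_idx.max()}, "
          f"validate on t={valid_idx.min()}..{valid_idx.max()}")
```

This mirrors the drift concern from the original question: hyperparameters are judged on their ability to generalize forward in time, not just out of sample.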