When training a model with a visual recipe, does Dataiku fit the model on the entire dataset?

Solved!
tanguy
Context:

  1. I have deployed a model to the flow
  2. I want to retrain that model with its associated "train" recipe
  3. I understand that the model's performance is evaluated using a test set or K-folds under a cross-validation strategy

My question: after retraining the model with the "train" recipe, is the new active model fit on the entire dataset (as is sometimes recommended as best practice)?

I can't find any setting for this final fitting strategy in the recipe (see screenshot below), and I could not find it described in Dataiku's documentation either.

model_train_settings.jpg

 


Operating system used: Windows 10

 

2 Solutions
tanguy
Author

I checked this with the evaluate recipe by comparing the metrics on the train set and on the test set: the model built by the train recipe is indeed trained only on the train set, not on the entire dataset.
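The reasoning behind this check can be sketched outside Dataiku. The snippet below is only an illustration in plain Python, not Dataiku code: it uses a hypothetical 1-nearest-neighbour "model" that memorises its training rows, on synthetic data, so the effect is easy to see. A model fit on the train set alone scores perfectly on the train rows and worse on the test rows, while a model fit on every row scores perfectly on both. With a real algorithm the gap is usually smaller, but the comparison works the same way.

```python
import random


def fit_1nn(rows):
    """'Train' a 1-nearest-neighbour classifier by memorising (x, label) rows."""
    return list(rows)


def predict_1nn(model, x):
    # Return the label of the memorised point closest to x.
    return min(model, key=lambda row: abs(row[0] - x))[1]


def accuracy(model, rows):
    hits = sum(predict_1nn(model, x) == y for x, y in rows)
    return hits / len(rows)


rng = random.Random(0)

# Synthetic, noisy data (an assumption for the demo): the label is x > 0.5,
# flipped 20% of the time.
data = []
for _ in range(200):
    x = rng.random()
    y = int(x > 0.5)
    if rng.random() < 0.2:
        y = 1 - y
    data.append((x, y))

train, test = data[:160], data[160:]

model_train_only = fit_1nn(train)  # saw only the train rows
model_full = fit_1nn(data)         # saw every row

train_only_scores = (accuracy(model_train_only, train), accuracy(model_train_only, test))
full_scores = (accuracy(model_full, train), accuracy(model_full, test))

print("fit on train only (train acc, test acc):", train_only_scores)
print("fit on all rows   (train acc, test acc):", full_scores)
```

A clear gap between the two metrics, as in the first pair, is what suggests the test rows were never used for fitting.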

I tried to force Dataiku to train the model on the entire dataset, but there is no option to do so. The only workaround I found was to build a fake test set with just two samples: one with a positive target and one with a negative target, because Dataiku raises an error if the test set does not contain an observation for every target class in a classification task.

A feature that allows training a model on the entire dataset (without necessarily evaluating that model) would be highly appreciated.
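As a rough illustration of this workaround, the sketch below picks one row per class to serve as the minimal test set. It is plain Python over a list of dicts, not Dataiku code; the function name `one_row_per_class`, the `churn` column and the sample rows are all hypothetical. The post does not say whether the two rows also stay in the training data, so keeping them there is an assumption made here so that training sees every row.

```python
def one_row_per_class(rows, target):
    """Return the first row encountered for each distinct target value."""
    picked = {}
    for row in rows:
        # setdefault keeps the first row seen for a class and ignores later ones.
        picked.setdefault(row[target], row)
    return list(picked.values())


# Hypothetical binary classification data.
rows = [
    {"id": 1, "churn": 0},
    {"id": 2, "churn": 1},
    {"id": 3, "churn": 0},
    {"id": 4, "churn": 1},
    {"id": 5, "churn": 0},
]

test_rows = one_row_per_class(rows, "churn")  # minimal "fake" test set
train_rows = rows                             # assumption: train on every row

print("test ids:", [r["id"] for r in test_rows])
print("train size:", len(train_rows))
```

Metrics computed on such a tiny test set are not meaningful; its only purpose is to satisfy the requirement that every class appears in the test data.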


TsuyoshiK
Dataiker

Just FYI, in the latest version (Version 12), you can choose the "Train on 100% and split for performance" setting in the train recipe. With that setting, all of the training data is used for training.
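What the name of this setting suggests can be sketched in a few lines of plain Python. This is a conceptual illustration only, not Dataiku's implementation: the data is synthetic, and `fit_line` and `mse` are hypothetical helpers written for the example. A held-out split is used solely to estimate performance, and the model that is kept is then refit on every row.

```python
import random


def fit_line(points):
    """Ordinary least squares for y = a*x + b on (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = sxy / sxx
    return a, mean_y - a * mean_x


def mse(model, points):
    a, b = model
    return sum((a * x + b - y) ** 2 for x, y in points) / len(points)


rng = random.Random(42)

# Synthetic regression data (an assumption for the demo): y = 3x + 1 plus noise.
data = []
for _ in range(100):
    x = rng.uniform(0, 10)
    data.append((x, 3.0 * x + 1.0 + rng.gauss(0, 0.5)))

train, test = data[:80], data[80:]

# Step 1: "split for performance". The reported metric comes from a model
# that never saw the test rows.
eval_model = fit_line(train)
reported_mse = mse(eval_model, test)

# Step 2: "train on 100%". The model that is kept is refit on all rows.
final_model = fit_line(data)

print("reported test MSE:", reported_mse)
print("final model (slope, intercept):", final_model)
```

The reported metric therefore describes a model trained on slightly less data than the one that is kept, which is the usual trade-off of a final refit.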

Monosnap train_Prediction_RANDOM_FOREST_REGRESSION - Recipe _ Dataiku 2024-04-09 16-41-35.png

