Retrain model on full data after cross validation

Cosmin · March 2023

Hi all,

I am learning Dataiku and have a question about how it handles model training.

The context of my question is as follows:

- I split the data into train and test, in the flow. I leave the test set aside to make the final evaluation of the champion model.

- I am using the train set inside the visual analysis to train and select the best model. Essentially, this train set is further split into train and validation, using 5-fold cross validation inside the visual analysis.

- I am selecting and deploying the best model from the visual analysis, which then appears in the flow as a train recipe + the model output.

- At this stage, I would like to retrain this best model on the full train set, to take advantage of the data that was held out as the validation set during model selection in the visual analysis.

- I did not see any clear option for the step above (retraining on the full data). To me, this should be the default behavior when re-running the train recipe in the flow. Instead, when I run the train recipe in the flow, it still uses the 80%/20% split, probably inherited from the visual analysis.
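
To make the workflow I have in mind concrete, here is a minimal sketch in plain scikit-learn. This is not Dataiku code and does not use any Dataiku API; the synthetic dataset, the two candidate models, and all variable names are placeholders I picked for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Placeholder data standing in for my real dataset (assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Step 1: split into train and test; the test set is set aside until the end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Step 2: compare candidate models with 5-fold cross validation on the train set only.
candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(n_estimators=50, random_state=0),
}
cv_scores = {
    name: cross_val_score(model, X_train, y_train, cv=5).mean()
    for name, model in candidates.items()
}

# Step 3: select the model with the best mean cross-validation score.
best_name = max(cv_scores, key=cv_scores.get)

# Step 4: retrain the selected model on the FULL train set (no validation hold-out).
# cross_val_score works on clones, so the original estimator is still unfitted here.
final_model = candidates[best_name].fit(X_train, y_train)

# Step 5: evaluate once on the untouched test set.
test_accuracy = final_model.score(X_test, y_test)
print(best_name, round(test_accuracy, 3))
```

Step 4 is the one I cannot find an obvious equivalent for when running the train recipe in the flow.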

My questions:

1. What is the purpose of still using that 80/20 split if the model has already been selected and the goal is to retrain it on the entire training data?

2. How can I retrain the selected model on the full training dataset that was used inside the visual analysis?

I look forward to understanding a bit better how Dataiku works.

Thank you in advance for any of your feedback!
