When training a model with a visual recipe, does Dataiku fit the model on the entire dataset?

Tanguy

Context:

  1. I have deployed a model to the flow
  2. I want to retrain that model with its associated "train" recipe
  3. I understand that the model's performance is evaluated using a test set or K-folds under a cross-validation strategy

My question: after retraining the model using the "train" recipe, is the resulting new active model fit on the entire dataset (as best practice sometimes suggests)?
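To make the question concrete, here is a minimal sketch of the "evaluate on a split, then refit on everything" pattern in plain Python. It is not Dataiku code: the toy least-squares model, the 80/20 split and the names `fit_line`, `mse`, `eval_model` and `final_model` are all assumptions made for illustration.

```python
import random

def fit_line(rows):
    """Ordinary least squares for y = a*x + b on a list of (x, y) pairs."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    a = sxy / sxx
    return a, mean_y - a * mean_x

def mse(model, rows):
    """Mean squared error of the fitted line on the given rows."""
    a, b = model
    return sum((a * x + b - y) ** 2 for x, y in rows) / len(rows)

# Synthetic data (hypothetical): y = 2x + 1 plus Gaussian noise.
rng = random.Random(0)
data = []
for _ in range(200):
    x = rng.uniform(0, 10)
    data.append((x, 2.0 * x + 1.0 + rng.gauss(0, 0.5)))

# Step 1: hold out 20% of the rows to estimate performance.
rng.shuffle(data)
split = int(0.8 * len(data))
train, test = data[:split], data[split:]
eval_model = fit_line(train)
test_mse = mse(eval_model, test)

# Step 2: refit on 100% of the rows; this is the model that would be deployed.
final_model = fit_line(data)

print(f"held-out MSE: {test_mse:.3f}")
print(f"final model fit on {len(data)} rows: slope={final_model[0]:.3f}, intercept={final_model[1]:.3f}")
```

The held-out metric describes `eval_model`, not `final_model`; it is used as an estimate for the final model on the assumption that adding the remaining rows does not make it worse.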

I can't find any information on this final fitting strategy in the recipe (see screenshot below), and I could not find it in Dataiku's documentation either.

model_train_settings.jpg


Operating system used: Windows 10

Best Answers

  • Tanguy
    Answer ✓

    I checked this with the Evaluate recipe by comparing the metrics on both the train set and the test set: the model built by the train recipe is indeed trained only on the train set (and not on the entire dataset).
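One way to read such a check: a model usually scores better on rows it was fit on than on rows it has never seen, so a clear train/test gap suggests the test rows were not used for fitting. The toy sketch below illustrates that idea with a 1-nearest-neighbour "memorizer" in plain Python. The model, the noisy synthetic data and the names `predict_1nn` and `accuracy` are assumptions, not anything Dataiku runs.

```python
import random

def predict_1nn(train_rows, x):
    """Return the label of the training row whose feature is closest to x."""
    return min(train_rows, key=lambda row: abs(row[0] - x))[1]

def accuracy(train_rows, eval_rows):
    """Share of eval_rows whose label is predicted correctly by 1-NN."""
    hits = sum(predict_1nn(train_rows, x) == y for x, y in eval_rows)
    return hits / len(eval_rows)

# Synthetic binary data (hypothetical): label depends on x, with 30% label noise
# so that unseen rows cannot be predicted perfectly.
rng = random.Random(0)
rows = []
for _ in range(300):
    x = rng.random()
    label = int(x > 0.5)
    if rng.random() < 0.3:
        label = 1 - label
    rows.append((x, label))

train, test = rows[:200], rows[200:]

# Model fit on the train split only: perfect on train rows, worse on test rows.
gap_train_only = accuracy(train, train) - accuracy(train, test)

# Model fit on all rows: the "test" rows were memorized too, so the gap vanishes.
gap_all_rows = accuracy(rows, train) - accuracy(rows, test)

print(f"gap when fit on train only: {gap_train_only:.2f}")
print(f"gap when fit on all rows:   {gap_all_rows:.2f}")
```

A real model will not memorize as aggressively as 1-NN, so the gap is smaller in practice, but the direction of the comparison is the same.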

    I tried to force Dataiku to train the model on the entire dataset, but there is no option to do so. The only workaround I found was to build a fake test set with just two samples: one with a positive target and one with a negative target, because Dataiku raises an error if the test set does not contain at least one observation for every target class in a classification task.
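As a rough sketch of that workaround outside of Dataiku, the helper below holds out the first row seen for each target class and keeps everything else for training. The function name `split_minimal_test`, the dict-based row format and the sample values are hypothetical; in a real project the two resulting datasets would have to be produced in the flow and used as explicit train and test inputs.

```python
def split_minimal_test(rows, target):
    """Hold out the first row seen for each target class; train on the rest."""
    seen = set()
    train, test = [], []
    for row in rows:
        label = row[target]
        if label not in seen:
            seen.add(label)
            test.append(row)   # one representative per class
        else:
            train.append(row)
    return train, test

# Hypothetical binary classification rows.
rows = [
    {"age": 31, "churn": 0},
    {"age": 45, "churn": 1},
    {"age": 27, "churn": 0},
    {"age": 52, "churn": 1},
    {"age": 38, "churn": 0},
]

train, test = split_minimal_test(rows, "churn")
print(len(train), "training rows,", len(test), "placeholder test rows")
```

Metrics computed on such a tiny test set carry no information, and the held-out rows are still excluded from training, so this only approximates a fit on the entire dataset.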

    A feature that allows training a model on the entire dataset (without necessarily evaluating that model) would be highly appreciated.

  • Tsuyoshi (Dataiker)
    Answer ✓

    Just FYI, with the latest version (Version 12), we can choose the "Train on 100% and split for performance" setting in the train recipe. With that setting, all of the data is used to train the model.

    Monosnap train_Prediction_RANDOM_FOREST_REGRESSION - Recipe _ Dataiku 2024-04-09 16-41-35.png
